How to Use a DataLoader in PyTorch
When training deep learning models, loading an entire dataset into memory at once is often impractical - datasets can be gigabytes in size, and processing them sequentially is slow. PyTorch's DataLoader solves both problems by automatically batching, shuffling, and parallelizing the data loading process.
This guide explains how to create custom datasets, configure DataLoaders, and use them effectively in training loops.
What a DataLoader Does
A DataLoader wraps a dataset and provides an iterable that yields batches of data. Instead of manually slicing your data into batches and shuffling between epochs, the DataLoader handles this automatically:
Full Dataset (10,000 samples)
↓ DataLoader(batch_size=32, shuffle=True)
↓
Epoch 1: [Batch 1: 32 samples] → [Batch 2: 32 samples] → ... → [Batch 313: 16 samples]
Epoch 2: [Batch 1: 32 samples (different order)] → ...
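The batch count in the diagram follows directly from ceiling division: with `drop_last` off (the default), the final short batch still counts. A quick sanity check in plain Python:

```python
import math

num_samples = 10_000
batch_size = 32

# Without drop_last, the final partial batch still counts, so round up
num_batches = math.ceil(num_samples / batch_size)
# Samples left over for the final batch after 312 full batches
last_batch = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch)  # 313 16
```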
DataLoader Syntax
from torch.utils.data import DataLoader
DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    drop_last=False,
    pin_memory=False
)
| Parameter | Description | Default |
|---|---|---|
| dataset | The dataset to load (required) | Required |
| batch_size | Number of samples per batch | 1 |
| shuffle | Whether to randomize order each epoch | False |
| num_workers | Number of subprocesses for parallel loading | 0 (main process) |
| drop_last | Drop the last incomplete batch if the dataset isn't evenly divisible | False |
| pin_memory | Copy tensors to CUDA pinned memory for faster GPU transfer | False |
Creating a Custom Dataset
To use a DataLoader, you first need a dataset. Custom datasets extend torch.utils.data.Dataset and must implement two methods:
__len__() - returns the total number of samples
__getitem__(index) - returns a single sample at the given index
import torch
from torch.utils.data import Dataset, DataLoader
class NumberDataset(Dataset):
    """A simple dataset containing numbers 0 to 99."""

    def __init__(self):
        self.data = list(range(100))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]
dataset = NumberDataset()
print(f"Dataset size: {len(dataset)}")
print(f"Sample at index 5: {dataset[5]}")
Output:
Dataset size: 100
Sample at index 5: 5
Using the DataLoader
Wrap the dataset in a DataLoader and iterate over it to get batches:
import torch
from torch.utils.data import Dataset, DataLoader
class NumberDataset(Dataset):
    def __init__(self):
        self.data = list(range(100))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]
dataset = NumberDataset()
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)
# Print the first 3 batches
for i, batch in enumerate(dataloader):
    if i >= 3:
        break
    print(f"Batch {i}: {batch}")
print(f"\nTotal batches: {len(dataloader)}")
Output (varies due to shuffling):
Batch 0: tensor([56, 84, 42, 4, 66, 27, 99, 18, 20, 89])
Batch 1: tensor([ 7, 30, 74, 57, 10, 6, 28, 77, 0, 50])
Batch 2: tensor([32, 22, 73, 97, 26, 98, 85, 17, 8, 16])
Total batches: 10
The DataLoader automatically divides the 100 samples into 10 batches of 10, shuffled randomly.
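For contrast, with `shuffle=False` the batches arrive in dataset order, identically in every epoch. A minimal sketch reusing the same dataset class:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NumberDataset(Dataset):
    def __init__(self):
        self.data = list(range(100))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

# shuffle=False: deterministic, sequential batches
loader = DataLoader(NumberDataset(), batch_size=10, shuffle=False)
first_batch = next(iter(loader))
print(first_batch)  # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Note that the default collate function converted the plain Python ints into a single tensor per batch.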
Dataset with Features and Labels
Most real-world datasets have both input features and target labels. Return them as a tuple from __getitem__():
import torch
from torch.utils.data import Dataset, DataLoader
class RegressionDataset(Dataset):
    """Simple dataset with input features and target values."""

    def __init__(self, num_samples=200):
        self.X = torch.randn(num_samples, 3)  # 3 features
        self.y = self.X.sum(dim=1) + torch.randn(num_samples) * 0.1  # Target with noise

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        return self.X[index], self.y[index]
dataset = RegressionDataset(200)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Get one batch
features, targets = next(iter(dataloader))
print(f"Features shape: {features.shape}")
print(f"Targets shape: {targets.shape}")
Output:
Features shape: torch.Size([32, 3])
Targets shape: torch.Size([32])
Each batch contains 32 samples with 3 features each, along with their corresponding target values.
Using DataLoader with Built-in Datasets
PyTorch and related libraries provide many ready-to-use datasets. Here's how to use a DataLoader with TensorDataset:
import torch
from torch.utils.data import DataLoader, TensorDataset
# Create tensors from data
features = torch.randn(150, 4) # 150 samples, 4 features
labels = torch.randint(0, 3, (150,)) # 3 classes
# Wrap in TensorDataset
dataset = TensorDataset(features, labels)
# Create DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
for batch_features, batch_labels in dataloader:
    print(f"Features: {batch_features.shape}, Labels: {batch_labels.shape}")
    break  # Just show the first batch
Output:
Features: torch.Size([16, 4]), Labels: torch.Size([16])
TensorDataset is a convenient wrapper when your data is already in tensor form. It automatically pairs corresponding elements from multiple tensors when batching.
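TensorDataset accepts any number of tensors as long as their first dimensions match, and indexing returns a tuple with one element per tensor. A small sketch (the per-sample `weights` tensor is an illustrative addition, not part of the example above):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(6, 4)
labels = torch.randint(0, 3, (6,))
weights = torch.rand(6)  # hypothetical per-sample weights

dataset = TensorDataset(features, labels, weights)
x, y, w = dataset[0]  # one tuple element per wrapped tensor
print(x.shape, y.shape, w.shape)

# Batches unpack the same way, with a leading batch dimension
bx, by, bw = next(iter(DataLoader(dataset, batch_size=3)))
print(bx.shape, by.shape, bw.shape)
```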
Using DataLoader in a Training Loop
Here's how DataLoaders are typically used in a model training loop:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
# Create a simple dataset
X = torch.randn(500, 10)
y = torch.randint(0, 2, (500,)).float()
dataset = TensorDataset(X, y)
# Split into train and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# Simple model
model = nn.Linear(10, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Training loop
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        predictions = model(batch_X).squeeze()
        loss = criterion(predictions, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}, Average Loss: {avg_loss:.4f}")
Output:
Epoch 1, Average Loss: 0.7414
Epoch 2, Average Loss: 0.7155
Epoch 3, Average Loss: 0.7027
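The split above also produced a val_loader that the training loop never touched. A matching evaluation pass, typically run after each epoch, switches the model to eval mode and disables gradient tracking. This sketch repeats the setup so it runs standalone; the model here is untrained, so the printed loss just reflects random initialization:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Same setup as the training example
X = torch.randn(500, 10)
y = torch.randint(0, 2, (500,)).float()
dataset = TensorDataset(X, y)
train_size = int(0.8 * len(dataset))
train_dataset, val_dataset = torch.utils.data.random_split(
    dataset, [train_size, len(dataset) - train_size])
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

model = nn.Linear(10, 1)
criterion = nn.BCEWithLogitsLoss()

# Validation pass: eval mode, no gradient tracking
model.eval()
val_loss = 0.0
with torch.no_grad():
    for batch_X, batch_y in val_loader:
        predictions = model(batch_X).squeeze()
        val_loss += criterion(predictions, batch_y).item()
print(f"Validation Loss: {val_loss / len(val_loader):.4f}")
```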
Key DataLoader Configuration Options
Shuffling
Always shuffle training data to prevent the model from learning the order of samples:
# Training: shuffle to randomize order each epoch
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Validation/Testing: no need to shuffle
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
Parallel Data Loading with num_workers
Speed up data loading by using multiple subprocesses:
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
On Windows and macOS, the default multiprocessing start method is spawn, so multi-worker DataLoaders must be created inside an if __name__ == '__main__': block to avoid spawning errors, and the dataset must be picklable so it can be sent to the worker processes.
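A minimal sketch of the guard pattern, so spawned workers can safely re-import the script (the `num_workers` parameter on `main` is just for convenience, not a DataLoader requirement):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main(num_workers=2):
    dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
    # Created inside main(), which only runs under the __main__ guard,
    # so spawned worker processes re-importing this script are safe
    loader = DataLoader(dataset, batch_size=16, num_workers=num_workers)
    return sum(1 for _ in loader)  # iterate all batches

if __name__ == '__main__':
    print(f"Batches: {main()}")
```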
Dropping the Last Incomplete Batch
If your dataset size isn't evenly divisible by the batch size, the last batch will be smaller. Use drop_last=True to discard it:
import torch
from torch.utils.data import Dataset, DataLoader
class NumberDataset(Dataset):
    def __init__(self):
        self.data = list(range(100))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]
dataset = NumberDataset() # 100 samples
dataloader = DataLoader(dataset, batch_size=32, drop_last=True)
print(f"Batches: {len(dataloader)}") # 3 batches of 32, last 4 samples dropped
Output:
Batches: 3
GPU Memory Optimization with pin_memory
When training on GPU, enable pin_memory for faster host-to-device transfers:
dataloader = DataLoader(dataset, batch_size=32, pin_memory=True)
for batch_X, batch_y in dataloader:
    batch_X = batch_X.to('cuda', non_blocking=True)
    batch_y = batch_y.to('cuda', non_blocking=True)
Common Mistake: Forgetting to Set shuffle=True for Training
Training without shuffling can cause the model to learn patterns from the data ordering rather than the data itself:
# WRONG: no shuffling during training, model may learn order patterns
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)
# CORRECT: always shuffle training data
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
Always set shuffle=True for training DataLoaders. For validation and test DataLoaders, shuffling is unnecessary since you're only evaluating, not learning from the data order.
Quick Reference
| Configuration | Training | Validation/Testing |
|---|---|---|
| shuffle | True | False |
| batch_size | 16–128 (experiment) | Same or larger |
| num_workers | 2–8 (depends on CPU) | Same as training |
| drop_last | Optional (True for BatchNorm stability) | False |
| pin_memory | True (if using GPU) | True (if using GPU) |
The DataLoader is the backbone of efficient data handling in PyTorch. By configuring batch size, shuffling, and parallel workers, you can significantly speed up training while keeping memory usage under control. Combined with a well-structured Dataset class, it provides a clean and scalable pipeline for feeding data to your models.