How to Load NumPy Data in TensorFlow Using Python
When building machine learning models with TensorFlow, your training data often starts as NumPy arrays, whether loaded from files, generated programmatically, or preprocessed with libraries like Pandas or scikit-learn. TensorFlow provides seamless integration with NumPy through its tf.data.Dataset API, allowing you to convert NumPy arrays into efficient, iterable dataset pipelines ready for model training.
In this guide, you will learn how to load NumPy data into TensorFlow using tf.data.Dataset.from_tensor_slices(), work with different array shapes, pair features with labels, and apply common dataset operations like batching and shuffling.
Using tf.data.Dataset.from_tensor_slices()
The primary method for loading NumPy data into TensorFlow is tf.data.Dataset.from_tensor_slices(). This function takes a NumPy array (or a tuple/dictionary of arrays) and creates a Dataset object where each element corresponds to a slice along the first dimension of the input.
Syntax
tf.data.Dataset.from_tensor_slices(tensors)
- tensors: A NumPy array, a Python list, a TensorFlow tensor, or a tuple/dictionary of these types.
- Returns: A tf.data.Dataset object that yields individual slices.
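For intuition, a minimal sketch with a 1D array: each slice along the first (and only) axis is a single scalar element.

```python
import tensorflow as tf
import numpy as np

# A 1D array yields one scalar element per value
arr = np.array([10, 20, 30])
dataset = tf.data.Dataset.from_tensor_slices(arr)

for element in dataset:
    print(element.numpy())
# Prints 10, 20, 30 on separate lines
```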
Loading a 2D NumPy Array
Each row of the array becomes a separate element in the dataset:
import tensorflow as tf
import numpy as np
# Create a 2D NumPy array
arr = np.array([
[1, 2, 3, 4],
[4, 5, 6, 0],
[2, 0, 7, 8],
[3, 7, 4, 2]
])
# Load into a TensorFlow Dataset
dataset = tf.data.Dataset.from_tensor_slices(arr)
# Iterate and print each element
for element in dataset:
print(element.numpy())
Output:
[1 2 3 4]
[4 5 6 0]
[2 0 7 8]
[3 7 4 2]
The 4×4 array is sliced along the first axis (rows), producing 4 dataset elements, each a 1D array of length 4.
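Slicing always happens along the first axis, whatever the rank of the array. A sketch with a hypothetical stack of 28×28 "images" shows that each element keeps the remaining dimensions:

```python
import tensorflow as tf
import numpy as np

# A batch of 10 "images", each 28x28; slicing yields one 28x28 element per image
images = np.zeros((10, 28, 28), dtype=np.float32)
dataset = tf.data.Dataset.from_tensor_slices(images)

# element_spec describes the shape and dtype of each yielded element
print(dataset.element_spec)  # shape (28, 28), dtype float32
```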
Loading a Python List
from_tensor_slices() also works directly with Python lists, which are internally converted to tensors:
import tensorflow as tf
data = [[5, 10], [3, 6], [1, 2], [5, 0]]
dataset = tf.data.Dataset.from_tensor_slices(data)
for element in dataset:
print(element.numpy())
Output:
[ 5 10]
[3 6]
[1 2]
[5 0]
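One detail worth knowing: when converting Python lists, TensorFlow infers the dtype itself, so plain Python integers become int32 tensors (NumPy arrays, by contrast, keep their own dtype, typically int64 for integers). You can check what a dataset yields via its element_spec:

```python
import tensorflow as tf

data = [[5, 10], [3, 6], [1, 2], [5, 0]]
dataset = tf.data.Dataset.from_tensor_slices(data)

# Python ints are converted to int32 tensors of shape (2,)
print(dataset.element_spec)
```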
Loading Features and Labels Together
In machine learning, you typically have a features array and a corresponding labels array. Pass them as a tuple to create a paired dataset:
import tensorflow as tf
import numpy as np
# Feature data: 5 samples, 3 features each
features = np.array([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[10.0, 11.0, 12.0],
[13.0, 14.0, 15.0]
])
# Labels: one per sample
labels = np.array([0, 1, 0, 1, 1])
# Create a paired dataset
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
for feature, label in dataset:
print(f"Features: {feature.numpy()}, Label: {label.numpy()}")
Output:
Features: [1. 2. 3.], Label: 0
Features: [4. 5. 6.], Label: 1
Features: [7. 8. 9.], Label: 0
Features: [10. 11. 12.], Label: 1
Features: [13. 14. 15.], Label: 1
Each iteration yields a (feature_vector, label) pair: exactly the format that model.fit() expects.
Loading Data as a Dictionary
You can also pass a dictionary of arrays. This is useful when your model expects named inputs:
import tensorflow as tf
import numpy as np
data = {
"temperature": np.array([22.5, 25.0, 19.8, 30.2]),
"humidity": np.array([45, 60, 80, 35]),
"label": np.array([0, 1, 1, 0])
}
dataset = tf.data.Dataset.from_tensor_slices(data)
for sample in dataset:
print({key: val.numpy() for key, val in sample.items()})
Output:
{'temperature': 22.5, 'humidity': 45, 'label': 0}
{'temperature': 25.0, 'humidity': 60, 'label': 1}
{'temperature': 19.8, 'humidity': 80, 'label': 1}
{'temperature': 30.2, 'humidity': 35, 'label': 0}
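Keras models generally expect (features, label) tuples rather than a flat dictionary, so a dictionary dataset is often reshaped with .map() before training. A sketch, assuming the temperature/humidity data above:

```python
import tensorflow as tf
import numpy as np

data = {
    "temperature": np.array([22.5, 25.0, 19.8, 30.2], dtype=np.float32),
    "humidity": np.array([45, 60, 80, 35], dtype=np.float32),
    "label": np.array([0, 1, 1, 0]),
}
dataset = tf.data.Dataset.from_tensor_slices(data)

# Stack the named inputs into one feature vector and split off the label
def to_pair(sample):
    features = tf.stack([sample["temperature"], sample["humidity"]])
    return features, sample["label"]

dataset = dataset.map(to_pair)
for features, label in dataset.take(1):
    print(features.numpy(), label.numpy())  # [22.5 45. ] 0
```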
Applying Dataset Operations
Once your data is in a tf.data.Dataset, you can chain operations to prepare it for training.
Shuffling, Batching, and Prefetching
import tensorflow as tf
import numpy as np
features = np.random.randn(100, 4).astype(np.float32)
labels = np.random.randint(0, 2, size=100).astype(np.int32)
# Create dataset pipeline
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle, batch, and prefetch for optimal training performance
dataset = dataset.shuffle(buffer_size=100) # Randomize order
dataset = dataset.batch(16) # Group into batches of 16
dataset = dataset.prefetch(tf.data.AUTOTUNE) # Overlap data loading with training
# Inspect one batch
for batch_features, batch_labels in dataset.take(1):
print(f"Batch features shape: {batch_features.shape}")
print(f"Batch labels shape: {batch_labels.shape}")
Output:
Batch features shape: (16, 4)
Batch labels shape: (16,)
- Shuffle before batching to ensure each batch has a diverse mix of samples.
- Batch to group samples for efficient GPU utilization.
- Prefetch with tf.data.AUTOTUNE to overlap data preprocessing with model training, reducing idle time.
# ✅ Recommended pipeline order
dataset = (
tf.data.Dataset.from_tensor_slices((features, labels))
.shuffle(buffer_size=len(features))
.batch(32)
.prefetch(tf.data.AUTOTUNE)
)
Applying Transformations with .map()
Use .map() to apply preprocessing functions to each element:
import tensorflow as tf
import numpy as np
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = np.array([0, 1, 0])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Normalize features by dividing by the maximum value
def normalize(feature, label):
return feature / 6.0, label
dataset = dataset.map(normalize)
for feature, label in dataset:
print(f"Normalized: {feature.numpy()}, Label: {label.numpy()}")
Output:
Normalized: [0.16666667 0.33333334], Label: 0
Normalized: [0.5 0.6666667 ], Label: 1
Normalized: [0.8333333 1. ], Label: 0
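For heavier preprocessing, .map() also accepts a num_parallel_calls argument; passing tf.data.AUTOTUNE lets TensorFlow run the function on several elements concurrently. A sketch using the same normalization:

```python
import tensorflow as tf
import numpy as np

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)
labels = np.array([0, 1, 0])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Apply the transformation to multiple elements in parallel
dataset = dataset.map(
    lambda feature, label: (feature / 6.0, label),
    num_parallel_calls=tf.data.AUTOTUNE,
)
```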
Using the Dataset for Model Training
Here is a complete example showing how to load NumPy data into TensorFlow and train a simple model:
import tensorflow as tf
import numpy as np
# Generate sample data
np.random.seed(42)
X_train = np.random.randn(1000, 10).astype(np.float32)
y_train = np.random.randint(0, 2, size=1000).astype(np.int32)
# Create dataset pipeline
train_dataset = (
tf.data.Dataset.from_tensor_slices((X_train, y_train))
.shuffle(1000)
.batch(32)
.prefetch(tf.data.AUTOTUNE)
)
# Build a simple model
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train using the dataset
model.fit(train_dataset, epochs=3)
Output:
Epoch 1/3
32/32 [==============================] - 1s 2ms/step - loss: 0.7012 - accuracy: 0.5010
Epoch 2/3
32/32 [==============================] - 0s 2ms/step - loss: 0.6920 - accuracy: 0.5250
Epoch 3/3
32/32 [==============================] - 0s 2ms/step - loss: 0.6889 - accuracy: 0.5370
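Evaluation follows the same pipeline pattern. A sketch with hypothetical held-out data (shuffling is unnecessary for evaluation, though batching still speeds things up):

```python
import tensorflow as tf
import numpy as np

# Hypothetical held-out split, built like the training pipeline but unshuffled
X_test = np.random.randn(200, 10).astype(np.float32)
y_test = np.random.randint(0, 2, size=200).astype(np.int32)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)

# A small stand-in model with the same input shape as the training example
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# evaluate() accepts a batched tf.data.Dataset directly
loss, accuracy = model.evaluate(test_dataset, verbose=0)
print(f"Test loss: {loss:.4f}, accuracy: {accuracy:.4f}")
```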
Common Mistakes and How to Avoid Them
Mistake 1: Mismatched Array Lengths in Tuples
# ❌ Features has 5 rows but labels has 4 elements
features = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
labels = np.array([0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Raises ValueError: the arrays' first dimensions (5 vs. 4) do not match
Fix: Ensure all arrays in the tuple have the same length along the first axis:
# ✅ Both have 5 elements
labels = np.array([0, 1, 0, 1, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
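A cheap way to catch this early, with a clearer error message, is to assert the alignment before building the pipeline:

```python
import numpy as np

features = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
labels = np.array([0, 1, 0, 1, 1])

# Fail fast with a readable message if the splits ever drift apart
assert features.shape[0] == labels.shape[0], (
    f"features/labels length mismatch: {features.shape[0]} vs {labels.shape[0]}"
)
```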
Mistake 2: Forgetting to Batch Before Training
# ❌ Unbatched dataset: model.fit() will process one sample at a time
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
model.fit(dataset, epochs=5) # Extremely slow
Fix: Always batch your dataset:
# ✅ Batched for efficient training
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
model.fit(dataset, epochs=5)
Using an unbatched dataset with model.fit() technically works, but it processes one sample per step, which is orders of magnitude slower than batched training and does not leverage GPU parallelism.
Conclusion
Loading NumPy data into TensorFlow is simple and efficient using tf.data.Dataset.from_tensor_slices(). This function handles 1D arrays, 2D matrices, tuples of feature-label pairs, and dictionaries of named inputs.
Once your data is in a tf.data.Dataset, you can chain operations like shuffle, batch, map, and prefetch to build optimized training pipelines.
This approach is the recommended way to feed NumPy data into TensorFlow models, offering better performance and memory management than passing raw arrays directly to model.fit().