How to Load Text Data in TensorFlow Using Python
Text data is everywhere: documentation, social media posts, blog articles, customer reviews, and support tickets. Before using text in machine learning models, you need to load and preprocess it efficiently. TensorFlow provides built-in utilities that simplify loading text datasets from directories, splitting them into training and validation sets, and preparing them for model training, all in just a few lines of code.
In this guide, you will learn how to download a text dataset, load it into TensorFlow using text_dataset_from_directory(), inspect the data, and prepare it for training a text classification model.
Setting Up and Downloading the Dataset
For this guide, we will use the Stack Overflow 16k dataset, which contains programming questions categorized by language (Java, Python, C#, and JavaScript). TensorFlow's Keras API provides a convenient method to download and extract the dataset:
import tensorflow as tf
from tensorflow import keras
import pathlib
# Download and extract the dataset
url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
download = keras.utils.get_file(
    origin=url,
    untar=True,
    cache_dir='stack_overflow'
)
DATA_DIR = pathlib.Path(download).parent
print("Dataset contents:", [p.name for p in DATA_DIR.iterdir()])
print("Training categories:", [p.name for p in (DATA_DIR / "train").iterdir()])
Output:
Dataset contents: ['train', 'stack_overflow_16k.tar.gz', 'test', 'README.md']
Training categories: ['java', 'python', 'csharp', 'javascript']
The dataset is organized into a standard directory structure where each subdirectory represents a class label:
stack_overflow/
├── train/
│ ├── java/ (2000 text files)
│ ├── python/ (2000 text files)
│ ├── csharp/ (2000 text files)
│ └── javascript/ (2000 text files)
├── test/
│ ├── java/
│ ├── python/
│ ├── csharp/
│ └── javascript/
└── README.md
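This labeled-subdirectory convention is easy to reproduce for your own data. As an illustration, the following sketch builds a tiny mock of the same layout in a temporary directory and counts the files per class (the filenames and contents are toy placeholders, purely for demonstration):

```python
import pathlib
import tempfile

# Build a tiny mock of the same labeled-directory layout
# (hypothetical toy data, just to illustrate the convention)
root = pathlib.Path(tempfile.mkdtemp()) / "train"
for label in ["csharp", "java", "javascript", "python"]:
    class_dir = root / label
    class_dir.mkdir(parents=True)
    for i in range(3):  # 3 tiny "question" files per class
        (class_dir / f"{i}.txt").write_text(f"sample {label} question {i}")

# Count files per class, exactly as a loader would see them
counts = {d.name: len(list(d.glob("*.txt"))) for d in sorted(root.iterdir())}
print(counts)  # {'csharp': 3, 'java': 3, 'javascript': 3, 'python': 3}
```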
keras.utils.get_file() automatically caches the download, so running it again will skip the download step and use the local copy.
Inspecting the Dataset Structure
Before loading the data, verify the number of text files per category:
TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"
for category in pathlib.Path(TRAIN_DIR).iterdir():
    count = len(list(category.iterdir()))
    print(f"{category.name}: {count} text files")
Output:
java: 2000 text files
python: 2000 text files
csharp: 2000 text files
javascript: 2000 text files
The training set contains 8,000 text files evenly distributed across 4 categories.
Loading Text With text_dataset_from_directory()
TensorFlow's keras.utils.text_dataset_from_directory() is the primary function for loading text data organized in directories. It automatically:
- Reads text files from the specified directory.
- Assigns labels based on subdirectory names.
- Batches the data for efficient processing.
- Splits the data into training and validation subsets.
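The label assignment follows the alphabetical order of the subdirectory names. A minimal pure-Python sketch of that rule (this models the observable behavior, not TensorFlow's internal code):

```python
# Sketch of the label-assignment rule: subdirectory names are sorted
# alphabetically, and each class gets its index in that sorted order.
subdirs = ['java', 'python', 'csharp', 'javascript']  # as found on disk
class_names = sorted(subdirs)
label_map = {name: idx for idx, name in enumerate(class_names)}
print(class_names)  # ['csharp', 'java', 'javascript', 'python']
print(label_map)    # {'csharp': 0, 'java': 1, 'javascript': 2, 'python': 3}
```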
Creating Training and Validation Datasets
# Load 80% of the data for training
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42
)

# Load 20% of the data for validation
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42
)
Output:
Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
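These numbers follow directly from the split arithmetic. A quick check, assuming the 8,000-file training set and the parameters used above:

```python
# Check the split arithmetic: 8000 files, validation_split=0.2
total_files = 8000
validation_split = 0.2
val_files = int(total_files * validation_split)  # 1600
train_files = total_files - val_files            # 6400

# With batch_size=32, the number of batches per subset:
batch_size = 32
train_batches = train_files // batch_size        # 200
val_batches = val_files // batch_size            # 50
print(train_files, val_files, train_batches, val_batches)  # 6400 1600 200 50
```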
Key Parameters Explained
| Parameter | Description |
|---|---|
| batch_size | Number of samples per batch (default: 32) |
| validation_split | Fraction of data to reserve for validation (e.g., 0.2 = 20%) |
| subset | Which subset to return: 'training' or 'validation' |
| seed | Random seed to ensure the training/validation split is consistent across both calls |
Use the same seed for training and validation splits. If you use different seeds (or omit the seed), the training and validation sets may overlap, leading to data leakage and unreliable model evaluation:
# ❌ Different seeds: splits may overlap
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='training', seed=42)
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='validation', seed=99)

# ✅ Same seed: splits are complementary and non-overlapping
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='training', seed=42)
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='validation', seed=42)
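To see why the shared seed matters, here is a simplified pure-Python model of a seeded split (an illustration of the mechanism, not TensorFlow's actual implementation): shuffling the file list with a fixed seed and then slicing it is what makes the two subsets complementary.

```python
import random

def split(files, seed, subset, val_fraction=0.2):
    # Simplified model of a seeded split: shuffle deterministically,
    # then slice the validation fraction off the end.
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[:-n_val] if subset == 'training' else shuffled[-n_val:]

files = [f"file_{i}.txt" for i in range(100)]

# Same seed: the two subsets are disjoint and together cover every file
train = split(files, seed=42, subset='training')
val = split(files, seed=42, subset='validation')
assert set(train).isdisjoint(val) and len(train) + len(val) == 100

# Different seeds: the shuffles disagree, so some files can land
# in both subsets (data leakage)
val_other = split(files, seed=99, subset='validation')
print(len(set(train) & set(val_other)))  # overlap is likely nonzero
```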
Inspecting the Loaded Data
Viewing Class Names
The dataset automatically maps subdirectory names to integer labels:
print("Class names:", training_data.class_names)
Output:
Class names: ['csharp', 'java', 'javascript', 'python']
Examining Sample Batches
Each batch contains text tensors paired with integer labels:
for texts, labels in training_data.take(1):
    print(f"Batch shape: {texts.shape}")
    print(f"Labels shape: {labels.shape}")
    print(f"\nFirst 3 labels: {labels[:3].numpy()}")
    print(f"Label names: {[training_data.class_names[l] for l in labels[:3].numpy()]}")
    print(f"\nSample text (truncated):\n{texts[0].numpy()[:200]}...")
Output:
Batch shape: (32,)
Labels shape: (32,)
First 3 labels: [2 3 0]
Label names: ['javascript', 'python', 'csharp']
Sample text (truncated):
b"how to pass a value from one function to another function in javascript ...
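Note the b"..." prefix: TensorFlow string tensors hold UTF-8 bytes, and calling .numpy() on one returns Python bytes. A small sketch of turning such a value back into readable text (here raw is a stand-in for texts[0].numpy() from the batch above):

```python
# String tensors yield UTF-8 byte strings via .numpy();
# decode them to str for human-readable display.
raw = b"how to pass a value from one function to another function in javascript"
text = raw.decode("utf-8")
print(text[:30])  # 'how to pass a value from one f'
```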
Loading Test Data
The test set is loaded separately without a validation split:
test_data = keras.utils.text_dataset_from_directory(
    TEST_DIR,
    batch_size=32
)
Output:
Found 8000 files belonging to 4 classes.
Optimizing the Dataset Pipeline
For better training performance, apply caching and prefetching to avoid I/O bottlenecks:
AUTOTUNE = tf.data.AUTOTUNE
training_data = training_data.cache().prefetch(buffer_size=AUTOTUNE)
validation_data = validation_data.cache().prefetch(buffer_size=AUTOTUNE)
test_data = test_data.cache().prefetch(buffer_size=AUTOTUNE)
What cache() and prefetch() do:
- cache(): Stores the dataset in memory after the first epoch, so subsequent epochs read from memory instead of disk.
- prefetch(AUTOTUNE): Prepares the next batch while the current batch is being processed by the model, overlapping data loading with training.
These two operations together can significantly reduce training time, especially with large text datasets.
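The effect of cache() can be illustrated with a simplified pure-Python model, where memoization stands in for tf.data's in-memory cache and a counter stands in for slow file I/O:

```python
# Simplified model of cache(): the expensive read happens once per file,
# and later "epochs" are served from memory.
reads = 0

def read_from_disk(i):
    global reads
    reads += 1  # stands in for slow file I/O
    return f"text {i}"

cache = {}

def cached_read(i):
    if i not in cache:
        cache[i] = read_from_disk(i)
    return cache[i]

# Epoch 1 populates the cache; epochs 2 and 3 reuse it.
for epoch in range(3):
    batch = [cached_read(i) for i in range(5)]

print(reads)  # 5 — each file was read from "disk" only once
```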
Putting It All Together: A Complete Example
Here is a complete workflow from downloading the data to preparing it for model training:
import tensorflow as tf
from tensorflow import keras
import pathlib
# Step 1: Download and extract the dataset
url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
download = keras.utils.get_file(origin=url, untar=True, cache_dir='stack_overflow')
DATA_DIR = pathlib.Path(download).parent
TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"
# Step 2: Load training and validation data
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, batch_size=32, validation_split=0.2,
    subset='training', seed=42
)
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, batch_size=32, validation_split=0.2,
    subset='validation', seed=42
)

# Step 3: Load test data
test_data = keras.utils.text_dataset_from_directory(
    TEST_DIR, batch_size=32
)
# Step 4: Inspect
print(f"Classes: {training_data.class_names}")
print(f"Training batches: {training_data.cardinality().numpy()}")
print(f"Validation batches: {validation_data.cardinality().numpy()}")
# Step 5: Optimize pipeline
AUTOTUNE = tf.data.AUTOTUNE
training_data = training_data.cache().prefetch(buffer_size=AUTOTUNE)
validation_data = validation_data.cache().prefetch(buffer_size=AUTOTUNE)
test_data = test_data.cache().prefetch(buffer_size=AUTOTUNE)
print("\nDatasets are ready for model training.")
Output:
Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
Found 8000 files belonging to 4 classes.
Classes: ['csharp', 'java', 'javascript', 'python']
Training batches: 200
Validation batches: 50
Datasets are ready for model training.
Conclusion
TensorFlow makes loading and preprocessing text data straightforward through keras.utils.text_dataset_from_directory().
This single function handles file reading, label assignment, batching, and train/validation splitting, all based on your directory structure.
Combined with pipeline optimizations like cache() and prefetch(), you can build efficient text data pipelines that keep your GPU fed during training.
This approach works for any text classification task where your data is organized into labeled subdirectories, from sentiment analysis to topic classification and language detection.