How to Load Text Data in TensorFlow Using Python

Text data is everywhere: documentation, social media posts, blog articles, customer reviews, and support tickets. Before using text in machine learning models, you need to load and preprocess it efficiently. TensorFlow provides built-in utilities that simplify loading text datasets from directories, splitting them into training and validation sets, and preparing them for model training, all in just a few lines of code.

In this guide, you will learn how to download a text dataset, load it into TensorFlow using text_dataset_from_directory(), inspect the data, and prepare it for training a text classification model.

Setting Up and Downloading the Dataset

For this guide, we will use the Stack Overflow 16k dataset, which contains programming questions categorized by language (Java, Python, C#, and JavaScript). TensorFlow's Keras API provides a convenient method to download and extract the dataset:

import os
import pathlib

import tensorflow as tf
from tensorflow import keras

# Download and extract the dataset
url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

download = keras.utils.get_file(
    origin=url,
    untar=True,
    cache_dir='stack_overflow'
)

DATA_DIR = pathlib.Path(download).parent
print("Dataset contents:", os.listdir(DATA_DIR))
print("Training categories:", os.listdir(DATA_DIR / "train"))

Output:

Dataset contents: ['train', 'stack_overflow_16k.tar.gz', 'test', 'README.md']
Training categories: ['java', 'python', 'csharp', 'javascript']

The dataset is organized into a standard directory structure where each subdirectory represents a class label:

stack_overflow/
├── train/
│ ├── java/ (2000 text files)
│ ├── python/ (2000 text files)
│ ├── csharp/ (2000 text files)
│ └── javascript/ (2000 text files)
├── test/
│ ├── java/
│ ├── python/
│ ├── csharp/
│ └── javascript/
└── README.md
Note: keras.utils.get_file() automatically caches the download, so running it again skips the download step and uses the local copy.

Inspecting the Dataset Structure

Before loading the data, verify the number of text files per category:

TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"

for category in pathlib.Path(TRAIN_DIR).iterdir():
    count = len(list(category.iterdir()))
    print(f"{category.name}: {count} text files")

Output:

java: 2000 text files
python: 2000 text files
csharp: 2000 text files
javascript: 2000 text files

The training set contains 8,000 text files evenly distributed across the 4 categories.

Loading Text With text_dataset_from_directory()

TensorFlow's keras.utils.text_dataset_from_directory() is the primary function for loading text data organized in directories. It automatically:

  • Reads text files from the specified directory.
  • Assigns labels based on subdirectory names.
  • Batches the data for efficient processing.
  • Splits the data into training and validation subsets.

Creating Training and Validation Datasets

# Load 80% of the data for training
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42
)

# Load 20% of the data for validation
validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR,
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42
)

Output:

Found 8000 files belonging to 4 classes.
Using 6400 files for training.

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
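The file counts above follow directly from the split fraction, and the batch counts reported later follow from the batch size. A quick arithmetic check (plain Python, no TensorFlow needed):

```python
import math

total_files = 8000
validation_split = 0.2
batch_size = 32

# 80% of the files go to training, the rest to validation
train_files = int(total_files * (1 - validation_split))
val_files = total_files - train_files

# Batches per epoch: files divided by batch size, rounded up
train_batches = math.ceil(train_files / batch_size)
val_batches = math.ceil(val_files / batch_size)

print(train_files, val_files)      # 6400 1600
print(train_batches, val_batches)  # 200 50
```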

Key Parameters Explained

  • batch_size: number of samples per batch (default: 32).
  • validation_split: fraction of the data to reserve for validation (e.g., 0.2 = 20%).
  • subset: which subset to return: 'training' or 'validation'.
  • seed: random seed that keeps the training/validation split consistent across both calls.

Always use the same seed for training and validation splits

If you use different seeds (or omit the seed), the training and validation sets may overlap, leading to data leakage and unreliable model evaluation:

# ❌ Different seeds: splits may overlap
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='training', seed=42)

validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='validation', seed=99)

# ✅ Same seed: splits are complementary and non-overlapping
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='training', seed=42)

validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, validation_split=0.2, subset='validation', seed=42)

Inspecting the Loaded Data

Viewing Class Names

The dataset automatically maps subdirectory names, sorted alphabetically, to integer labels:

print("Class names:", training_data.class_names)

Output:

Class names: ['csharp', 'java', 'javascript', 'python']
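Because the subdirectories are sorted alphabetically, csharp gets label 0, java gets 1, and so on. A small stdlib sketch of how integer labels index into class_names (the label values here are illustrative, not taken from a real batch):

```python
# Class names as reported by the dataset (Keras sorts them alphabetically)
class_names = ['csharp', 'java', 'javascript', 'python']

# Hypothetical batch of integer labels as produced by the loader
labels = [2, 3, 0]

# Each integer label is simply an index into class_names
decoded = [class_names[i] for i in labels]
print(decoded)  # ['javascript', 'python', 'csharp']
```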

Examining Sample Batches

Each batch contains text tensors paired with integer labels:

for texts, labels in training_data.take(1):
    print(f"Batch shape: {texts.shape}")
    print(f"Labels shape: {labels.shape}")
    print(f"\nFirst 3 labels: {labels[:3].numpy()}")
    print(f"Label names: {[training_data.class_names[l] for l in labels[:3].numpy()]}")
    print(f"\nSample text (truncated):\n{texts[0].numpy()[:200]}...")

Output:

Batch shape: (32,)
Labels shape: (32,)

First 3 labels: [2 3 0]
Label names: ['javascript', 'python', 'csharp']

Sample text (truncated):
b"how to pass a value from one function to another function in javascript ...
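Note the b prefix in the output: calling .numpy() on a string tensor yields raw byte strings. To inspect them as Python strings, decode them first; a small sketch with a stand-in byte string:

```python
# Stand-in for one element of texts[0].numpy() from the batch above
raw = b"how to pass a value from one function to another function"

# Decode bytes to a regular Python string before text processing
text = raw.decode('utf-8')
print(text.split()[:3])  # ['how', 'to', 'pass']
```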

Loading Test Data

The test set is loaded separately without a validation split:

test_data = keras.utils.text_dataset_from_directory(
    TEST_DIR,
    batch_size=32
)

Output:

Found 8000 files belonging to 4 classes.

Optimizing the Dataset Pipeline

For better training performance, apply caching and prefetching to avoid I/O bottlenecks:

AUTOTUNE = tf.data.AUTOTUNE

training_data = training_data.cache().prefetch(buffer_size=AUTOTUNE)
validation_data = validation_data.cache().prefetch(buffer_size=AUTOTUNE)
test_data = test_data.cache().prefetch(buffer_size=AUTOTUNE)

What cache() and prefetch() do

  • cache(): Stores the dataset in memory after the first epoch, so subsequent epochs read from memory instead of disk.
  • prefetch(AUTOTUNE): Prepares the next batch while the current batch is being processed by the model, overlapping data loading with training.

These two operations together can significantly reduce training time, especially with large text datasets.
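Both transformations leave the data itself untouched; they only change how it is delivered. A minimal sketch on a toy tf.data pipeline (standing in for the loaded text dataset):

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# Toy dataset standing in for a loaded text dataset
ds = tf.data.Dataset.range(10)

# In-memory cache plus prefetching; elements pass through unchanged
ds = ds.cache().prefetch(buffer_size=AUTOTUNE)

print(list(ds.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

cache() with no argument keeps the data in memory; it also accepts a filename argument for an on-disk cache when the dataset does not fit in RAM.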

Putting It All Together: A Complete Example

Here is a complete workflow from downloading the data to preparing it for model training:

import pathlib

import tensorflow as tf
from tensorflow import keras

# Step 1: Download and extract the dataset
url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
download = keras.utils.get_file(origin=url, untar=True, cache_dir='stack_overflow')
DATA_DIR = pathlib.Path(download).parent

TRAIN_DIR = f"{DATA_DIR}/train"
TEST_DIR = f"{DATA_DIR}/test"

# Step 2: Load training and validation data
training_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, batch_size=32, validation_split=0.2,
    subset='training', seed=42
)

validation_data = keras.utils.text_dataset_from_directory(
    TRAIN_DIR, batch_size=32, validation_split=0.2,
    subset='validation', seed=42
)

# Step 3: Load test data
test_data = keras.utils.text_dataset_from_directory(
    TEST_DIR, batch_size=32
)

# Step 4: Inspect
print(f"Classes: {training_data.class_names}")
print(f"Training batches: {training_data.cardinality().numpy()}")
print(f"Validation batches: {validation_data.cardinality().numpy()}")

# Step 5: Optimize pipeline
AUTOTUNE = tf.data.AUTOTUNE
training_data = training_data.cache().prefetch(buffer_size=AUTOTUNE)
validation_data = validation_data.cache().prefetch(buffer_size=AUTOTUNE)
test_data = test_data.cache().prefetch(buffer_size=AUTOTUNE)

print("\nDatasets are ready for model training.")

Output:

Found 8000 files belonging to 4 classes.
Using 6400 files for training.
Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
Found 8000 files belonging to 4 classes.
Classes: ['csharp', 'java', 'javascript', 'python']
Training batches: 200
Validation batches: 50

Datasets are ready for model training.

Conclusion

TensorFlow makes loading and preprocessing text data straightforward through keras.utils.text_dataset_from_directory().

This single function handles file reading, label assignment, batching, and train/validation splitting, all driven by your directory structure.

Combined with pipeline optimizations like cache() and prefetch(), you can build efficient text data pipelines that keep your GPU fed during training.

This approach works for any text classification task where your data is organized into labeled subdirectories, from sentiment analysis to topic classification and language detection.
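For example, a sentiment-analysis dataset only needs the same one-subdirectory-per-class layout. A stdlib sketch that builds a tiny hypothetical layout (the reviews/pos/neg names and sample texts are made up for illustration):

```python
import pathlib
import tempfile

# Build a tiny hypothetical sentiment-analysis layout:
# one subdirectory per class, one text file per example
root = pathlib.Path(tempfile.mkdtemp()) / "reviews" / "train"
samples = {
    "pos": ["great movie, loved it", "wonderful acting"],
    "neg": ["terrible plot", "would not recommend"],
}
for label, texts in samples.items():
    class_dir = root / label
    class_dir.mkdir(parents=True)
    for i, text in enumerate(texts):
        (class_dir / f"{i}.txt").write_text(text)

print(sorted(p.name for p in root.iterdir()))  # ['neg', 'pos']

# The loader call then applies unchanged (assuming TensorFlow is installed):
# keras.utils.text_dataset_from_directory(root, batch_size=32)
```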