How to Filter Invalid CSV Rows in Python
In data science and backend development, CSV files often contain malformed data: missing values, incorrect column counts, or wrong data types. Processing such data without validation can lead to application crashes or skewed analysis.
This guide explores how to detect and remove invalid rows using Python's built-in csv module, covering structural validation, type checking, and deduplication.
Understanding Invalid CSV Data
Invalid data typically falls into three categories:
- Structural Errors: A row has more or fewer columns than the header.
- Missing Data: Required fields contain None or empty strings ("").
- Type Mismatches: A numeric column (like 'Age') contains text (like "Twenty").
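For illustration, here is a hypothetical raw excerpt (made-up names and values) in which the first data row is valid, the second has too few columns, the third leaves Age empty, and the fourth puts text in the numeric Age column:

Name,Age,City
John Doe,29,New York
Jane Smith,25
Mary Major,,Boston
Sam Roe,Twenty,Chicago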
When working with CSVs in Python, always open files with newline='' to ensure consistent line break handling across operating systems.
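A minimal sketch of that recommended open() call, assuming a data.csv like the one used in the next section:

import csv

# newline='' lets the csv module handle \r\n vs. \n itself
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is a list of strings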
Method 1: Filtering by Structure (Column Counts)
The most basic validation ensures that every data row has the same number of columns as the header row.
Consider the following data.csv as input for the example below:
Name,Age,City
John Doe,29,New York
Jane Smith,25
,,
Example:
import csv

def filter_structural_errors(input_file):
    valid_rows = []
    with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # Read header
        expected_cols = len(header)
        valid_rows.append(header)

        for row in reader:
            # ⛔️ Logic to avoid: Trusting every row blindly
            # process(row)

            # ✅ Correct: Check length and empty values
            if len(row) != expected_cols:
                print(f"Skipping malformed row (Column count mismatch): {row}")
                continue
            if any(not cell.strip() for cell in row):
                print(f"Skipping row with empty values: {row}")
                continue
            valid_rows.append(row)

    return valid_rows

filter_structural_errors('data.csv')
Output:
Skipping malformed row (Column count mismatch): ['Jane Smith', '25']
Skipping row with empty values: ['', '', '']
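Because filter_structural_errors returns the surviving rows (header included), you can persist them with csv.writer. A minimal sketch, writing to a hypothetical file named clean_structure.csv:

import csv

valid_rows = filter_structural_errors('data.csv')

# Write the surviving rows (header included) to a new file
with open('clean_structure.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(valid_rows)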
Method 2: Validating Data Types
Once the structure is correct, you must ensure the data inside the columns is valid (e.g., verifying that an age is an integer or that an email follows a pattern).
import re

def is_valid_content(row):
    """
    Validates specific columns:
    Index 0 (Name): String
    Index 1 (Age): Integer
    Index 2 (Email): Email Regex
    """
    name, age_str, email = row
    try:
        # 1. Validate Age
        age = int(age_str)
        if age < 0 or age > 120:
            return False

        # 2. Validate Email Regex
        email_pattern = r"[^@]+@[^@]+\.[^@]+"
        if not re.match(email_pattern, email):
            return False

        return True
    except ValueError:
        # Age was not a number
        return False

# Example usage on a single row
row_good = ["Alice", "30", "alice@example.com"]
row_bad = ["Bob", "Thirty", "bob@example"]

print(f"Row Good: {is_valid_content(row_good)}")  # Output: Row Good: True
print(f"Row Bad: {is_valid_content(row_bad)}")    # Output: Row Bad: False
Method 3: Removing Duplicate Rows
Duplicates can skew statistical analysis. To remove them, track rows using a Python set. Note that lists are not hashable (cannot be added to a set), so rows must be converted to tuples first.
def remove_duplicates(rows):
    unique_rows = set()
    cleaned_data = []
    for row in rows:
        # Convert list to tuple to make it hashable
        row_tuple = tuple(row)
        if row_tuple not in unique_rows:
            unique_rows.add(row_tuple)
            cleaned_data.append(row)  # Keep the original list format
    return cleaned_data
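A quick usage sketch with made-up in-memory rows:

rows = [
    ["Alice", "30", "New York"],
    ["Bob", "41", "Chicago"],
    ["Alice", "30", "New York"],  # exact duplicate of the first row
]

print(remove_duplicates(rows))
# [['Alice', '30', 'New York'], ['Bob', '41', 'Chicago']]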
Building a Complete Cleaning Pipeline
Combining these techniques allows you to create a robust script that reads a raw file and produces a clean dataset.
import csv
import re

def clean_csv_dataset(input_path, output_path):
    unique_set = set()

    with open(input_path, 'r', newline='') as infile, \
         open(output_path, 'w', newline='') as outfile:

        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write Header
        header = next(reader, None)
        if header:
            writer.writerow(header)
            expected_cols = len(header)

            for row in reader:
                # 1. Structural Check
                if len(row) != expected_cols:
                    continue

                # 2. Type/Content Check (Custom logic based on your data)
                # Assuming Column 1 is Age (must be numeric)
                if not row[1].isdigit():
                    continue

                # 3. Deduplication
                row_tuple = tuple(row)
                if row_tuple in unique_set:
                    continue
                unique_set.add(row_tuple)

                # ✅ Write Valid Row
                writer.writerow(row)

# Usage
# clean_csv_dataset('raw_data.csv', 'clean_data.csv')
Using a set for deduplication stores every unique row in memory. For extremely large datasets (gigabytes), this may consume too much RAM. In such cases, consider using libraries like Pandas or processing the file in chunks, as in the sketch below.
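For instance, here is a minimal sketch of the chunked approach using Pandas. The column name 'Age', the chunk size, and the function name clean_large_csv are assumptions; note that deduplicating across chunks still requires remembering which rows have been seen, so memory grows with the number of unique rows.

import pandas as pd

def clean_large_csv(input_path, output_path, chunk_size=100_000):
    seen = set()         # row tuples already written (assumption: unique rows fit in memory)
    first_chunk = True

    for chunk in pd.read_csv(input_path, chunksize=chunk_size, dtype=str):
        chunk = chunk.dropna()                     # drop rows with missing values
        chunk = chunk[chunk['Age'].str.isdigit()]  # keep rows with a numeric Age

        # Cross-chunk deduplication via a set of row tuples
        keep = []
        for row in chunk.itertuples(index=False, name=None):
            if row in seen:
                keep.append(False)
            else:
                seen.add(row)
                keep.append(True)
        chunk = chunk[keep]

        # Write the header only for the first chunk, then append
        chunk.to_csv(output_path, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False

# clean_large_csv('raw_data.csv', 'clean_data.csv')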
Conclusion
Filtering invalid CSV rows ensures data integrity for downstream applications.
- Structure First: Always validate column counts (len(row)) before accessing indices to prevent IndexError.
- Validate Types: Use try-except blocks to filter out data with incorrect types (e.g., strings in numeric columns).
- Deduplicate: Use sets to remove exact duplicates efficiently.