
How to Filter Invalid CSV Rows in Python

In data science and backend development, CSV files often contain malformed data, such as missing values, incorrect column counts, or wrong data types. Processing this data without validation can lead to application crashes or skewed analysis.

This guide explores how to detect and remove invalid rows using Python's built-in csv module, covering structural validation, type checking, and deduplication.

Understanding Invalid CSV Data

Invalid data typically falls into three categories:

  1. Structural Errors: A row has more or fewer columns than the header.
  2. Missing Data: Required fields contain None or empty strings ("").
  3. Type Mismatches: A numeric column (like 'Age') contains text (like "Twenty").
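
For instance, in a hypothetical Name,Age file, each category might look like this (the sample rows are made up for illustration):

Name,Age
John,29
Jane
,31
Mark,Twenty

Here Jane's row has too few columns (structural error), the third data row has an empty Name field (missing data), and Mark's Age is text rather than a number (type mismatch).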

Note: When working with CSVs in Python, always open files with newline='' to ensure consistent line break handling across operating systems.
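
For example, a minimal reading loop that follows this recommendation might look like the sketch below (the file name data.csv is just a placeholder):

import csv

# Open with newline='' so the csv module handles line endings itself
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is returned as a list of strings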

Method 1: Filtering by Structure (Column Counts)

The most basic validation ensures that every data row has the same number of columns as the header row.

Consider the following data.csv as input for the example below:

data.csv
Name,Age,City
John Doe,29,New York
Jane Smith,25
,,

Example:

import csv

def filter_structural_errors(input_file):
    valid_rows = []

    with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # Read header
        expected_cols = len(header)

        valid_rows.append(header)

        for row in reader:
            # ⛔️ Logic to avoid: Trusting every row blindly
            # process(row)

            # ✅ Correct: Check length and empty values
            if len(row) != expected_cols:
                print(f"Skipping malformed row (Column count mismatch): {row}")
                continue

            if any(not cell.strip() for cell in row):
                print(f"Skipping row with empty values: {row}")
                continue

            valid_rows.append(row)

    return valid_rows


filter_structural_errors('data.csv')

Output:

Skipping malformed row (Column count mismatch): ['Jane Smith', '25']
Skipping row with empty values: ['', '', '']
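
The function returns the surviving rows, header included, so you can write them straight back out with csv.writer. A minimal sketch, assuming clean_data.csv as the output file name:

import csv

valid_rows = filter_structural_errors('data.csv')

# Write only the rows that passed validation (the output file name is an assumption)
with open('clean_data.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(valid_rows)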

Method 2: Validating Data Types

Once the structure is correct, you must ensure the data inside the columns is valid (e.g., verifying that an ID is an integer or an email follows a pattern).

import re

def is_valid_content(row):
    """
    Validates specific columns:
      Index 0 (Name): String
      Index 1 (Age): Integer
      Index 2 (Email): Email Regex
    """
    name, age_str, email = row

    try:
        # 1. Validate Age
        age = int(age_str)
        if age < 0 or age > 120:
            return False

        # 2. Validate Email Regex
        email_pattern = r"[^@]+@[^@]+\.[^@]+"
        if not re.match(email_pattern, email):
            return False

        return True
    except ValueError:
        # Age was not a number
        return False

# Example usage on a single row
row_good = ["Alice", "30", "alice@example.com"]
row_bad = ["Bob", "Thirty", "bob@example"]

print(f"Row Good: {is_valid_content(row_good)}")  # Output: Row Good: True
print(f"Row Bad: {is_valid_content(row_bad)}")    # Output: Row Bad: False

Method 3: Removing Duplicate Rows

Duplicates can skew statistical analysis. To remove them, track rows using a Python set. Note that lists are not hashable (cannot be added to a set), so rows must be converted to tuples first.

def remove_duplicates(rows):
    unique_rows = set()
    cleaned_data = []

    for row in rows:
        # Convert list to tuple to make it hashable
        row_tuple = tuple(row)

        if row_tuple not in unique_rows:
            unique_rows.add(row_tuple)
            cleaned_data.append(row)  # Keep the original list format

    return cleaned_data
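
For example, calling the function on a small list of rows keeps only the first occurrence of each duplicate (the sample rows are made up for illustration):

rows = [
    ["Alice", "30"],
    ["Bob", "25"],
    ["Alice", "30"],  # Exact duplicate of the first row
]

print(remove_duplicates(rows))
# Output: [['Alice', '30'], ['Bob', '25']]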

Building a Complete Cleaning Pipeline

Combining these techniques allows you to create a robust script that reads a raw file and produces a clean dataset.

import csv

def clean_csv_dataset(input_path, output_path):
    unique_set = set()

    with open(input_path, 'r', newline='') as infile, \
         open(output_path, 'w', newline='') as outfile:

        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write Header
        header = next(reader, None)
        if header is None:
            return  # Empty input file, nothing to clean

        writer.writerow(header)
        expected_cols = len(header)

        for row in reader:
            # 1. Structural Check
            if len(row) != expected_cols:
                continue

            # 2. Type/Content Check (Custom logic based on your data)
            # Assuming Column 1 is Age (must be numeric)
            if not row[1].isdigit():
                continue

            # 3. Deduplication
            row_tuple = tuple(row)
            if row_tuple in unique_set:
                continue
            unique_set.add(row_tuple)

            # ✅ Write Valid Row
            writer.writerow(row)

# Usage
# clean_csv_dataset('raw_data.csv', 'clean_data.csv')

Warning: Using a set for deduplication stores every unique row in memory. For extremely large datasets (gigabytes), this may consume too much RAM. In such cases, consider using a library like Pandas or processing the file in chunks.
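
As a rough sketch of the Pandas approach (assuming Pandas is installed; the file names are placeholders), dropna and drop_duplicates replace the manual checks:

import pandas as pd

df = pd.read_csv('raw_data.csv')    # Read the raw file into a DataFrame
df = df.dropna()                    # Drop rows with missing values
df = df.drop_duplicates()           # Remove exact duplicate rows
df.to_csv('clean_data.csv', index=False)

For very large files, read_csv also accepts a chunksize argument so the data can be processed in pieces rather than loaded all at once.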

Conclusion

Filtering invalid CSV rows ensures data integrity for downstream applications.

  1. Structure First: Always validate column counts (len(row)) before accessing indices to prevent IndexError.
  2. Validate Types: Use try-except blocks to filter out data with incorrect types (e.g., strings in numeric columns).
  3. Deduplicate: Use sets to remove exact duplicates efficiently.