How to Filter Invalid CSV Rows in Python
In data science and backend development, CSV files often contain malformed data: missing values, incorrect column counts, or wrong data types. Processing such data without validation can lead to application crashes or skewed analysis.
This guide explores how to detect and remove invalid rows using Python's built-in csv module, covering structural validation, type checking, and deduplication.
Understanding Invalid CSV Data
Invalid data typically falls into three categories:
- Structural Errors: A row has more or fewer columns than the header.
- Missing Data: Required fields contain None or empty strings ("").
- Type Mismatches: A numeric column (like 'Age') contains text (like "Twenty").
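For illustration, here is a hypothetical raw excerpt (made-up names and values) in which the first data row is valid, the second has too few columns, the third leaves Age empty, and the fourth puts text in the numeric Age column:

Name,Age,City
John Doe,29,New York
Jane Smith,25
Mary Major,,Boston
Sam Roe,Twenty,Chicago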
When working with CSVs in Python, always open files with newline='' to ensure consistent line break handling across operating systems.
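A minimal sketch of that recommended open() call, assuming a data.csv like the one used in the next section:

import csv

# newline='' lets the csv module handle \r\n vs. \n itself
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # Each row is a list of strings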
Method 1: Filtering by Structure (Column Counts)
The most basic validation ensures that every data row has the same number of columns as the header row.
Consider the following data.csv as input for the example below:
Name,Age,City
John Doe,29,New York
Jane Smith,25
,,
Example:
import csv

def filter_structural_errors(input_file):
    valid_rows = []
    with open(input_file, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # Read header
        expected_cols = len(header)
        valid_rows.append(header)

        for row in reader:
            # ⛔️ Logic to avoid: Trusting every row blindly
            # process(row)

            # ✅ Correct: Check length and empty values
            if len(row) != expected_cols:
                print(f"Skipping malformed row (Column count mismatch): {row}")
                continue
            if any(not cell.strip() for cell in row):
                print(f"Skipping row with empty values: {row}")
                continue
            valid_rows.append(row)

    return valid_rows

filter_structural_errors('data.csv')
Output:
Skipping malformed row (Column count mismatch): ['Jane Smith', '25']
Skipping row with empty values: ['', '', '']
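Because filter_structural_errors returns the surviving rows (header included), you can persist them with csv.writer. A minimal sketch, writing to a hypothetical file named clean_structure.csv:

import csv

valid_rows = filter_structural_errors('data.csv')

# Write the surviving rows (header included) to a new file
with open('clean_structure.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(valid_rows)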
Method 2: Validating Data Types
Once the structure is correct, you must ensure the data inside the columns is valid (e.g., verifying that an age is an integer or that an email follows a pattern).
import re

def is_valid_content(row):
    """
    Validates specific columns:
    Index 0 (Name): String
    Index 1 (Age): Integer
    Index 2 (Email): Email Regex
    """
    name, age_str, email = row
    try:
        # 1. Validate Age
        age = int(age_str)
        if age < 0 or age > 120:
            return False

        # 2. Validate Email Regex
        email_pattern = r"[^@]+@[^@]+\.[^@]+"
        if not re.match(email_pattern, email):
            return False

        return True
    except ValueError:
        # Age was not a number
        return False

# Example usage on a single row
row_good = ["Alice", "30", "alice@example.com"]
row_bad = ["Bob", "Thirty", "bob@example"]

print(f"Row Good: {is_valid_content(row_good)}")  # Output: Row Good: True
print(f"Row Bad: {is_valid_content(row_bad)}")    # Output: Row Bad: False
Method 3: Removing Duplicate Rows
Duplicates can skew statistical analysis. To remove them, track rows using a Python set. Note that lists are not hashable (cannot be added to a set), so rows must be converted to tuples first.
def remove_duplicates(rows):
    unique_rows = set()
    cleaned_data = []
    for row in rows:
        # Convert list to tuple to make it hashable
        row_tuple = tuple(row)
        if row_tuple not in unique_rows:
            unique_rows.add(row_tuple)
            cleaned_data.append(row)  # Keep the original list format
    return cleaned_data
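A quick usage sketch with made-up in-memory rows:

rows = [
    ["Alice", "30", "New York"],
    ["Bob", "41", "Chicago"],
    ["Alice", "30", "New York"],  # exact duplicate of the first row
]

print(remove_duplicates(rows))
# [['Alice', '30', 'New York'], ['Bob', '41', 'Chicago']]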
Building a Complete Cleaning Pipeline
Combining these techniques allows you to create a robust script that reads a raw file and produces a clean dataset.
import csv
import re

def clean_csv_dataset(input_path, output_path):
    unique_set = set()

    with open(input_path, 'r', newline='') as infile, \
         open(output_path, 'w', newline='') as outfile:

        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        # Write Header
        header = next(reader, None)
        if header:
            writer.writerow(header)
            expected_cols = len(header)

            for row in reader:
                # 1. Structural Check
                if len(row) != expected_cols:
                    continue

                # 2. Type/Content Check (Custom logic based on your data)
                # Assuming Column 1 is Age (must be numeric)
                if not row[1].isdigit():
                    continue

                # 3. Deduplication
                row_tuple = tuple(row)
                if row_tuple in unique_set:
                    continue
                unique_set.add(row_tuple)

                # ✅ Write Valid Row
                writer.writerow(row)

# Usage
# clean_csv_dataset('raw_data.csv', 'clean_data.csv')
Using a set for deduplication stores every unique row in memory. For extremely large datasets (gigabytes), this may consume too much RAM. In such cases, consider using libraries like Pandas or processing the file in chunks, as in the sketch below.
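For instance, here is a minimal sketch of the chunked approach using Pandas. The column name 'Age', the chunk size, and the function name clean_large_csv are assumptions; note that deduplicating across chunks still requires remembering which rows have been seen, so memory grows with the number of unique rows.

import pandas as pd

def clean_large_csv(input_path, output_path, chunk_size=100_000):
    seen = set()         # row tuples already written (assumption: unique rows fit in memory)
    first_chunk = True

    for chunk in pd.read_csv(input_path, chunksize=chunk_size, dtype=str):
        chunk = chunk.dropna()                     # drop rows with missing values
        chunk = chunk[chunk['Age'].str.isdigit()]  # keep rows with a numeric Age

        # Cross-chunk deduplication via a set of row tuples
        keep = []
        for row in chunk.itertuples(index=False, name=None):
            if row in seen:
                keep.append(False)
            else:
                seen.add(row)
                keep.append(True)
        chunk = chunk[keep]

        # Write the header only for the first chunk, then append
        chunk.to_csv(output_path, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False

# clean_large_csv('raw_data.csv', 'clean_data.csv')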
Conclusion
Filtering invalid CSV rows ensures data integrity for downstream applications.
- Structure First: Always validate column counts (len(row)) before accessing indices to prevent IndexError.
- Validate Types: Use try-except blocks to filter out data with incorrect types (e.g., strings in numeric columns).
- Deduplicate: Use sets to remove exact duplicates efficiently.