Python Pandas: How to Select DataFrame Columns Starting With a Specific String
When working with Pandas DataFrames, especially those with many columns or dynamically generated names, you often need to select a subset of columns based on a naming pattern. A common requirement is to select all columns whose names begin with a particular prefix or string.
This guide explains several effective methods to select Pandas DataFrame columns that start with a given string, using list comprehensions, Index.str.startswith(), and DataFrame.filter(regex=...).
The Goal: Selecting Columns by Prefix
Given a Pandas DataFrame, we want to create a new DataFrame (or a view) that contains only those columns whose names (headers) begin with a specified string pattern. For example, selecting all columns that start with "Sales_" or "User_".
Example DataFrame
import pandas as pd
data = {
'CustomerID': [101, 102, 103, 104],
'CustomerName': ['Alice WidgetCorp', 'Bob Solutions Ltd', 'Charlie Innovations', 'Diana Services Inc'],
'Region': ['North', 'South', 'North', 'West'],
'Sales_Q1': [1500, 2200, 1800, 2500],
'Sales_Q2': [1700, 2000, 1900, 2800],
'Support_Tickets': [2, 1, 3, 1],
'sales_rep': ['John', 'Jane', 'Mike', 'Anna'] # Lowercase 'sales'
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
Output:
Original DataFrame:
CustomerID CustomerName Region Sales_Q1 Sales_Q2 \
0 101 Alice WidgetCorp North 1500 1700
1 102 Bob Solutions Ltd South 2200 2000
2 103 Charlie Innovations North 1800 1900
3 104 Diana Services Inc West 2500 2800
Support_Tickets sales_rep
0 2 John
1 1 Jane
2 3 Mike
3 1 Anna
Let's say we want to select all columns starting with "Customer" or "Sales".
Method 1: Using List Comprehension with str.startswith() (Pythonic & Clear)
This approach iterates through the column names and uses the built-in string method startswith().
Case-Sensitive Selection
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Region': ['North'],
'Sales_Q1': [1500], 'Sales_Q2': [1700], 'Support_Tickets': [2]
})
prefix_to_match = 'Sales'
# ✅ Create a list of column names that start with the prefix
selected_column_names_lc = [
col_name for col_name in df_example.columns
if col_name.startswith(prefix_to_match)
]
print(f"Column names starting with '{prefix_to_match}': {selected_column_names_lc}\n")
# Use this list to select the columns from the DataFrame
df_selected_lc = df_example[selected_column_names_lc]
print(f"DataFrame with columns starting with '{prefix_to_match}':")
print(df_selected_lc)
Output:
Column names starting with 'Sales': ['Sales_Q1', 'Sales_Q2']
DataFrame with columns starting with 'Sales':
Sales_Q1 Sales_Q2
0 1500 1700
Case-Insensitive Selection
To perform a case-insensitive match, convert both the column name and the prefix to a consistent case (e.g., lowercase) before comparison.
import pandas as pd
df_example = pd.DataFrame({
'Sales_Q1': [1500], 'sales_rep': ['John'], 'CustomerID': [101]
})
prefix_lower = 'sales' # Search for 'sales' case-insensitively
selected_column_names_lc_ci = [
col_name for col_name in df_example.columns
if col_name.lower().startswith(prefix_lower)
]
print(f"Column names starting with '{prefix_lower}' (case-insensitive): {selected_column_names_lc_ci}\n")
df_selected_lc_ci = df_example[selected_column_names_lc_ci]
print(f"DataFrame with columns starting with '{prefix_lower}' (case-insensitive):")
print(df_selected_lc_ci)
Output:
Column names starting with 'sales' (case-insensitive): ['Sales_Q1', 'sales_rep']
DataFrame with columns starting with 'sales' (case-insensitive):
Sales_Q1 sales_rep
0 1500 John
Method 2: Using df.columns.str.startswith() with df.loc or Direct Indexing (Vectorized)
Pandas Index objects (like df.columns) have a .str accessor that provides vectorized string methods.
Creating a Boolean Mask
df.columns.str.startswith(prefix) returns a boolean array indicating which column names start with the prefix.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500]
})
prefix_to_match = 'Customer'
# ✅ Create a boolean mask for column names
column_mask = df_example.columns.str.startswith(prefix_to_match)
print(f"Boolean mask for columns starting with '{prefix_to_match}':")
print(column_mask)
Output:
Boolean mask for columns starting with 'Customer':
[ True True False]
Applying the Mask with .loc (Recommended)
Use df.loc[:, column_mask] to select all rows (:) and the columns where the mask is True.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500]
})
prefix_to_match = 'Customer'
column_mask = df_example.columns.str.startswith(prefix_to_match)
# ✅ Select columns using .loc and the boolean mask
df_selected_loc = df_example.loc[:, column_mask]
print(f"DataFrame with columns starting with '{prefix_to_match}' (using .loc):")
print(df_selected_loc)
Output:
DataFrame with columns starting with 'Customer' (using .loc):
CustomerID CustomerName
0 101 Alice
Applying the Mask with Direct Bracket Indexing
You can also use the boolean mask directly with df[df.columns[column_mask]].
import pandas as pd
df_selected_direct = df_example[df_example.columns[column_mask]]
print(f"DataFrame with columns starting with '{prefix_to_match}' (direct indexing):")
print(df_selected_direct) # Same output as .loc
Output:
DataFrame with columns starting with 'Customer' (using .loc):
CustomerID CustomerName
0 101 Alice
While df.loc[:, column_mask] is generally preferred for clarity and explicit indexing, df[df.columns[column_mask]] also works because df.columns[column_mask] returns an Index of the selected column names.
For case-insensitive matching with str.startswith() on an Index, you'd need to convert the index to a Series or list first to use .lower(), or use flags=re.IGNORECASE if using str.contains() with a regex pattern (see Method 3).
Method 3: Using DataFrame.filter(regex=...) (Powerful for Patterns)
The DataFrame.filter() method can subset rows or columns. Using its regex parameter with axis=1 allows powerful pattern matching on column names. To match columns starting with a string, use the caret ^ in the regex to anchor the pattern to the beginning of the name.
import pandas as pd
df_example = pd.DataFrame({
'CustomerID': [101], 'CustomerName': ['Alice'], 'Sales_Q1': [1500], 'sales_rep': ['John']
})
prefix_to_match_exact_case = 'Sales'
# Regex: r'^Sales' matches columns that begin with "Sales" (case-sensitive)
df_selected_filter_cs = df_example.filter(regex=r'^' + prefix_to_match_exact_case, axis=1)
print(f"Columns starting with '{prefix_to_match_exact_case}' (filter, case-sensitive):")
print(df_selected_filter_cs)
print()
# For case-insensitive matching with regex, use the (?i) flag in the pattern
prefix_any_case = 'sales'
df_selected_filter_ci = df_example.filter(regex=r'(?i)^' + prefix_any_case, axis=1)
# Or, you can use re.IGNORECASE if constructing pattern carefully:
# import re
# df_selected_filter_ci_re = df.filter(regex=re.compile(r'^' + prefix_any_case, flags=re.IGNORECASE), axis=1)
print(f"Columns starting with '{prefix_any_case}' (filter, case-insensitive):")
print(df_selected_filter_ci)
Output:
Columns starting with 'Sales' (filter, case-sensitive):
Sales_Q1
0 1500
Columns starting with 'sales' (filter, case-insensitive):
Sales_Q1 sales_rep
0 1500 John
axis=1: Specifies that the filter should be applied to column names.regex=r'^Prefix': The^asserts the start of the string.regex=r'(?i)^Prefix': The(?i)flag within the regex pattern makes the match case-insensitive.
Choosing the Right Method
- List comprehension with
col.startswith(prefix): Very Pythonic, clear, and good for simple prefix matching. Easy to adapt for case-insensitivity. df.columns.str.startswith(prefix)then.loc: Vectorized and efficient. Good for boolean mask operations. Case-insensitivity is a bit more involved directly on the Index'sstr.startswith.df.filter(regex=r'^Prefix', axis=1): Most powerful if you need more complex starting patterns beyond a simple prefix (e.g., "starts with 'Sales_' followed by a digit"). Handles case-insensitivity well with regex flags.
For simple prefix matching, the list comprehension or df.columns.str.startswith() are often preferred for readability. For more complex patterns, filter(regex=...) excels.
Conclusion
Selecting Pandas DataFrame columns that start with a specific string can be achieved in several ways:
- Use a list comprehension iterating over
df.columnswithcol_name.startswith(prefix). - Create a boolean mask using
df.columns.str.startswith(prefix)and apply it withdf.loc[:, mask]. - Employ
df.filter(regex=r'^' + prefix, axis=1)for a concise and powerful regex-based solution.
Remember to handle case sensitivity as needed by converting strings to a consistent case or using appropriate flags/parameters (case=False for str.contains if used, or re.IGNORECASE/(?i) for regex).
These methods provide robust control over selecting columns based on their naming patterns.