Python Pandas: How to Split Strings into Lists or Columns Using Pandas str.split()
Splitting string data into separate components is a common task in data cleaning and preparation. Names need to be separated into first and last, addresses into street and city, or delimited values into individual fields. The Pandas str.split() method handles all of these scenarios efficiently, letting you split strings across an entire Series or DataFrame column in a single operation.
This guide covers how to use str.split() to produce either lists within a Series or separate DataFrame columns.
Understanding str.split() Syntax
Series.str.split(pat=None, *, n=-1, expand=False, regex=None)
| Parameter | Description | Default |
|---|---|---|
| pat | The delimiter string or regex pattern to split on | Whitespace |
| n | Maximum number of splits per string. -1 means no limit | -1 (all splits) |
| expand | If True, returns a DataFrame with each split in a separate column. If False, returns a Series of lists | False |
Pandas' str.split() is different from Python's built-in str.split(). The Pandas version is accessed through the .str accessor and operates on an entire Series at once, handling NaN values automatically.
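To make the difference concrete, here is a minimal sketch that runs both versions side by side. The sample values are illustrative only and are not part of the DataFrame used later in this guide.

```python
import numpy as np
import pandas as pd

# Built-in str.split() works on one Python string at a time
single = "John Smith".split(" ")
print(single)

# Series.str.split() applies the same split to every element of the Series
names = pd.Series(["John Smith", np.nan, "Bob Williams"])
split_names = names.str.split(" ")
print(split_names)

# The missing entry is passed through instead of raising an error
print(split_names.isna().tolist())
```

Calling the built-in method directly on a NaN value would fail, because a float has no split() method. The .str accessor avoids that check in your own code.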
Sample DataFrame
The examples below use this DataFrame:
import pandas as pd
df = pd.DataFrame({
'Name': ['John Smith', 'Alice Johnson', 'Bob Williams', 'Eve Davis'],
'Team': ['Boston Celtics', 'Portland Trail Blazers', 'Detroit Pistons', 'Atlanta Hawks'],
'Salary': [50000, 65000, 48000, 72000]
})
print(df)
Output:
            Name                    Team  Salary
0     John Smith          Boston Celtics   50000
1  Alice Johnson  Portland Trail Blazers   65000
2   Bob Williams         Detroit Pistons   48000
3      Eve Davis           Atlanta Hawks   72000
Splitting Strings into a List (Series of Lists)
When expand=False (the default), str.split() returns a Series where each element is a list of the split components:
import pandas as pd
df = pd.DataFrame({
'Name': ['John Smith', 'Alice Johnson', 'Bob Williams', 'Eve Davis']
})
# Split each name into a list of parts
name_lists = df['Name'].str.split(' ')
print(name_lists)
print("\nType of each element:", type(name_lists[0]))
Output:
0 [John, Smith]
1 [Alice, Johnson]
2 [Bob, Williams]
3 [Eve, Davis]
Name: Name, dtype: object
Type of each element: <class 'list'>
Each cell now contains a Python list. This is useful when you want to keep all parts together in a single column.
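Once the parts are stored as lists, a couple of follow-up operations become possible. The sketch below assumes a small illustrative Series (one name has a middle name) and shows counting the parts per row and flattening the lists into one row per part.

```python
import pandas as pd

names = pd.Series(["John Smith", "Alice Marie Johnson", "Bob Williams"], name="Name")
name_lists = names.str.split(" ")

# .str.len() also works on list elements, giving the number of parts per row
part_counts = name_lists.str.len()
print(part_counts.tolist())

# explode() turns each list item into its own row, repeating the original index
exploded = name_lists.explode()
print(exploded)
```

The repeated index values in the exploded result let you trace every part back to the row it came from.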
Limiting the Number of Splits
Use the n parameter to control how many splits occur. This is essential when strings have varying numbers of delimiters:
import pandas as pd
df = pd.DataFrame({
'Team': ['Portland Trail Blazers', 'Boston Celtics', 'Golden State Warriors']
})
# Split at most once, produces exactly 2 parts
split_result = df['Team'].str.split(' ', n=1)
print(split_result)
Output:
0 [Portland, Trail Blazers]
1 [Boston, Celtics]
2 [Golden, State Warriors]
Name: Team, dtype: object
With n=1, only the first space is used as a split point. Everything after it stays as a single string.
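If the piece you want to isolate sits at the end of the string rather than the start, the related str.rsplit() method counts splits from the right. This is a brief sketch of the contrast, using the same team names as above:

```python
import pandas as pd

teams = pd.Series(["Portland Trail Blazers", "Boston Celtics", "Golden State Warriors"])

# n=1 from the left: first word, then the rest
from_left = teams.str.split(" ", n=1)

# n=1 from the right: everything before the last word, then the last word
from_right = teams.str.rsplit(" ", n=1)

print(from_left.tolist())
print(from_right.tolist())
```

Which direction fits depends on the data: here the right-hand split keeps multi-word city names such as "Golden State" together.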
Splitting Strings into Separate Columns
Setting expand=True returns a DataFrame with each split part in its own column. This is the most common approach for creating new structured columns from a single string column.
Splitting Names into First and Last Name
import pandas as pd
df = pd.DataFrame({
'Name': ['John Smith', 'Alice Johnson', 'Bob Williams', 'Eve Davis'],
'Salary': [50000, 65000, 48000, 72000]
})
# Split into two columns
name_parts = df['Name'].str.split(' ', n=1, expand=True)
print("Split result:")
print(name_parts)
# Assign to new columns
df['First_Name'] = name_parts[0]
df['Last_Name'] = name_parts[1]
# Drop the original column
df = df.drop(columns=['Name'])
print("\nFinal DataFrame:")
print(df)
Output:
Split result:
0 1
0 John Smith
1 Alice Johnson
2 Bob Williams
3 Eve Davis
Final DataFrame:
Salary First_Name Last_Name
0 50000 John Smith
1 65000 Alice Johnson
2 48000 Bob Williams
3 72000 Eve Davis
The split produces columns numbered 0 and 1, which are then assigned to descriptively named columns.
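As a more compact variant, the two new columns can be assigned in a single statement. This sketch assumes every name splits into exactly two parts, as in the sample data; if the split produced a different number of columns, the assignment would fail.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Alice Johnson", "Bob Williams", "Eve Davis"],
    "Salary": [50000, 65000, 48000, 72000],
})

# Assign both result columns at once; the two-column result is matched
# positionally with the two target names on the left
df[["First_Name", "Last_Name"]] = df["Name"].str.split(" ", n=1, expand=True)

# Drop the original column
df = df.drop(columns=["Name"])
print(df)
```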
Splitting with a Custom Delimiter
You can split on any character or string, not just spaces:
import pandas as pd
df = pd.DataFrame({
'Date': ['2024-01-15', '2024-06-20', '2024-12-31']
})
# Split dates on the hyphen
date_parts = df['Date'].str.split('-', expand=True)
date_parts.columns = ['Year', 'Month', 'Day']
print(date_parts)
Output:
Year Month Day
0 2024 01 15
1 2024 06 20
2 2024 12 31
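The pat parameter also accepts a regular expression, which helps when the delimiter is not consistent. The sketch below uses made-up dates with mixed separators; by default a pattern longer than one character is treated as a regex. It also converts the resulting strings to integers, since split parts are always strings.

```python
import pandas as pd

# Illustrative data: the same date layout with three different separators
dates = pd.Series(["2024-01-15", "2024/06/20", "2024.12.31"])

# The character class matches a hyphen, a slash, or a dot
date_parts = dates.str.split(r"[-/.]", expand=True)
date_parts.columns = ["Year", "Month", "Day"]

# The parts are still strings, so convert them if numbers are needed
date_parts = date_parts.astype(int)
print(date_parts)
```

For real date columns, pd.to_datetime() is usually a better fit than manual splitting; the example only illustrates the regex delimiter.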
Using apply() with str.split() for Custom Logic
For more complex splitting scenarios, use apply() with a custom function that calls Python's built-in str.split() on each value:
import pandas as pd
df = pd.DataFrame({
'Name': ['John Smith', 'Alice Marie Johnson', 'Bob Williams']
})
def split_name(name):
    """Split into first name and everything else as last name."""
    parts = name.split(' ', 1)
    if len(parts) == 2:
        return pd.Series({'First': parts[0], 'Last': parts[1]})
    return pd.Series({'First': parts[0], 'Last': ''})
result = df['Name'].apply(split_name)
print(result)
Output:
   First           Last
0   John          Smith
1  Alice  Marie Johnson
2    Bob       Williams
This approach gives you full control over how splits are handled, including edge cases like names with middle names.
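For this particular rule (first word, then everything else), a vectorized version is also possible and tends to be faster on large columns because it avoids building a Series per row. The sketch below is one way to write it; the single-word entry "Cher" is a hypothetical value added to exercise the empty last-name case.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John Smith", "Alice Marie Johnson", "Cher"]})

# Split once, then replace the missing last name of single-word
# entries with an empty string
result = df["Name"].str.split(" ", n=1, expand=True)
result.columns = ["First", "Last"]
result["Last"] = result["Last"].fillna("")
print(result)
```

Keep apply() for logic that cannot be expressed with the built-in string methods, such as rules that depend on several conditions per value.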
Common Mistake: Uneven Splits Without n Parameter
When strings have different numbers of delimiters and you use expand=True without setting n, the resulting DataFrame gets as many columns as the longest split, and rows with fewer parts are padded with None:
import pandas as pd
df = pd.DataFrame({
'Team': ['Portland Trail Blazers', 'Boston Celtics', 'Atlanta Hawks']
})
# PROBLEMATIC: no limit on splits, different rows produce different numbers of parts
result = df['Team'].str.split(' ', expand=True)
print(result)
Output:
          0        1        2
0  Portland    Trail  Blazers
1    Boston  Celtics     None
2   Atlanta    Hawks     None
Column 2 has None for rows that only split into two parts. This can cause issues in downstream processing.
The correct approach:
Use n to ensure a consistent number of columns:
import pandas as pd
df = pd.DataFrame({
'Team': ['Portland Trail Blazers', 'Boston Celtics', 'Atlanta Hawks']
})
# CORRECT: limit to 1 split, always produces exactly 2 columns
result = df['Team'].str.split(' ', n=1, expand=True)
result.columns = ['City', 'Mascot']
print(result)
Output:
       City         Mascot
0  Portland  Trail Blazers
1    Boston        Celtics
2   Atlanta          Hawks
When splitting strings that have a variable number of delimiters, set the n parameter to cap the number of splits. Without it, expand=True creates one column per part of the longest string and pads shorter rows with None, which can lead to unexpected behavior in subsequent operations.
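Before choosing a value for n, it can help to check how many parts each row would produce. The following sketch does that, and also shows that n is an upper bound only: a row with no delimiter at all still leaves the second column empty. The entry "Heat" is a hypothetical value added for that case.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Portland Trail Blazers", "Boston Celtics", "Atlanta Hawks", "Heat"]
})

# Count how many parts an unlimited split would give for each row
part_counts = df["Team"].str.split(" ").str.len()
print(part_counts.value_counts().sort_index())

# Even with n=1, a row without any delimiter has nothing for column 1
result = df["Team"].str.split(" ", n=1, expand=True)
print(result[1].isna().tolist())
```

If the count check shows rows with a single part, decide up front whether to fill, drop, or flag them.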
Handling NaN Values
str.split() handles missing values (NaN or None) gracefully - they stay missing in the output:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['John Smith', np.nan, 'Bob Williams', None]
})
result = df['Name'].str.split(' ', expand=True)
print(result)
Output:
0 1
0 John Smith
1 NaN NaN
2 Bob Williams
3 None None
You don't need to filter out NaN values before splitting. The .str accessor automatically propagates NaN through string operations, keeping your data aligned.
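Because the index is preserved, the split result can be attached straight back to the original DataFrame. Here is a short sketch of that, along with a way to locate the rows that were missing in the input; the column names First and Last are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["John Smith", np.nan, "Bob Williams"]})

parts = df["Name"].str.split(" ", n=1, expand=True)

# Rows that were missing in the input are missing in every output column
missing_rows = parts.isna().all(axis=1)
print(missing_rows.tolist())

# The index is unchanged, so the result joins back row for row
df = df.join(parts.rename(columns={0: "First", 1: "Last"}))
print(df)
```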
Splitting and Accessing Specific Parts
If you only need one part of the split (e.g., just the first name), you can use .str[index] on the result:
import pandas as pd
df = pd.DataFrame({
'Email': ['john@gmail.com', 'alice@yahoo.com', 'bob@outlook.com']
})
# Extract just the username (part before @)
df['Username'] = df['Email'].str.split('@').str[0]
# Extract just the domain
df['Domain'] = df['Email'].str.split('@').str[1]
print(df)
Output:
Email Username Domain
0 john@gmail.com john gmail.com
1 alice@yahoo.com alice yahoo.com
2 bob@outlook.com bob outlook.com
This avoids creating intermediate DataFrames when you only need specific parts.
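Two related details are worth knowing when indexing into split results: negative indices count from the end of each list, and .str.get() returns a missing value rather than raising when the index is out of range. The sketch below illustrates both; the address with a subdomain is a hypothetical value.

```python
import pandas as pd

df = pd.DataFrame({
    "Email": ["john@gmail.com", "alice@mail.yahoo.com", "bob@outlook.com"]
})

domain = df["Email"].str.split("@").str[1]

# Negative indices count from the end of each list
df["TLD"] = domain.str.split(".").str[-1]

# An index past the end of a list gives a missing value instead of an error
third_label = domain.str.split(".").str.get(2)
print(df)
print(third_label.isna().tolist())
```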
Quick Reference
| Goal | Code | Returns |
|---|---|---|
| Split into lists | df['col'].str.split(' ') | Series of lists |
| Split into columns | df['col'].str.split(' ', expand=True) | DataFrame |
| Limit splits | df['col'].str.split(' ', n=1, expand=True) | DataFrame with at most n+1 columns |
| Get first part only | df['col'].str.split(' ').str[0] | Series |
| Split on custom delimiter | df['col'].str.split('-', expand=True) | DataFrame |
| Custom split logic | df['col'].apply(custom_function) | Series or DataFrame |
The str.split() method is a versatile tool for breaking apart string data in Pandas.
Use expand=False when you want to keep split parts as lists within cells, and expand=True when you need clean, separate columns.
Always set the n parameter when your data has an inconsistent number of delimiters to ensure predictable results.