Python Pandas: How to Use the get_dummies() Function for One-Hot Encoding
One-hot encoding is a technique for converting categorical data into a numerical format that machine learning algorithms can process. In Pandas, the get_dummies() function handles this conversion automatically, transforming each unique category into its own binary column.
This guide explains how get_dummies() works, covers its key parameters, and demonstrates practical usage patterns including handling missing values and avoiding common pitfalls.
What One-Hot Encoding Does
Categorical variables like "Red", "Blue", and "Green" have no inherent numerical relationship. One-hot encoding converts each category into a separate binary column where 1 indicates the category is present and 0 indicates it is not:
| Original | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|
| Red | 0 | 0 | 1 |
| Blue | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
This representation allows algorithms that require numerical input to work with categorical data without implying any ordinal relationship between categories.
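To make the idea concrete, here is a minimal sketch that builds the table above by hand, without get_dummies(). It compares the column against each unique value in turn; the variable names (colors, manual) are illustrative only, and this is not how pandas implements the function internally.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green"], name="Color")

# One binary column per category: 1 where the row matches, 0 elsewhere.
# Categories are sorted so the column order matches the table above.
manual = pd.DataFrame(
    {f"Color_{cat}": (colors == cat).astype(int) for cat in sorted(colors.unique())}
)
print(manual)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.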
Syntax and Parameters
pandas.get_dummies(
data,
prefix=None,
prefix_sep='_',
dummy_na=False,
columns=None,
drop_first=False,
dtype=None
)
| Parameter | Description | Default |
|---|---|---|
| data | DataFrame, Series, or array-like to encode | Required |
| prefix | String, list of strings, or dict to prepend to new column names | None (the column name is used for DataFrames) |
| prefix_sep | Separator between prefix and category value | '_' |
| dummy_na | If True, adds a column for NaN values | False |
| columns | List of columns to encode (DataFrame only) | None (all object/category columns) |
| drop_first | If True, drops the first category to avoid multicollinearity | False |
| dtype | Data type for the new columns | None (treated as bool) |
Encoding a DataFrame
When applied to a DataFrame, get_dummies() automatically detects and encodes all columns with object or category dtype:
import pandas as pd
data = {
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
df_encoded = pd.get_dummies(df)
print("\nEncoded DataFrame:")
print(df_encoded)
Output:
Original DataFrame:
Color Size
0 Red Small
1 Blue Large
2 Green Medium
3 Blue Small
4 Red Large
Encoded DataFrame:
Color_Blue Color_Green Color_Red Size_Large Size_Medium Size_Small
0 False False True False False True
1 True False False True False False
2 False True False False True False
3 True False False False False True
4 False False True True False False
Each unique value in Color and Size becomes its own column. In pandas 2.0 and later the values are True/False by default; earlier versions produced 0/1 stored as uint8.
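Boolean dummy columns still behave numerically in arithmetic, so they can be summed or averaged directly. As a small sketch using the same data as above, summing each column counts how often each category occurs:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue", "Red"],
    "Size": ["Small", "Large", "Medium", "Small", "Large"],
})
df_encoded = pd.get_dummies(df)

# True counts as 1 and False as 0, so the column sums are category counts.
counts = df_encoded.sum()
print(counts)
```

This works the same way whether the columns are boolean or integer.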
Getting 0s and 1s Instead of True/False
Many machine learning workflows are easier to inspect and debug with integer 0/1 columns. Set dtype=int to get 0 and 1:
import pandas as pd
data = {
'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}
df = pd.DataFrame(data)
df_encoded = pd.get_dummies(df, dtype=int)
print(df_encoded)
Output:
Color_Blue Color_Green Color_Red Size_Large Size_Medium Size_Small
0 0 0 1 0 0 1
1 1 0 0 1 0 0
2 0 1 0 0 1 0
3 1 0 0 0 0 1
4 0 0 1 1 0 0
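If the data has already been encoded with the default dtype, the columns can also be converted afterwards. The sketch below assumes every column in the encoded frame is a dummy column; with mixed frames you would convert only the dummy columns.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

# Encode with the default dtype, then convert the result to integers.
default_encoded = pd.get_dummies(df)
int_encoded = default_encoded.astype(int)

# Encoding with dtype=int directly should give the same values.
direct = pd.get_dummies(df, dtype=int)
print(int_encoded.values.tolist() == direct.values.tolist())
```

Passing dtype=int up front is usually simpler, since it avoids the intermediate frame.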
Encoding a Series
get_dummies() also works directly on a Series:
import pandas as pd
days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
encoded = pd.get_dummies(days, dtype=int)
print(encoded)
Output:
Friday Monday Thursday Tuesday Wednesday
0 0 1 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 0 0 1 0 0
4 1 0 0 0 0
5 0 1 0 0 0
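A Series has no column name to use as a prefix, so the new columns are named after the bare category values. If the result will be joined to other data, passing prefix keeps the names unambiguous. A short sketch, where the prefix "day" is an arbitrary choice for illustration:

```python
import pandas as pd

days = pd.Series(["Monday", "Tuesday", "Friday", "Monday"])

# prefix is combined with each category using prefix_sep ('_' by default).
encoded = pd.get_dummies(days, prefix="day", dtype=int)
print(encoded)
```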
Encoding Specific Columns Only
When your DataFrame has a mix of categorical and numeric columns, you may want to encode only certain columns. Use the columns parameter:
import pandas as pd
df = pd.DataFrame({
'Color': ['Red', 'Blue', 'Green'],
'Size': ['S', 'M', 'L'],
'Price': [10.5, 20.0, 15.5]
})
# Encode only the 'Color' column, leave 'Size' and 'Price' unchanged
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded)
Output:
Size Price Color_Blue Color_Green Color_Red
0 S 10.5 0 0 1
1 M 20.0 1 0 0
2 L 15.5 0 1 0
The Size column remains as-is because it was not included in columns.
Using drop_first to Avoid Multicollinearity
In linear models, keeping every dummy column creates perfect multicollinearity: each row's dummies sum to 1, so any one column is fully determined by the others. Setting drop_first=True removes the first category from each set of dummies:
import pandas as pd
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
print("Without drop_first:")
print(pd.get_dummies(df, dtype=int))
print("\nWith drop_first=True:")
print(pd.get_dummies(df, drop_first=True, dtype=int))
Output:
Without drop_first:
Color_Blue Color_Green Color_Red
0 0 0 1
1 1 0 0
2 0 1 0
3 0 0 1
With drop_first=True:
Color_Green Color_Red
0 0 1
1 0 0
2 1 0
3 0 1
With drop_first=True, Color_Blue is dropped. A row with all zeros in the remaining columns implicitly represents Blue.
Use drop_first=True when building linear regression or logistic regression models to avoid the dummy variable trap. For tree-based models (Random Forest, XGBoost), keeping all columns is generally fine.
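The dummy variable trap can be checked numerically. The sketch below prepends a column of ones, as a linear model with an intercept would, and compares the matrix rank with the number of columns. The helper with_intercept is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

full = pd.get_dummies(df, dtype=int)
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

def with_intercept(dummies):
    """Prepend a column of ones to mimic a model's intercept term."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy(dtype=float)])

X_full = with_intercept(full)
X_reduced = with_intercept(reduced)

# A rank lower than the column count means the columns are linearly dependent.
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```

With all three dummies plus the intercept, one column is redundant; after drop_first=True the design matrix has full column rank.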
Handling Missing Values with dummy_na
By default, NaN values get no column of their own, so a missing value is encoded as a row of all zeros. Setting dummy_na=True creates a dedicated column for missing values:
import pandas as pd
import numpy as np
colors = pd.Series(['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue'])
print("Without dummy_na (default):")
print(pd.get_dummies(colors, dtype=int))
print("\nWith dummy_na=True:")
print(pd.get_dummies(colors, dummy_na=True, dtype=int))
Output:
Without dummy_na (default):
Blue Green Red
0 0 0 1
1 1 0 0
2 0 1 0
3 0 0 0
4 0 0 1
5 1 0 0
With dummy_na=True:
Blue Green Red NaN
0 0 0 1 0
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
4 0 0 1 0
5 1 0 0 0
Without dummy_na, row 3 is all zeros, so the encoded data alone cannot tell you that the original value was missing. With dummy_na=True, the NaN column explicitly marks missing data.
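One way to spot silently dropped missing values is to sum across each encoded row. This is a minimal sketch with a shortened version of the data above; a row sum of 0 means the row carries no category at all.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", np.nan, "Red"])

without_na = pd.get_dummies(colors, dtype=int)
with_na = pd.get_dummies(colors, dummy_na=True, dtype=int)

# Row sums: 0 flags a row whose value was missing and left unencoded.
print(without_na.sum(axis=1).tolist())  # [1, 1, 0, 1]
print(with_na.sum(axis=1).tolist())     # [1, 1, 1, 1]

# Cross-check against the number of missing values in the source data.
n_missing = int(colors.isna().sum())
print(n_missing)                        # 1
```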
Customizing Column Name Prefixes
Use the prefix parameter to control the naming of generated columns:
import pandas as pd
df = pd.DataFrame({
'Color': ['Red', 'Blue'],
'Size': ['S', 'L']
})
encoded = pd.get_dummies(df, prefix=['c', 's'], prefix_sep='-', dtype=int)
print(encoded)
Output:
c-Blue c-Red s-L s-S
0 0 1 0 1
1 1 0 1 0
The prefix c replaces Color and s replaces Size, with - as the separator.
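A list of prefixes depends on column order, which is easy to get wrong in wider DataFrames. The prefix parameter also accepts a dict that maps each column name to its prefix, as in this sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue"],
    "Size": ["S", "L"],
})

# Map each source column to its prefix explicitly instead of relying on order.
encoded = pd.get_dummies(df, prefix={"Color": "c", "Size": "s"}, dtype=int)
print(list(encoded.columns))  # ['c_Blue', 'c_Red', 's_L', 's_S']
```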
Common Mistake: Encoding Numeric Columns Unintentionally
A frequent issue occurs when numeric columns are stored as strings. get_dummies() encodes all object dtype columns, which may include columns that look numeric:
import pandas as pd
df = pd.DataFrame({
'Category': ['A', 'B', 'A'],
'ZipCode': ['10001', '10002', '10001'], # Stored as strings
'Value': [100, 200, 150]
})
# WRONG: ZipCode gets encoded because it's stored as a string
encoded = pd.get_dummies(df, dtype=int)
print(encoded)
Output:
Value Category_A Category_B ZipCode_10001 ZipCode_10002
0 100 1 0 1 0
1 200 0 1 0 1
2 150 1 0 1 0
ZipCode was encoded into dummy columns, which is likely unintended.
**The correct approach:** explicitly specify which columns to encode:
import pandas as pd
df = pd.DataFrame({
'Category': ['A', 'B', 'A'],
'ZipCode': ['10001', '10002', '10001'], # Stored as strings
'Value': [100, 200, 150]
})
# CORRECT: encode only the intended column
encoded = pd.get_dummies(df, columns=['Category'], dtype=int)
print(encoded)
Output:
ZipCode Value Category_A Category_B
0 10001 100 1 0
1 10002 200 0 1
2 10001 150 1 0
Always use the columns parameter to explicitly specify which columns should be encoded, especially when your DataFrame contains string-typed columns that should not be treated as categories (e.g., IDs, zip codes, names).
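Before encoding, it can help to list the non-numeric columns, since these include the string columns that get_dummies() would pick up by default. The sketch below uses select_dtypes for this; the names candidates and to_encode are illustrative, and the final choice of columns remains a manual decision.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A"],
    "ZipCode": ["10001", "10002", "10001"],  # stored as strings
    "Value": [100, 200, 150],
})

# List every non-numeric column as a candidate for review.
candidates = df.select_dtypes(exclude="number").columns.tolist()
print(candidates)  # ['Category', 'ZipCode']

# Unique-value counts help spot identifier-like columns with many levels.
print(df[candidates].nunique())

# Encode only the columns confirmed to be real categories.
to_encode = ["Category"]
encoded = pd.get_dummies(df, columns=to_encode, dtype=int)
print(list(encoded.columns))
```

On real data, a column whose unique count approaches the number of rows is usually an identifier rather than a category.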
Combining Encoded Data with Existing Numeric Columns
A typical workflow keeps numeric columns intact while encoding only categorical ones:
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Department': ['Sales', 'Engineering', 'Sales'],
'Salary': [60000, 85000, 62000],
'Years': [3, 7, 4]
})
# Encode only 'Department', keep everything else
result = pd.get_dummies(df, columns=['Department'], drop_first=True, dtype=int)
print(result)
Output:
Name Salary Years Department_Sales
0 Alice 60000 3 1
1 Bob 85000 7 0
2 Charlie 62000 4 1
Quick Reference
| Scenario | Parameters |
|---|---|
| Basic encoding with 0/1 | pd.get_dummies(df, dtype=int) |
| Encode specific columns | pd.get_dummies(df, columns=['col1'], dtype=int) |
| Avoid multicollinearity | pd.get_dummies(df, drop_first=True, dtype=int) |
| Handle missing values | pd.get_dummies(df, dummy_na=True, dtype=int) |
| Custom column names | pd.get_dummies(df, prefix='p', prefix_sep='_', dtype=int) |
The get_dummies() function is a quick and effective way to prepare categorical data for analysis and modeling.
For most machine learning pipelines, combining it with columns for targeted encoding and drop_first=True for linear models will cover the majority of use cases.