Python Pandas: How to Use the Pandas get_dummies() Method for One-Hot Encoding

One-hot encoding is a technique for converting categorical data into a numerical format that machine learning algorithms can process. In Pandas, the get_dummies() function handles this conversion automatically, transforming each unique category into its own binary column.

This guide explains how get_dummies() works, covers its key parameters, and demonstrates practical usage patterns including handling missing values and avoiding common pitfalls.

What One-Hot Encoding Does

Categorical variables like "Red", "Blue", and "Green" have no inherent numerical relationship. One-hot encoding converts each category into a separate binary column where 1 indicates the category is present and 0 indicates it is not:

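| Original | Color_Blue | Color_Green | Color_Red |
|----------|------------|-------------|-----------|
| Red      | 0          | 0           | 1         |
| Blue     | 1          | 0           | 0         |
| Green    | 0          | 1           | 0         |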

This representation allows algorithms that require numerical input to work with categorical data without implying any ordinal relationship between categories.
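To make the mechanics concrete, here is a minimal sketch that builds the same indicator columns by hand and compares them with the built-in function. The variable names (colors, manual, auto) are illustrative only, and the hand-rolled version is just one possible way to express the idea, not how Pandas implements it internally.

```python
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green'], name='Color')

# Build one indicator column per category by comparing every row
# against that category (1 where it matches, 0 otherwise).
manual = pd.DataFrame(
    {f'Color_{cat}': (colors == cat).astype(int) for cat in sorted(colors.unique())}
)

# The built-in function should produce the same columns and values.
auto = pd.get_dummies(colors, prefix='Color', dtype=int)

print(manual)
print(auto)
```

Both printed tables should contain the same three columns, with exactly one 1 in each row.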

Syntax and Parameters

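pandas.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
)

| Parameter | Description | Default |
|-----------|-------------|---------|
| data | DataFrame, Series, or array-like to encode | Required |
| prefix | String, list of strings, or dict mapping column names to prefixes, prepended to new column names | None (DataFrame columns use their own name) |
| prefix_sep | Separator between prefix and category value | '_' |
| dummy_na | If True, adds a column for NaN values | False |
| columns | List of columns to encode (DataFrame only) | None (all object, string, and category columns) |
| sparse | If True, the dummy columns are backed by sparse arrays | False |
| drop_first | If True, drops the first category level, leaving k-1 columns for k categories | False |
| dtype | Data type for the new columns | None (bool in pandas 2.0 and later; uint8 in earlier versions) |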

Encoding a DataFrame

When applied to a DataFrame, get_dummies() automatically detects and encodes all columns with object, string, or category dtype. Numeric columns are left unchanged:

import pandas as pd

data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

df_encoded = pd.get_dummies(df)
print("\nEncoded DataFrame:")
print(df_encoded)

Output:

Original DataFrame:
   Color    Size
0    Red   Small
1   Blue   Large
2  Green  Medium
3   Blue   Small
4    Red   Large

Encoded DataFrame:
   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0       False        False       True       False        False        True
1        True        False      False        True        False       False
2       False         True      False       False         True       False
3        True        False      False       False        False        True
4       False        False       True        True        False       False

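Each unique value in Color and Size becomes its own column, and the new columns are sorted alphabetically within each original column. The values are True/False by default in pandas 2.0 and later; earlier versions produced 0/1 stored as uint8.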

Getting 0s and 1s Instead of True/False

Many machine learning workflows expect numeric 0/1 values rather than booleans. Set dtype=int to get 0 and 1:

import pandas as pd

data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}

df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, dtype=int)
print(df_encoded)

Output:

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           1            0           0
2           0            1          0           0            1           0
3           1            0          0           0            0           1
4           0            0          1           1            0           0
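If you already have a frame of boolean dummies, converting it afterwards is another option. The sketch below assumes every column in the frame is a dummy column; with a mixed frame you would convert only the dummy columns, since astype(int) on the whole frame would also try to convert everything else.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})

# Default dummies (boolean in pandas 2.0 and later).
bool_dummies = pd.get_dummies(df)

# Convert afterwards: True becomes 1, False becomes 0.
int_dummies = bool_dummies.astype(int)

# Requesting integers up front should give the same values.
direct = pd.get_dummies(df, dtype=int)

print(int_dummies)
```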

Encoding a Series

get_dummies() also works directly on a Series:

import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
encoded = pd.get_dummies(days, dtype=int)
print(encoded)

Output:

   Friday  Monday  Thursday  Tuesday  Wednesday
0       0       1         0        0          0
1       0       0         0        1          0
2       0       0         0        0          1
3       0       0         1        0          0
4       1       0         0        0          0
5       0       1         0        0          0
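Note that a Series produces bare category names as column labels, with no prefix. If the result will later be joined to other columns, passing prefix can make the labels self-describing. A small sketch, where the prefix 'day' is an arbitrary choice for illustration:

```python
import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Monday'])

# Without a prefix the columns are just 'Monday' and 'Tuesday';
# with prefix='day' each label is prepended with 'day_'.
encoded = pd.get_dummies(days, prefix='day', dtype=int)

print(encoded.columns.tolist())
```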

Encoding Specific Columns Only

When your DataFrame has a mix of categorical and numeric columns, you may want to encode only certain columns. Use the columns parameter:

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green'],
    'Size': ['S', 'M', 'L'],
    'Price': [10.5, 20.0, 15.5]
})

# Encode only the 'Color' column, leave 'Size' and 'Price' unchanged
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded)

Output:

  Size  Price  Color_Blue  Color_Green  Color_Red
0    S   10.5           0            0          1
1    M   20.0           1            0          0
2    L   15.5           0            1          0

Note: The Size column remains as-is because it was not included in columns.
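The columns parameter also works in the other direction: numeric columns are skipped by default, but listing one in columns forces it to be encoded. The sketch below uses a hypothetical integer-coded Rating column to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    'Rating': [1, 3, 2, 3],             # integer codes (hypothetical data)
    'Price': [10.5, 20.0, 15.5, 12.0]
})

# Default behavior: no object/category columns, so nothing is encoded.
default = pd.get_dummies(df, dtype=int)

# Listing 'Rating' explicitly encodes it even though it is numeric.
forced = pd.get_dummies(df, columns=['Rating'], dtype=int)

print(default.columns.tolist())
print(forced.columns.tolist())
```

This is useful when integers are really category codes rather than quantities.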

Using drop_first to Avoid Multicollinearity

In linear models with an intercept, keeping every dummy column creates perfect multicollinearity: the dummy columns for a variable always sum to 1, so any one of them is fully determined by the others. Setting drop_first=True removes the first category from each set of dummies:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

print("Without drop_first:")
print(pd.get_dummies(df, dtype=int))

print("\nWith drop_first=True:")
print(pd.get_dummies(df, drop_first=True, dtype=int))

Output:

Without drop_first:
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           0            0          1

With drop_first=True:
   Color_Green  Color_Red
0            0          1
1            0          0
2            1          0
3            0          1

With drop_first=True, Color_Blue is dropped. A row with all zeros in the remaining columns implicitly represents Blue.

Tip: Use drop_first=True when building linear regression or logistic regression models to avoid the dummy variable trap. For tree-based models (Random Forest, XGBoost), keeping all columns is generally fine.
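Because drop_first removes whichever category comes first, the baseline for plain strings is simply the alphabetically first value. One way to choose the baseline yourself is to convert the column to a categorical with an explicit category order. The sketch below assumes you want Red as the baseline; the order passed to categories is the only thing that changes.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# Declare the category order explicitly. The first listed level is the
# one that drop_first=True removes, so 'Red' becomes the baseline.
df['Color'] = pd.Categorical(df['Color'], categories=['Red', 'Blue', 'Green'])

encoded = pd.get_dummies(df, drop_first=True, dtype=int)
print(encoded)
```

Rows that were Red now appear as all zeros in Color_Blue and Color_Green.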

Handling Missing Values with dummy_na

By default, NaN values are ignored during encoding. Setting dummy_na=True creates a dedicated column for missing values:

import pandas as pd
import numpy as np

colors = pd.Series(['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue'])

print("Without dummy_na (default):")
print(pd.get_dummies(colors, dtype=int))

print("\nWith dummy_na=True:")
print(pd.get_dummies(colors, dummy_na=True, dtype=int))

Output:

Without dummy_na (default):
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    0
4     0      0    1
5     1      0    0

With dummy_na=True:
   Blue  Green  Red  NaN
0     0      0    1    0
1     1      0    0    0
2     0      1    0    0
3     0      0    0    1
4     0      0    1    0
5     1      0    0    0

Without dummy_na, row 3 has all zeros - there is no way to distinguish it from a potential fourth category. With dummy_na=True, the NaN column explicitly marks missing data.
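An alternative some workflows use is to replace missing values with an explicit label before encoding, so the missing-value column gets a readable name. This is a sketch of that approach, not a replacement for dummy_na; the label 'Missing' is an arbitrary choice and must not collide with a real category.

```python
import pandas as pd
import numpy as np

colors = pd.Series(['Red', 'Blue', np.nan, 'Red'])

# Replace NaN with an explicit placeholder label, then encode as usual.
encoded = pd.get_dummies(colors.fillna('Missing'), dtype=int)

print(encoded)
```

Every row now has exactly one 1, including the row that was originally NaN.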

Customizing Column Name Prefixes

Use the prefix parameter to control the naming of generated columns:

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue'],
    'Size': ['S', 'L']
})

encoded = pd.get_dummies(df, prefix=['c', 's'], prefix_sep='-', dtype=int)
print(encoded)

Output:

   c-Blue  c-Red  s-L  s-S
0       0      1    0    1
1       1      0    1    0

The prefix c replaces Color and s replaces Size, with - as the separator.
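A list of prefixes depends on column order, which can be fragile if the DataFrame layout changes. The prefix parameter also accepts a dict that maps column names to prefixes, as in this short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue'],
    'Size': ['S', 'L']
})

# Map each source column to its prefix by name rather than by position.
encoded = pd.get_dummies(df, prefix={'Color': 'c', 'Size': 's'}, dtype=int)

print(encoded.columns.tolist())
```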

Common Mistake: Encoding Numeric Columns Unintentionally

A frequent issue occurs when numeric columns are stored as strings. get_dummies() encodes all object dtype columns, which may include columns that look numeric:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A'],
    'ZipCode': ['10001', '10002', '10001'],  # Stored as strings
    'Value': [100, 200, 150]
})

# WRONG: ZipCode gets encoded because it's stored as a string
encoded = pd.get_dummies(df, dtype=int)
print(encoded)

Output:

   Value  Category_A  Category_B  ZipCode_10001  ZipCode_10002
0    100           1           0              1              0
1    200           0           1              0              1
2    150           1           0              1              0

ZipCode was encoded into dummy columns, which is likely unintended.

The correct approach is to specify explicitly which columns to encode:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A'],
    'ZipCode': ['10001', '10002', '10001'],  # Stored as strings
    'Value': [100, 200, 150]
})

# CORRECT: encode only the intended column
encoded = pd.get_dummies(df, columns=['Category'], dtype=int)
print(encoded)

Output:

  ZipCode  Value  Category_A  Category_B
0   10001    100           1           0
1   10002    200           0           1
2   10001    150           1           0

Warning: Always use the columns parameter to explicitly specify which columns should be encoded, especially when your DataFrame contains string-typed columns that should not be treated as categories (e.g., IDs, zip codes, names).
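One way to decide what belongs in columns is to check how many distinct values each candidate column has before encoding, since identifier-like columns tend to produce one dummy per row. The sketch below is a rough heuristic only: the candidates list and the max_levels threshold are hypothetical choices, and a low count does not by itself prove a column is categorical.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A'],
    'ZipCode': ['10001', '10002', '10003', '10004', '10005'],
    'Value': [100, 200, 150, 175, 125]
})

candidates = ['Category', 'ZipCode']  # columns being considered (hypothetical)
max_levels = 3                        # arbitrary threshold for illustration

# Count distinct values per candidate column.
cardinality = df[candidates].nunique()
print(cardinality)

# Keep only the candidates with few enough distinct values.
to_encode = cardinality[cardinality <= max_levels].index.tolist()

encoded = pd.get_dummies(df, columns=to_encode, dtype=int)
print(encoded.columns.tolist())
```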

Combining Encoded Data with Existing Numeric Columns

A typical workflow keeps numeric columns intact while encoding only categorical ones:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['Sales', 'Engineering', 'Sales'],
    'Salary': [60000, 85000, 62000],
    'Years': [3, 7, 4]
})

# Encode only 'Department', keep everything else
result = pd.get_dummies(df, columns=['Department'], drop_first=True, dtype=int)
print(result)

Output:

      Name  Salary  Years  Department_Sales
0    Alice   60000      3                 1
1      Bob   85000      7                 0
2  Charlie   62000      4                 1
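A related pitfall in this kind of workflow is that get_dummies() only creates columns for the categories present in the data it is given, so two datasets encoded separately can end up with different columns. A minimal sketch of one common remedy, aligning the second frame to the first with reindex, is shown below. The frame names (train, new) and the Marketing category are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({'Department': ['Sales', 'Engineering', 'Sales']})
new = pd.DataFrame({'Department': ['Sales', 'Marketing']})  # 'Marketing' is unseen

train_encoded = pd.get_dummies(train, columns=['Department'], dtype=int)
new_encoded = pd.get_dummies(new, columns=['Department'], dtype=int)

# Align the new data to the training columns: columns missing from the new
# data are added and filled with 0, and columns not seen in training are dropped.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_aligned)
```

After alignment the unseen Marketing row is all zeros, which may or may not be the behavior you want; that choice depends on the model.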

Quick Reference

| Scenario | Parameters |
|----------|------------|
| Basic encoding with 0/1 | pd.get_dummies(df, dtype=int) |
| Encode specific columns | pd.get_dummies(df, columns=['col1'], dtype=int) |
| Avoid multicollinearity | pd.get_dummies(df, drop_first=True, dtype=int) |
| Handle missing values | pd.get_dummies(df, dummy_na=True, dtype=int) |
| Custom column names | pd.get_dummies(df, prefix='p', prefix_sep='_', dtype=int) |

The get_dummies() function is a quick and effective way to prepare categorical data for analysis and modeling.

For most machine learning pipelines, combining it with columns for targeted encoding and drop_first=True for linear models will cover the majority of use cases.