Python Pandas: How to Use the get_dummies() Function for One-Hot Encoding

One-hot encoding is a technique for converting categorical data into a numerical format that machine learning algorithms can process. In Pandas, the get_dummies() function handles this conversion automatically, transforming each unique category into its own binary column.

This guide explains how get_dummies() works, covers its key parameters, and demonstrates practical usage patterns including handling missing values and avoiding common pitfalls.

What One-Hot Encoding Does

Categorical variables like "Red", "Blue", and "Green" have no inherent numerical relationship. One-hot encoding converts each category into a separate binary column where 1 indicates the category is present and 0 indicates it is not:

Original	Color_Blue	Color_Green	Color_Red
Red	0	0	1
Blue	1	0	0
Green	0	1	0

This representation allows algorithms that require numerical input to work with categorical data without implying any ordinal relationship between categories.
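As a minimal sketch of the idea, the table above can be rebuilt by hand by comparing each value against every unique category. The variable names `colors` and `manual` are chosen only for this illustration, and the final comparison simply checks that the manual result agrees with get_dummies():

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green"], name="Color")

# Manual one-hot encoding: one 0/1 column per unique category,
# built by comparing every value against that category.
manual = pd.DataFrame(
    {f"Color_{c}": (colors == c).astype(int) for c in sorted(colors.unique())}
)

# The same encoding produced by get_dummies()
encoded = pd.get_dummies(colors, prefix="Color", dtype=int)

print(manual)
print(manual.to_numpy().tolist() == encoded.to_numpy().tolist())
```

The manual version is shown only to make the transformation concrete; in practice get_dummies() does this work for you.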

Syntax and Parameters

pandas.get_dummies(
    data,
    prefix=None,
    prefix_sep='_',
    dummy_na=False,
    columns=None,
    sparse=False,
    drop_first=False,
    dtype=None
)

Parameter	Description	Default
`data`	DataFrame, Series, or array-like to encode	Required
`prefix`	String, list of strings, or dict of strings to prepend to new column names	`None` (uses the source column name for DataFrame input)
`prefix_sep`	Separator between prefix and category value	`'_'`
`dummy_na`	If `True`, adds a column for `NaN` values	`False`
`columns`	List of columns to encode (DataFrame only)	`None` (all object, string, and category columns)
`sparse`	If `True`, backs the new columns with a sparse array instead of a regular NumPy array	`False`
`drop_first`	If `True`, drops the first category level, producing k-1 columns from k categories	`False`
`dtype`	Data type for the new columns	`None` (`bool` in pandas 2.0 and later)

Encoding a DataFrame

When applied to a DataFrame without the columns argument, get_dummies() automatically detects and encodes all columns with object, string, or category dtype, and passes the remaining columns through unchanged:

import pandas as pd

data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

df_encoded = pd.get_dummies(df)
print("\nEncoded DataFrame:")
print(df_encoded)

Output:

Original DataFrame:
   Color    Size
0    Red   Small
1   Blue   Large
2  Green  Medium
3   Blue   Small
4    Red   Large

Encoded DataFrame:
   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0       False        False       True       False        False        True
1        True        False      False        True        False       False
2       False         True      False       False         True       False
3        True        False      False       False        False        True
4       False        False       True        True        False       False

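Each unique value in Color and Size becomes its own column, named with the source column as a prefix. In pandas 2.0 and later the values are True/False by default; earlier versions returned 0/1 as uint8.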

Getting 0s and 1s Instead of True/False

Many machine learning workflows expect numeric 0/1 features rather than booleans. Set dtype=int to get 0 and 1:

import pandas as pd

data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large']
}

df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, dtype=int)
print(df_encoded)

Output:

   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0           0            0          1           0            0           1
1           1            0          0           1            0           0
2           0            1          0           0            1           0
3           1            0          0           0            0           1
4           0            0          1           1            0           0
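If the encoding has already been done with the default dtype, the result can also be cast afterwards. The following is a small sketch of that alternative; it assumes the DataFrame contains only dummy columns, since astype(int) would otherwise try to convert every other column as well:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green"],
    "Size": ["Small", "Large", "Medium"],
})

# Default call: the dtype of the dummy columns depends on the pandas version
default_encoded = pd.get_dummies(df)

# Cast every column to int after encoding. This is safe here only because
# the result contains nothing but dummy columns.
int_encoded = default_encoded.astype(int)

# Requesting integers up front gives the same values
direct = pd.get_dummies(df, dtype=int)
print(int_encoded.to_numpy().tolist() == direct.to_numpy().tolist())
```

Passing dtype=int directly is usually the simpler choice, because it leaves any non-encoded columns untouched.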

Encoding a Series

get_dummies() also works directly on a Series:

import pandas as pd

days = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Monday'])
encoded = pd.get_dummies(days, dtype=int)
print(encoded)

Output:

   Friday  Monday  Thursday  Tuesday  Wednesday
0       0       1         0        0          0
1       0       0         0        1          0
2       0       0         0        0          1
3       0       0         1        0          0
4       1       0         0        0          0
5       0       1         0        0          0
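Because a bare Series produces column names without a prefix, it can help to pass prefix when the dummies will be joined back to other data. The sketch below uses a hypothetical DataFrame with `Day` and `Sales` columns to show one way of doing that with pd.concat:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Monday"],
    "Sales": [120, 95, 130],
})

# Encode the Series on its own; prefix supplies the column-name prefix
# that a bare Series would otherwise lack
day_dummies = pd.get_dummies(df["Day"], prefix="Day", dtype=int)

# Replace the original column with its dummy columns
result = pd.concat([df.drop(columns="Day"), day_dummies], axis=1)
print(result)
```

For a single DataFrame, pd.get_dummies(df, columns=['Day'], dtype=int) should give the same columns in one step; the Series form is mainly useful when the categorical values live outside the DataFrame.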

Encoding Specific Columns Only

When your DataFrame has a mix of categorical and numeric columns, you may want to encode only certain columns. Use the columns parameter:

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green'],
    'Size': ['S', 'M', 'L'],
    'Price': [10.5, 20.0, 15.5]
})

# Encode only the 'Color' column, leave 'Size' and 'Price' unchanged
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded)

Output:

  Size  Price  Color_Blue  Color_Green  Color_Red
0    S   10.5           0            0          1
1    M   20.0           1            0          0
2    L   15.5           0            1          0

Note: The Size column remains as-is because it was not included in columns.
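The columns parameter also works in the other direction: it can force encoding of a column that get_dummies() would skip by default, such as integer codes that really represent categories. The `Rating` and `Price` columns below are hypothetical and serve only as an illustration:

```python
import pandas as pd

# Hypothetical data: 'Rating' is stored as integers but represents categories
df = pd.DataFrame({
    "Rating": [1, 3, 2, 3],
    "Price": [10.5, 20.0, 15.5, 12.0],
})

# Without columns, nothing is encoded because both columns are numeric
unchanged = pd.get_dummies(df, dtype=int)
print(list(unchanged.columns))

# Naming the column explicitly forces it to be treated as categorical
encoded = pd.get_dummies(df, columns=["Rating"], dtype=int)
print(encoded)
```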

Using `drop_first` to Avoid Multicollinearity

In linear models that include an intercept, keeping every dummy column creates perfect multicollinearity: the dummy columns for a variable always sum to 1, so any one of them is exactly predictable from the others. Setting drop_first=True removes the first category from each set of dummies:

import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

print("Without drop_first:")
print(pd.get_dummies(df, dtype=int))

print("\nWith drop_first=True:")
print(pd.get_dummies(df, drop_first=True, dtype=int))

Output:

Without drop_first:
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           0            0          1

With drop_first=True:
   Color_Green  Color_Red
0            0          1
1            0          0
2            1          0
3            0          1

With drop_first=True, Color_Blue is dropped. A row with all zeros in the remaining columns implicitly represents Blue.

Tip: Use drop_first=True when building linear regression or logistic regression models to avoid the dummy variable trap. For tree-based models (Random Forest, XGBoost), keeping all columns is generally fine.
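The redundancy can be checked numerically. The sketch below builds a design matrix with an intercept column, as a linear model typically would, and compares its rank with its column count. The helper `design_matrix` is a hypothetical name introduced for this example and is not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(df, dtype=int))
reduced = design_matrix(pd.get_dummies(df, drop_first=True, dtype=int))

# A matrix has full column rank when its rank equals its number of columns
print(full.shape[1], np.linalg.matrix_rank(full))        # 4 columns, rank 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 columns, rank 3
```

With all three dummy columns plus the intercept, one column is redundant; after drop_first=True the matrix has full column rank.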

Handling Missing Values with `dummy_na`

By default, NaN values get no column of their own, so a row with a missing value is encoded as all zeros. Setting dummy_na=True creates a dedicated column for missing values:

import pandas as pd
import numpy as np

colors = pd.Series(['Red', 'Blue', 'Green', np.nan, 'Red', 'Blue'])

print("Without dummy_na (default):")
print(pd.get_dummies(colors, dtype=int))

print("\nWith dummy_na=True:")
print(pd.get_dummies(colors, dummy_na=True, dtype=int))

Output:

Without dummy_na (default):
   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    0
4     0      0    1
5     1      0    0

With dummy_na=True:
   Blue  Green  Red  NaN
0     0      0    1    0
1     1      0    0    0
2     0      1    0    0
3     0      0    0    1
4     0      0    1    0
5     1      0    0    0

Without dummy_na, row 3 is all zeros, which makes the missing value indistinguishable from a category that has no column of its own. With dummy_na=True, the NaN column explicitly marks missing data.
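An alternative that some workflows prefer is to replace missing values with an explicit label before encoding, so the resulting column has an ordinary string name. This is a sketch of that approach; the label "Unknown" is an arbitrary placeholder chosen for the example:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", np.nan, "Red", "Blue"])

# Replace missing values with an explicit label before encoding.
# "Unknown" is an arbitrary placeholder chosen for this sketch.
filled = colors.fillna("Unknown")

encoded = pd.get_dummies(filled, dtype=int)
print(encoded)
```

The missing row is now marked in an `Unknown` column instead of a column named NaN, which can be easier to reference later by name.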

Customizing Column Name Prefixes

Use the prefix parameter to control the naming of generated columns:

import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue'],
    'Size': ['S', 'L']
})

encoded = pd.get_dummies(df, prefix=['c', 's'], prefix_sep='-', dtype=int)
print(encoded)

Output:

   c-Blue  c-Red  s-L  s-S
0       0      1    0    1
1       1      0    1    0

The prefix c replaces Color and s replaces Size, with - as the separator.
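A list of prefixes depends on the order of the encoded columns. As a sketch of a less order-sensitive variant, prefix also accepts a dict that maps each column name to its prefix:

```python
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue"],
    "Size": ["S", "L"],
})

# Map each source column to its prefix explicitly
encoded = pd.get_dummies(
    df,
    prefix={"Color": "c", "Size": "s"},
    prefix_sep="-",
    dtype=int,
)
print(list(encoded.columns))
```

This should produce the same column names as the list form above, while making it explicit which prefix belongs to which column.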

Common Mistake: Encoding Numeric Columns Unintentionally

A frequent issue occurs when numeric columns are stored as strings. get_dummies() encodes all object dtype columns, which may include columns that look numeric:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A'],
    'ZipCode': ['10001', '10002', '10001'],  # Stored as strings
    'Value': [100, 200, 150]
})

# WRONG: ZipCode gets encoded because it's stored as a string
encoded = pd.get_dummies(df, dtype=int)
print(encoded)

Output:

   Value  Category_A  Category_B  ZipCode_10001  ZipCode_10002
0    100           1           0              1              0
1    200           0           1              0              1
2    150           1           0              1              0

ZipCode was encoded into dummy columns, which is likely unintended.

The correct approach is to explicitly specify which columns to encode:

import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A'],
    'ZipCode': ['10001', '10002', '10001'],  # Stored as strings
    'Value': [100, 200, 150]
})

# CORRECT: encode only the intended column
encoded = pd.get_dummies(df, columns=['Category'], dtype=int)
print(encoded)

Output:

  ZipCode  Value  Category_A  Category_B
0   10001    100           1           0
1   10002    200           0           1
2   10001    150           1           0

Warning: Always use the columns parameter to explicitly specify which columns should be encoded, especially when your DataFrame contains string-typed columns that should not be treated as categories (e.g., IDs, zip codes, names).
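One way to decide which columns belong in that list is to inspect the non-numeric columns and their number of unique values before encoding. The sketch below is a rough check under that assumption, not a complete rule: select_dtypes(exclude="number") is only an approximation of what get_dummies() would pick up, since it also returns boolean or datetime columns if any are present.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A"],
    "ZipCode": ["10001", "10002", "10001"],  # Stored as strings
    "Value": [100, 200, 150],
})

# Non-numeric columns; for this DataFrame these are the columns
# get_dummies() would encode by default
candidates = df.select_dtypes(exclude="number")

# Number of dummy columns each candidate would produce
print(candidates.nunique())
```

Columns with many unique values, such as identifiers, stand out in this summary and are usually the ones to leave out of columns.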

Combining Encoded Data with Existing Numeric Columns

A typical workflow keeps numeric columns intact while encoding only categorical ones:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['Sales', 'Engineering', 'Sales'],
    'Salary': [60000, 85000, 62000],
    'Years': [3, 7, 4]
})

# Encode only 'Department', keep everything else
result = pd.get_dummies(df, columns=['Department'], drop_first=True, dtype=int)
print(result)

Output:

      Name  Salary  Years  Department_Sales
0    Alice   60000      3                 1
1      Bob   85000      7                 0
2  Charlie   62000      4                 1

Quick Reference

Scenario	Parameters
Basic encoding with 0/1	`pd.get_dummies(df, dtype=int)`
Encode specific columns	`pd.get_dummies(df, columns=['col1'], dtype=int)`
Avoid multicollinearity	`pd.get_dummies(df, drop_first=True, dtype=int)`
Handle missing values	`pd.get_dummies(df, dummy_na=True, dtype=int)`
Custom column names	`pd.get_dummies(df, prefix='p', prefix_sep='_', dtype=int)`

The get_dummies() function is a quick and effective way to prepare categorical data for analysis and modeling.

For most machine learning pipelines, combining it with columns for targeted encoding and drop_first=True for linear models will cover the majority of use cases.
