Python Pandas: How to Shuffle DataFrame Rows in Pandas

Shuffling the rows of a DataFrame is a common operation in data science and machine learning workflows, especially when preparing data for model training, randomizing survey results, or breaking any inherent ordering in a dataset. Pandas and NumPy provide several ways to randomly reorder rows, each with different trade-offs in simplicity, performance, and flexibility.

This guide covers four methods with clear examples, outputs, and explanations.

Sample DataFrame

All examples in this guide use the following DataFrame:

import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)
print(df)

Output:

    A  B
0   1  a
1   2  b
2   3  c
3   4  d
4   5  e
5   6  f
6   7  g
7   8  h
8   9  i
9  10  j

Using sample()

The simplest and most idiomatic way to shuffle rows in Pandas is the sample() method. Setting frac=1 tells Pandas to return 100% of the rows in a random order:

import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

shuffled = df.sample(frac=1).reset_index(drop=True)
print(shuffled)

Output:

    A  B
0   3  c
1   6  f
2   1  a
3   8  h
4   5  e
5   7  g
6   4  d
7   9  i
8  10  j
9   2  b

  • sample(frac=1) randomly selects all rows, effectively shuffling them.
  • reset_index(drop=True) replaces the old index with a clean 0-based sequence. Without it, the original index values are preserved (e.g., the row with A = 8 would still have index 7).

tip

For reproducible shuffles (useful in testing and machine learning), pass a random_state parameter:

shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

Using the same random_state value always produces the same shuffled order.
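
As a quick check of that claim, the sketch below shuffles the sample DataFrame twice with the same seed and compares the results. The seed value 42 is arbitrary, and the variable names first and second are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': list('abcdefghij')})

# Two independent shuffles that share the same seed (42 is an arbitrary choice)
first = df.sample(frac=1, random_state=42).reset_index(drop=True)
second = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Both calls return the rows in the same order
print(first.equals(second))  # True
```

Omitting random_state would normally give a different order on each call.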

Why Reset the Index?

If you skip reset_index(drop=True), the shuffled DataFrame retains the original index values:

import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

shuffled_no_reset = df.sample(frac=1)
print(shuffled_no_reset)

Output:

    A  B
6   7  g
4   5  e
5   6  f
0   1  a
8   9  i
3   4  d
9  10  j
2   3  c
1   2  b
7   8  h

Notice that the index values are out of order (6, 4, 5, 0, ...). This can cause confusion when you later select rows by label with loc or align the result with other data, because the labels no longer match the row positions. Reset the index after shuffling unless you specifically need to track the original row positions.
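
The following sketch illustrates both sides of that trade-off under the same sample data: label-based and position-based lookups diverge after a shuffle, but the retained labels make it possible to undo the shuffle. The seed 0 and the variable names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': list('abcdefghij')})

# Shuffle but keep the original index labels (seed 0 is arbitrary)
shuffled_no_reset = df.sample(frac=1, random_state=0)

# iloc is positional: whichever row happens to be first after the shuffle
first_by_position = shuffled_no_reset.iloc[0]

# loc is label-based: always the row whose original index label is 0
first_by_label = shuffled_no_reset.loc[0]
print(first_by_label['A'])  # 1

# Because the original labels survived, sorting by index restores the order
restored = shuffled_no_reset.sort_index()
print(restored['A'].tolist() == df['A'].tolist())  # True
```

If you never need to map rows back to their original positions, resetting the index remains the simpler option.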

Using numpy.random.permutation

This method generates a permuted array of row indices using NumPy and then reorders the DataFrame with iloc:

import numpy as np
import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

permuted_indices = np.random.permutation(len(df))
shuffled = df.iloc[permuted_indices].reset_index(drop=True)
print(shuffled)

Output:

    A  B
0  10  j
1   4  d
2   8  h
3   9  i
4   7  g
5   1  a
6   6  f
7   5  e
8   3  c
9   2  b

note

np.random.permutation(len(df)) returns an array like [0, 8, 6, 2, 9, 3, 7, 5, 4, 1] (a random reordering of indices 0–9). The iloc accessor then selects rows in that order.

This approach is useful when you are already working with NumPy or need the permuted indices for other purposes (e.g., shuffling a separate array in the same order).
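
A minimal sketch of that second use case follows. The labels array is hypothetical (it is not part of the original example) and is built so that each label is ten times the value in column A, which makes the alignment easy to verify.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': list('abcdefghij')})

# Hypothetical companion array: one entry per row of df (label = A * 10)
labels = np.array([value * 10 for value in df['A']])

# One permutation, reused for both objects
permuted_indices = np.random.permutation(len(df))

shuffled_df = df.iloc[permuted_indices].reset_index(drop=True)
shuffled_labels = labels[permuted_indices]

# Every label still lines up with its row
print((shuffled_df['A'].to_numpy() * 10 == shuffled_labels).all())  # True
```

Reusing a single index array is what keeps the two objects in step; shuffling each one separately would break the pairing.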

Using numpy.random.shuffle with Index

This method extracts the DataFrame's index as a list, shuffles it in place, and then uses loc to reorder the rows:

import numpy as np
import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

idx = df.index.to_list()
np.random.shuffle(idx)

shuffled = df.loc[idx].reset_index(drop=True)
print(shuffled)

Output:

    A  B
0   8  h
1   5  e
2  10  j
3   7  g
4   9  i
5   4  d
6   2  b
7   1  a
8   6  f
9   3  c

The key difference from the previous method is that np.random.shuffle() modifies the list in place and returns None, so you cannot chain it into a single expression.
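
Because this method shuffles index labels rather than positions, it should also work when the index is not the default 0-based range. The sketch below assumes a hypothetical DataFrame (df_labeled) with unique string labels; with duplicate labels, loc would return repeated rows, so uniqueness is an assumption here.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with unique string labels as its index
df_labeled = pd.DataFrame(
    {'A': [1, 2, 3, 4]},
    index=['w', 'x', 'y', 'z'],
)

idx = df_labeled.index.to_list()
result = np.random.shuffle(idx)  # shuffles idx in place

print(result)  # None

# loc looks rows up by label, so each label keeps its own row
shuffled = df_labeled.loc[idx]
print(shuffled)
```

Skipping reset_index here keeps the string labels attached to their rows, which is usually what you want with a meaningful index.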

warning

A common mistake is trying to use the return value of np.random.shuffle():

# WRONG: np.random.shuffle returns None, not the shuffled list
shuffled = df.loc[np.random.shuffle(df.index.to_list())]
# Fails because loc receives None (recent pandas versions raise KeyError: None)

The correct approach is to shuffle the list first, then pass it to loc:

import numpy as np
import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

idx = df.index.to_list()
np.random.shuffle(idx) # Modifies idx in place
shuffled = df.loc[idx].reset_index(drop=True)

print(shuffled)

Output:

    A  B
0  10  j
1   1  a
2   6  f
3   4  d
4   3  c
5   8  h
6   5  e
7   2  b
8   9  i
9   7  g

Using sort_values with a Random Column

This approach assigns a column of random numbers to the DataFrame, sorts by that column, and then drops it. While less efficient due to the sorting step, it can be useful when you want to apply weighted or semi-random ordering:

import numpy as np
import pandas as pd

data = {'A': range(1, 11), 'B': list('abcdefghij')}
df = pd.DataFrame(data)

shuffled = (
    df.assign(rand_key=np.random.rand(len(df)))
    .sort_values('rand_key')
    .drop('rand_key', axis=1)
    .reset_index(drop=True)
)
print(shuffled)

Output:

    A  B
0   5  e
1   7  g
2   4  d
3   1  a
4   8  h
5   6  f
6   9  i
7   3  c
8   2  b
9  10  j

  • assign(rand_key=np.random.rand(len(df))) adds a temporary column of random floats between 0 and 1.
  • sort_values('rand_key') sorts the DataFrame by those random values, effectively shuffling it.
  • drop('rand_key', axis=1) removes the temporary column.

This method has O(n log n) complexity due to sorting, compared to O(n) for the sample() approach. Use it only when you have a specific reason to sort rather than sample.
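
One possible reading of "semi-random ordering" is shuffling rows within groups while keeping the groups themselves in a fixed order. The sketch below shows that idea by sorting on a group column first and the random key second. The df_groups DataFrame and its group and value columns are hypothetical and not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'group' and 'value' are illustrative column names
df_groups = pd.DataFrame({
    'group': ['x', 'x', 'x', 'y', 'y', 'y'],
    'value': [1, 2, 3, 4, 5, 6],
})

semi_shuffled = (
    df_groups.assign(rand_key=np.random.rand(len(df_groups)))
    # Group order stays fixed; order inside each group follows the random key
    .sort_values(['group', 'rand_key'])
    .drop('rand_key', axis=1)
    .reset_index(drop=True)
)
print(semi_shuffled)
```

This kind of partial ordering is awkward to express with sample() alone, which is where the sorting approach may earn its extra cost.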

Method Comparison

| Method | Complexity | Simplicity | Reproducible | Best For |
| --- | --- | --- | --- | --- |
| df.sample(frac=1) | O(n) | ★★★★★ | Yes (random_state) | General use (recommended) |
| np.random.permutation | O(n) | ★★★★ | Yes (np.random.seed) | When you need the index array |
| np.random.shuffle | O(n) | ★★★ | Yes (np.random.seed) | In-place index shuffling |
| sort_values with random column | O(n log n) | ★★★ | Yes (np.random.seed) | Weighted/semi-random ordering |

tip

For most use cases, df.sample(frac=1) is the best choice. It is concise, fast, built into Pandas, and supports reproducibility through the random_state parameter. Reserve the other methods for scenarios where you need more control over the shuffling mechanism.
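
For the NumPy-based methods, the np.random.seed entries in the comparison table can be sketched as follows. The seed value 42 is arbitrary. The last lines show np.random.default_rng, a Generator-based alternative that NumPy's documentation recommends for new code; it is included here as an option, not as something the methods above require.

```python
import numpy as np

np.random.seed(42)  # 42 is an arbitrary seed
first = np.random.permutation(10)

np.random.seed(42)  # reseed to replay the same sequence
second = np.random.permutation(10)

print(np.array_equal(first, second))  # True

# Alternative: a local Generator keeps the seed out of NumPy's global state
rng = np.random.default_rng(42)
third = rng.permutation(10)
```

Note that np.random.seed changes global state, so any other code drawing from np.random between the seed call and the shuffle will alter the result.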