How to Find the Median of a List in Python
The median represents the "middle" value in a dataset and provides a robust measure of central tendency that is less affected by outliers than the mean. For example, in the dataset [1, 2, 3, 4, 1000], the mean is 202 but the median is 3, which better represents the typical value.
In this guide, you will learn how to calculate the median using Python's standard library, a manual implementation, and NumPy for large datasets. Each approach is explained with clear examples and output so you can choose the right one for your situation.
Using the statistics Module
The statistics module from Python's standard library provides the most straightforward solution. It handles sorting automatically and correctly manages both odd-length and even-length lists:
import statistics
# Odd-length list: the middle value is returned directly
data = [10, 2, 38, 23, 38]
print(statistics.median(data))
# Even-length list: the average of the two middle values is returned
even_data = [10, 2, 38, 23]
print(statistics.median(even_data))
Output:
23
16.5
For the odd-length list [10, 2, 38, 23, 38], sorting gives [2, 10, 23, 38, 38], and the middle element is 23. For the even-length list [10, 2, 38, 23], sorting gives [2, 10, 23, 38], and the median is the average of the two middle values: (10 + 23) / 2 = 16.5.
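Two properties of statistics.median() are easy to check for yourself: it works on a sorted copy, so the input list keeps its original order, and it accepts any iterable rather than only lists. The following is a minimal sketch of both points; the variable names are illustrative.

```python
import statistics

readings = [10, 2, 38, 23, 38]

# median() sorts a copy internally, so the input list is left untouched
middle = statistics.median(readings)
print(middle)    # 23
print(readings)  # [10, 2, 38, 23, 38]

# Any iterable is accepted, for example a generator expression
doubled_median = statistics.median(x * 2 for x in readings)
print(doubled_median)  # 46
```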
median_low() and median_high()
The statistics module also provides median_low() and median_high() for even-length lists when you need the actual lower or upper middle value instead of their average:
import statistics
data = [10, 2, 38, 23]
print(statistics.median_low(data)) # 10 (lower middle value)
print(statistics.median_high(data)) # 23 (upper middle value)
print(statistics.median(data)) # 16.5 (average of both)
These are useful when your data consists of discrete or ordinal values where an average would not be meaningful, such as letter grades or ranked categories.
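As a rough illustration of the point about discrete data, the sketch below uses hypothetical survey ratings and letter grades. Note that strings are compared alphabetically, so the letter-grade example only assumes that alphabetical order is the order you want.

```python
import statistics

# Hypothetical survey ratings on a 1-5 scale
ratings = [1, 2, 4, 5]

averaged = statistics.median(ratings)
lower = statistics.median_low(ratings)
upper = statistics.median_high(ratings)

print(averaged)  # 3.0, a rating that nobody actually gave
print(lower)     # 2
print(upper)     # 4

# Hypothetical letter grades; strings sort alphabetically
grades = ["B", "A", "D", "C"]
grade_low = statistics.median_low(grades)
print(grade_low)  # B
```

Because median_low() and median_high() always return a value that is present in the data, they never need to add or divide elements.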
Manual Calculation
Understanding the underlying algorithm is valuable for interviews, educational purposes, or environments where external libraries are not available:
def median(nums):
    if not nums:
        raise ValueError("Cannot compute median of empty list")
    sorted_nums = sorted(nums)
    mid = len(sorted_nums) // 2
    if len(sorted_nums) % 2 == 0:
        return (sorted_nums[mid - 1] + sorted_nums[mid]) / 2
    return sorted_nums[mid]
# Odd-length list
print(median([1, 3, 5, 7, 9]))
# Even-length list
print(median([1, 3, 5, 7]))
Output:
5
4.0
How It Works Step by Step
The algorithm follows three steps:
- Sort the list to arrange values in ascending order.
- Find the middle index using integer division (len // 2).
- Check if the length is even or odd:
  - Odd: return the element at the middle index directly.
  - Even: return the average of the two middle elements.
Here is a visual breakdown for both cases:
# Odd-length list: [10, 2, 38, 23, 38]
# Sorted: [2, 10, 23, 38, 38]
#                 ^
# Length: 5, Mid index: 2
# Result: sorted_nums[2] = 23
# Even-length list: [10, 2, 38, 23]
# Sorted: [2, 10, 23, 38]
#             ^   ^
# Length: 4, Mid index: 2
# Result: (sorted_nums[1] + sorted_nums[2]) / 2 = (10 + 23) / 2 = 16.5
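One way to gain confidence in a hand-written median is to compare it against statistics.median() on random inputs. The sketch below repeats the manual logic under the hypothetical name manual_median so that the block runs on its own; the sample sizes and value ranges are arbitrary choices.

```python
import random
import statistics


def manual_median(nums):
    """Same logic as the manual function above, repeated so this block runs alone."""
    if not nums:
        raise ValueError("Cannot compute median of empty list")
    sorted_nums = sorted(nums)
    mid = len(sorted_nums) // 2
    if len(sorted_nums) % 2 == 0:
        return (sorted_nums[mid - 1] + sorted_nums[mid]) / 2
    return sorted_nums[mid]


random.seed(0)  # fixed seed so the check is repeatable
mismatches = 0
for _ in range(1000):
    size = random.randint(1, 20)
    sample = [random.randint(-100, 100) for _ in range(size)]
    if manual_median(sample) != statistics.median(sample):
        mismatches += 1

print(f"Mismatches: {mismatches}")  # Mismatches: 0
```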
Using NumPy for Large Datasets
When working with millions of values, NumPy provides optimized performance through vectorized operations written in C:
import numpy as np
data = [10, 2, 38, 23, 38]
print(np.median(data))
even_data = [10, 2, 38, 23]
print(np.median(even_data))
Output:
23.0
16.5
For integer and floating-point input, NumPy returns a floating-point value (np.float64 here), even when the median is a whole number (23.0 instead of 23). Keep this in mind if you need integer results or are comparing values with strict type checking.
NumPy's advantage becomes clear with large datasets:
import numpy as np
# Generate a large dataset
large_data = np.random.randint(0, 1000, size=1_000_000)
result = np.median(large_data)
print(f"Median of 1,000,000 values: {result}")
Example output:
Median of 1,000,000 values: 499.5
Handling Edge Cases
A robust median implementation should account for special scenarios:
import statistics
# Single element
print(statistics.median([42]))
# Two elements
print(statistics.median([10, 20]))
# Negative numbers
print(statistics.median([-5, -1, 0, 3, 10]))
# Floating-point numbers
print(statistics.median([1.5, 2.7, 3.2]))
Output:
42
15.0
0
2.7
An empty list raises an exception, which is the correct behavior since the median of no data is undefined:
import statistics
statistics.median([])
Output:
statistics.StatisticsError: no median for empty data
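If an empty input is a normal situation in your program rather than an error, you can catch the exception and fall back to a default. The helper below, safe_median, is a hypothetical name used only for this sketch; returning None by default is an assumption about what suits the caller.

```python
import statistics


def safe_median(values, default=None):
    """Return the median, or `default` when there is no data (hypothetical helper)."""
    try:
        return statistics.median(values)
    except statistics.StatisticsError:
        return default


print(safe_median([3, 1, 2]))      # 2
print(safe_median([]))             # None
print(safe_median([], default=0))  # 0
```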
Why the Median Resists Outliers
The median is preferred over the mean when your data may contain extreme values:
import statistics
salaries = [35000, 40000, 42000, 45000, 50000, 5000000]
print(f"Mean: {statistics.mean(salaries):,.0f}")
print(f"Median: {statistics.median(salaries):,.0f}")
Output:
Mean: 868,667
Median: 43,500
The single salary of 5,000,000 pulls the mean up to 868,667, which does not represent any typical salary in the dataset. The median of 43,500 is a much more representative measure of the central tendency.
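To see the effect more directly, the sketch below keeps the five lower salaries fixed and varies only the top one. The three top-salary values are arbitrary illustrations. The mean grows with the outlier, while the median stays at 43,500 because the two middle values never change.

```python
import statistics

base_salaries = [35000, 40000, 42000, 45000, 50000]

for top_salary in (60_000, 600_000, 5_000_000):
    salaries = base_salaries + [top_salary]
    mean = statistics.mean(salaries)
    median = statistics.median(salaries)
    print(f"Top salary {top_salary:>9,}: mean {mean:>9,.0f}, median {median:,.0f}")
```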
Performance Comparison
For large datasets, the choice of method affects execution speed:
import timeit
import statistics
import numpy as np
data = list(range(10000))
stats_time = timeit.timeit(lambda: statistics.median(data), number=1000)
np_time = timeit.timeit(lambda: np.median(data), number=1000)
print(f"statistics.median: {stats_time:.4f}s")
print(f"np.median: {np_time:.4f}s")
Example output (times vary by system):
statistics.median: 2.1543s
np.median: 0.4821s
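NumPy is typically faster for large datasets because its partial sorting and arithmetic run in optimized C code. Two details of this benchmark affect the numbers, however. It passes a Python list, so every np.median() call also pays for converting the list to an array. The input is also already sorted, which is a best case for Python's built-in sort. Results on your own data may therefore differ noticeably from the example output.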
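A somewhat fairer comparison, sketched below, uses unsorted input and times np.median() both on a Python list and on an array converted ahead of time. The dataset size, value range, and repetition count are arbitrary assumptions, and no output is shown because timings depend entirely on your machine.

```python
import random
import statistics
import timeit

import numpy as np

random.seed(0)
data = [random.randint(0, 1_000_000) for _ in range(10_000)]  # unsorted input
array = np.asarray(data)  # convert once, outside the timed calls

stats_time = timeit.timeit(lambda: statistics.median(data), number=200)
np_list_time = timeit.timeit(lambda: np.median(data), number=200)
np_array_time = timeit.timeit(lambda: np.median(array), number=200)

print(f"statistics.median (list): {stats_time:.4f}s")
print(f"np.median (list):         {np_list_time:.4f}s")
print(f"np.median (array):        {np_array_time:.4f}s")
```

If your data already lives in NumPy arrays, the third timing is the relevant one; if it lives in Python lists, the second is.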
Method Comparison
| Method | Auto-Sorts | Return Type | Best For |
|---|---|---|---|
| statistics.median() | Yes | int or float | General-purpose, standard library |
| Manual calculation | Yes (calls sorted() internally) | int or float | Learning, interviews, no dependencies |
| np.median() | Yes | float (even for integer input) | Large datasets, scientific computing |
Conclusion
- For most applications, statistics.median() offers the best balance of simplicity, readability, and reliability. It is part of the standard library, handles the edge cases shown above correctly, and requires no installation.
- Switch to NumPy when processing large datasets where performance becomes critical, keeping in mind that it returns a float even for integer input.
Understanding the manual approach is valuable for interviews and for situations where you cannot use external libraries or need to customize the algorithm for specific requirements.