# How to Calculate Moving Window or Rolling Averages in Python
Calculating the average of consecutive data segments, commonly called a moving window or rolling average, is a staple of data analysis. It smooths noisy data such as stock prices, sensor readings, or website traffic.
This guide covers three approaches: Standard Python for simplicity, NumPy for performance, and Pandas for time-series analysis.
## Understanding Moving Averages
A moving average slides a window across your data, calculating the average at each position.
Example: data `[10, 20, 30, 40]` with window size 2
| Position | Window | Calculation | Result |
|---|---|---|---|
| 0 | [10, 20] | (10 + 20) / 2 | 15 |
| 1 | [20, 30] | (20 + 30) / 2 | 25 |
| 2 | [30, 40] | (30 + 40) / 2 | 35 |
## Using Standard Python
For small datasets or when avoiding external dependencies, use a list comprehension with slicing.
```python
def moving_average(data: list, window_size: int) -> list:
    """Calculate moving average using list comprehension."""
    return [
        sum(data[i:i + window_size]) / window_size
        for i in range(len(data) - window_size + 1)
    ]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = moving_average(data, window_size=3)
print(result)
```
Output:

```
[2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```
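The slice in `sum(data[i:i + window_size])` re-sums every window, which is O(n·w) overall. If you need to stay in pure Python on larger inputs, a running sum does the same job in O(n). A sketch (the name `moving_average_running` is ours):

```python
def moving_average_running(data: list, window_size: int) -> list:
    """O(n) moving average: update a running sum instead of re-summing each window."""
    if window_size <= 0 or window_size > len(data):
        return []
    window_sum = sum(data[:window_size])
    result = [window_sum / window_size]
    for i in range(window_size, len(data)):
        # Add the element entering the window, drop the one leaving it.
        window_sum += data[i] - data[i - window_size]
        result.append(window_sum / window_size)
    return result

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(moving_average_running(data, 3))
# [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```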
### Handling Edge Cases
```python
def moving_average_safe(data: list, window_size: int) -> list:
    """Calculate moving average with validation."""
    if not data:
        return []
    if window_size <= 0:
        raise ValueError("Window size must be positive")
    if window_size > len(data):
        raise ValueError("Window size cannot exceed data length")
    return [
        sum(data[i:i + window_size]) / window_size
        for i in range(len(data) - window_size + 1)
    ]
```
## Using NumPy for Performance
For large datasets (10,000+ elements), pure-Python loops become slow. NumPy's `np.convolve` runs in optimized C code and delivers a significant speedup.
```python
import numpy as np

def moving_average_numpy(data: np.ndarray, window_size: int) -> np.ndarray:
    """Calculate moving average using NumPy convolution."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(data, kernel, mode='valid')

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
result = moving_average_numpy(data, window_size=3)
print(result)
```
Output:

```
[2. 3. 4. 5. 6. 7. 8. 9.]
```
Convolution slides a "kernel" (a vector of weights) across the data. When the kernel is `[1/n, 1/n, ..., 1/n]`, the weighted sum at each position is exactly the mean of that window.
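As a quick sanity check, the uniform kernel reproduces the per-window means computed by hand:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3
kernel = np.ones(window) / window           # [1/3, 1/3, 1/3]

via_convolve = np.convolve(data, kernel, mode='valid')
# Same quantity computed explicitly, window by window
via_loop = np.array([data[i:i + window].mean()
                     for i in range(len(data) - window + 1)])

print(np.allclose(via_convolve, via_loop))  # True
```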
### Using `cumsum` for Even Faster Results
For very large arrays, a cumulative-sum (prefix-sum) approach is even more efficient:
```python
import numpy as np

def moving_average_cumsum(data: np.ndarray, window_size: int) -> np.ndarray:
    """Calculate moving average using cumulative sum (fastest)."""
    cumsum = np.cumsum(np.insert(data, 0, 0))
    return (cumsum[window_size:] - cumsum[:-window_size]) / window_size

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
result = moving_average_cumsum(data, window_size=3)
print(result)
```
Output:

```
[2. 3. 4. 5. 6. 7. 8. 9.]
```
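Why this works: with a zero prepended, `cumsum[k]` holds the sum of the first `k` elements, so each window sum is the difference of two prefix sums, and the whole pass is O(n) with no per-window loop. A quick equivalence check against the convolution approach:

```python
import numpy as np

data = np.arange(1, 11, dtype=float)
window = 3

# Prefix sums with a leading zero: cumsum[k] = data[0] + ... + data[k-1]
cumsum = np.cumsum(np.insert(data, 0, 0))
# Window sum = cumsum[i + window] - cumsum[i]
via_cumsum = (cumsum[window:] - cumsum[:-window]) / window

kernel = np.ones(window) / window
via_convolve = np.convolve(data, kernel, mode='valid')

print(np.allclose(via_cumsum, via_convolve))  # True
```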
## Using Pandas for Time-Series
Pandas provides the `.rolling()` method, designed specifically for time-series analysis. It handles missing data automatically and integrates seamlessly with DataFrames.
```python
import pandas as pd

# Create sample stock data
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=6),
    'price': [100, 102, 104, 103, 105, 108]
})

# Calculate 3-day moving average
df['ma_3'] = df['price'].rolling(window=3).mean()
print(df)
```
Output:

```
        date  price        ma_3
0 2024-01-01    100         NaN
1 2024-01-02    102         NaN
2 2024-01-03    104  102.000000
3 2024-01-04    103  103.000000
4 2024-01-05    105  104.000000
5 2024-01-06    108  105.333333
```
The first `window - 1` rows show `NaN` because there isn't enough data to fill the window. Use `min_periods` to require fewer values:
```python
df['ma_3'] = df['price'].rolling(window=3, min_periods=1).mean()
```
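With `min_periods=1`, the early rows are averaged over however many values are available instead of producing `NaN`. For the prices above:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 102, 104, 103, 105, 108]})
# Partial windows are allowed: row 0 averages 1 value, row 1 averages 2, etc.
df['ma_3'] = df['price'].rolling(window=3, min_periods=1).mean()
print(df['ma_3'].round(2).tolist())
# [100.0, 101.0, 102.0, 103.0, 104.0, 105.33]
```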
### Multiple Moving Averages
Financial analysis often compares short and long-term trends:
```python
import pandas as pd

df = pd.DataFrame({
    'price': [100, 102, 101, 105, 107, 106, 110, 112, 115, 113]
})

# Short-term and long-term moving averages
df['ma_3'] = df['price'].rolling(window=3).mean()
df['ma_5'] = df['price'].rolling(window=5).mean()
print(df)
```
Output:

```
   price        ma_3   ma_5
0    100         NaN    NaN
1    102         NaN    NaN
2    101  101.000000    NaN
3    105  102.666667    NaN
4    107  104.333333  103.0
5    106  106.000000  104.2
6    110  107.666667  105.8
7    112  109.333333  108.0
8    115  112.333333  110.0
9    113  113.333333  111.2
```
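A common next step with two averages is detecting crossovers, the points where the short-term average moves above the long-term one. A minimal sketch on the same data (the crossover rule here is illustrative, not trading advice):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 102, 101, 105, 107, 106, 110, 112, 115, 113]})
df['ma_3'] = df['price'].rolling(window=3).mean()
df['ma_5'] = df['price'].rolling(window=5).mean()

# True where the short average sits above the long one (NaN comparisons are False)
above = df['ma_3'] > df['ma_5']
# A crossover is a change from False to True between consecutive rows
crossovers = df.index[above & ~above.shift(1, fill_value=False)]
print(list(crossovers))  # [4]
```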
### Weighted Moving Average
An exponentially weighted moving average (EWMA) gives more importance to recent values:
```python
import pandas as pd

df = pd.DataFrame({'price': [100, 102, 104, 103, 105, 108]})

# Exponentially weighted moving average
df['ewma'] = df['price'].ewm(span=3).mean()
print(df)
```
Output:

```
   price        ewma
0    100  100.000000
1    102  101.333333
2    104  102.857143
3    103  102.933333
4    105  104.000000
5    108  106.031746
```
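With the default `adjust=True`, `span=3` corresponds to `alpha = 2 / (span + 1) = 0.5`, and each output is a weighted mean with weights `(1 - alpha)^i` over past values (newest first). The second row can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 102, 104, 103, 105, 108]})
ewma = df['price'].ewm(span=3).mean()

# Row 1 by hand: weights 1 (newest value) and 0.5 (previous value)
manual = (102 * 1 + 100 * 0.5) / (1 + 0.5)
print(round(manual, 6), round(ewma.iloc[1], 6))  # 101.333333 101.333333
```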
## Performance Comparison
```python
import numpy as np
import pandas as pd
import time

# Generate large dataset
data = list(range(100_000))
np_data = np.array(data)
pd_series = pd.Series(data)
window = 50

# Standard Python
start = time.perf_counter()
result1 = [sum(data[i:i+window])/window for i in range(len(data)-window+1)]
print(f"List comprehension: {time.perf_counter() - start:.4f}s")

# NumPy
start = time.perf_counter()
kernel = np.ones(window) / window
result2 = np.convolve(np_data, kernel, mode='valid')
print(f"NumPy convolve: {time.perf_counter() - start:.4f}s")

# Pandas
start = time.perf_counter()
result3 = pd_series.rolling(window=window).mean()
print(f"Pandas rolling: {time.perf_counter() - start:.4f}s")
```
Typical output (exact timings vary by machine):

```
List comprehension: 0.8234s
NumPy convolve: 0.0021s
Pandas rolling: 0.0035s
```
## Method Comparison
| Method | Best For | Pros | Cons |
|---|---|---|---|
| List comprehension | Small lists, coding challenges | Zero dependencies | Slow for large data |
| NumPy `convolve` | Large numerical arrays | Extremely fast | Requires NumPy |
| NumPy `cumsum` | Very large arrays | Fastest option | Slightly complex |
| Pandas `rolling` | DataFrames, time-series | Handles NaN, flexible | Overhead for simple cases |
## Summary
- Use list comprehension for simple scripts or when you cannot install packages.
- Use NumPy when performance matters and you're working with numerical arrays.
- Use Pandas for real-world data analysis, especially with time-series or CSV data.