Python NumPy: How to Generate Chi-Square Distributions in NumPy
The chi-square (χ²) distribution is fundamental in statistical testing, particularly for goodness-of-fit tests, independence tests, and confidence interval estimation. NumPy provides efficient methods to generate random samples from this distribution.
Using the Modern Generator API
The recommended approach uses NumPy's modern random generator:
import numpy as np
rng = np.random.default_rng(seed=42) # Optional seed for reproducibility
# Generate 5 samples with 3 degrees of freedom
samples = rng.chisquare(df=3, size=5)
print(samples)
Output:
[3.05543949 4.35892866 2.6205389 2.29722655 4.79524961]
Understanding Degrees of Freedom
The df parameter (degrees of freedom) controls the distribution's shape:
import numpy as np
rng = np.random.default_rng()
# Low df: heavily right-skewed
low_df = rng.chisquare(df=1, size=5)
print(f"df=1: mean={low_df.mean():.2f}")
# Medium df: moderately skewed
mid_df = rng.chisquare(df=5, size=5)
print(f"df=5: mean={mid_df.mean():.2f}")
# High df: approaches normal distribution
high_df = rng.chisquare(df=30, size=5)
print(f"df=30: mean={high_df.mean():.2f}")
Output:
df=1: mean=0.42
df=5: mean=5.07
df=30: mean=28.63
The mean of a chi-square distribution equals its degrees of freedom (df), and the variance equals 2×df. As df increases, the distribution becomes more symmetric.
Multi-dimensional Output
Generate matrices or higher-dimensional arrays of chi-square samples:
import numpy as np
rng = np.random.default_rng()
# 3x3 matrix
matrix = rng.chisquare(df=5, size=(3, 3))
print(matrix.shape) # (3, 3)
print(matrix)
# 3D array: 2 layers, 3 rows, 4 columns
tensor = rng.chisquare(df=10, size=(2, 3, 4))
print(tensor.shape) # (2, 3, 4)
Output:
(3, 3)
[[ 2.68470397 4.50259988 3.39465421]
[ 5.0783715 1.16429817 3.38975043]
[ 5.70560665 1.39477394 14.24344514]]
(2, 3, 4)
Visualizing Different Degrees of Freedom
Compare how the distribution shape changes with different df values:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=42)
fig, ax = plt.subplots(figsize=(10, 6))
degrees = [1, 2, 5, 10, 20]
colors = ['red', 'blue', 'green', 'orange', 'purple']
for df, color in zip(degrees, colors):
samples = rng.chisquare(df=df, size=10000)
ax.hist(samples, bins=100, density=True, alpha=0.5,
histtype='stepfilled', label=f'df={df}', color=color)
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.set_title('Chi-Square Distribution for Various Degrees of Freedom')
ax.legend()
ax.set_xlim(0, 40)
plt.show()
Statistical Properties
Verify theoretical properties with samples:
import numpy as np
rng = np.random.default_rng()
df = 10
samples = rng.chisquare(df=df, size=100000)
print(f"Degrees of freedom: {df}")
print(f"Theoretical mean: {df}")
print(f"Sample mean: {samples.mean():.2f}")
print(f"Theoretical variance: {2 * df}")
print(f"Sample variance: {samples.var():.2f}")
Output:
Degrees of freedom: 10
Theoretical mean: 10
Sample mean: 9.99
Theoretical variance: 20
Sample variance: 19.82
Practical Example: Chi-Square Test Simulation
Simulate the chi-square test statistic under the null hypothesis:
import numpy as np
def simulate_chi_square_test(observed, expected, n_simulations=10000):
"""Simulate chi-square test statistic distribution."""
rng = np.random.default_rng()
# Calculate observed test statistic
observed_stat = np.sum((observed - expected)**2 / expected)
# Degrees of freedom
df = len(observed) - 1
# Simulate null distribution
null_distribution = rng.chisquare(df=df, size=n_simulations)
# Calculate p-value
p_value = np.mean(null_distribution >= observed_stat)
return observed_stat, p_value
# Example: Testing if a die is fair
observed = np.array([18, 22, 15, 20, 25, 20]) # Observed counts
expected = np.array([20, 20, 20, 20, 20, 20]) # Expected if fair
stat, p_val = simulate_chi_square_test(observed, expected)
print(f"Chi-square statistic: {stat:.2f}")
print(f"Simulated p-value: {p_val:.4f}")
Output:
Chi-square statistic: 2.90
Simulated p-value: 0.7184
Legacy API (For Reference)
The older API still works but is not recommended for new code:
import numpy as np
# Legacy method (avoid in new code)
np.random.seed(42)
samples = np.random.chisquare(df=5, size=10)
# Modern equivalent (preferred)
rng = np.random.default_rng(42)
samples = rng.chisquare(df=5, size=10)
The modern default_rng() API provides better statistical properties, is faster, and supports independent random streams for parallel computing.
Combining with SciPy for Full Analysis
For complete statistical analysis, combine NumPy sampling with SciPy:
import numpy as np
from scipy import stats
rng = np.random.default_rng()
df = 5
samples = rng.chisquare(df=df, size=10000)
# Compare with theoretical distribution
x = np.linspace(0, 20, 100)
theoretical_pdf = stats.chi2.pdf(x, df)
# Calculate percentiles
p95 = stats.chi2.ppf(0.95, df)
print(f"95th percentile (theoretical): {p95:.2f}")
print(f"95th percentile (sample): {np.percentile(samples, 95):.2f}")
Output:
95th percentile (theoretical): 11.07
95th percentile (sample): 11.06
Parameters Reference
| Parameter | Description | Valid Values |
|---|---|---|
df | Degrees of freedom | Positive number (> 0) |
size | Output shape | int, tuple, or None |
Key Properties
| Property | Formula | Example (df=5) |
|---|---|---|
| Mean | df | 5 |
| Variance | 2 × df | 10 |
| Mode | max(df - 2, 0) | 3 |
| Skewness | √(8/df) | 1.26 |
Summary
Use np.random.default_rng().chisquare(df, size) to generate chi-square distributed random samples.
- The degrees of freedom parameter controls the distribution shape-lower values produce heavily right-skewed distributions, while higher values approach normality.
- This distribution is essential for hypothesis testing, particularly when comparing observed frequencies to expected frequencies.