Python NumPy: How to Calculate Array Statistics
Statistical analysis is the backbone of data science and scientific computing. While standard Python lists can store numbers, they lack the performance and the built-in functions required for efficient statistical analysis. The NumPy library bridges this gap, providing high-performance array objects and a suite of statistical tools.
This guide explores how to create arrays, calculate core descriptive statistics (like mean and variance), handle multi-dimensional data, and perform advanced analysis like normalization.
Creating and Understanding NumPy Arrays
Before calculating statistics, you need to structure your data into NumPy arrays. These arrays are more memory-efficient and faster than Python lists for numerical operations.
Basic Array Creation
You can convert lists to arrays or generate data using built-in methods.
import numpy as np
# 1. From a Python List
data_list = [10, 20, 30, 40, 50]
arr = np.array(data_list)
# 2. Generating Data
# arange: Start at 0, stop before 10, step by 2
range_arr = np.arange(0, 10, 2)
# linspace: 5 evenly spaced numbers between 0 and 1
linear_arr = np.linspace(0, 1, 5)
print(f"Manual Array: {arr}")
print(f"Range Array: {range_arr}")
print(f"Linear Array: {linear_arr}")
Output:
Manual Array: [10 20 30 40 50]
Range Array: [0 2 4 6 8]
Linear Array: [0.   0.25 0.5  0.75 1.  ]
Calculating Basic Descriptive Statistics
NumPy provides functions to calculate Central Tendency (where the data is centered) and Dispersion (how spread out the data is).
import numpy as np
data = np.array([15, 20, 35, 40, 50, 12, 45, 55])
# Central Tendency
mean_val = np.mean(data)
median_val = np.median(data)
# Dispersion
std_dev = np.std(data) # Standard Deviation
variance = np.var(data) # Variance
min_val = np.min(data)
max_val = np.max(data)
print(f"Mean: {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"Range: {min_val} - {max_val}")
Output:
Mean: 34.00
Median: 37.50
Std Dev: 15.39
Range: 12 - 55
Median vs. Mean: Use the median when your data has outliers (extreme values), as the mean is highly sensitive to them.
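To see the difference, consider a small made-up sample where one extreme value drags the mean away from the typical value while the median stays put:
import numpy as np
salaries = np.array([40, 45, 50, 55, 500])  # one extreme outlier
print(f"Mean: {np.mean(salaries)}")
print(f"Median: {np.median(salaries)}")
Output:
Mean: 138.0
Median: 50.0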
Working with Multi-Dimensional Arrays (Axes)
When working with matrices (2D arrays), you often need to calculate statistics for the entire dataset, or specifically for rows/columns. In NumPy, this is controlled by the axis parameter.
- axis=0: Calculate down the columns (collapses rows).
- axis=1: Calculate across the rows (collapses columns).
import numpy as np
# 3 rows, 3 columns
matrix = np.array([
    [10, 20, 30],
    [5, 15, 25],
    [2, 4, 6]
])
# 1. Global Mean (All elements)
print(f"Global Mean: {np.mean(matrix):.2f}")
# 2. Column Mean (axis=0) -> Collapses rows
print(f"Column Means: {np.mean(matrix, axis=0)}")
# 3. Row Mean (axis=1) -> Collapses columns
print(f"Row Means: {np.mean(matrix, axis=1)}")
Output:
Global Mean: 13.00
Column Means: [ 5.66666667 13.         20.33333333]
Row Means: [20. 15.  4.]
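The axis logic is not specific to np.mean; the same parameter works for any NumPy reduction, such as np.sum, np.min, np.max, and np.std. A quick sketch using the same matrix:
import numpy as np
matrix = np.array([
    [10, 20, 30],
    [5, 15, 25],
    [2, 4, 6]
])
# Any reduction function accepts the axis parameter
print(f"Column Max: {np.max(matrix, axis=0)}")
print(f"Row Std: {np.std(matrix, axis=1)}")
Output:
Column Max: [10 20 30]
Row Std: [8.16496581 8.16496581 1.63299316]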
Advanced Analysis: Percentiles and Correlation
Beyond averages, you often need to understand the distribution of data or how variables relate to one another.
Percentiles
Percentiles help identify where a value stands relative to the rest of the data.
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# 25th, 50th (median), and 75th percentiles
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)
print(f"25th Percentile: {q1}")
print(f"50th Percentile: {q2}")
print(f"75th Percentile: {q3}")
Output:
25th Percentile: 3.25
50th Percentile: 5.5
75th Percentile: 7.75
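A common application of these quartiles is the interquartile range (IQR), the spread of the middle 50% of the values, often used to flag outliers. A minimal sketch building on the same data:
import numpy as np
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# np.percentile accepts a list of percentiles at once
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"IQR: {iqr}")
Output:
IQR: 4.5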
Correlation
Use np.corrcoef to measure whether two datasets move together. It returns the Pearson correlation coefficient, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10]) # Perfect positive correlation
# Returns a matrix
corr_matrix = np.corrcoef(x, y)
print(f"Correlation Matrix:\n{corr_matrix}")
Output:
Correlation Matrix:
[[1. 1.]
 [1. 1.]]
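The diagonal entries are each variable's correlation with itself (always 1.0), so the number you usually want is an off-diagonal element. A short sketch with a made-up dataset that moves in the opposite direction:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
z = np.array([10, 8, 6, 4, 2])  # decreases as x increases
# The [0, 1] entry is the correlation between the two inputs
r = np.corrcoef(x, z)[0, 1]
print(f"Correlation: {r:.2f}")
Output:
Correlation: -1.00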
Handling Missing Data and Normalization
Real-world data is often messy or requires scaling before it can be used in Machine Learning algorithms.
Handling NaN (Not a Number)
Standard functions return nan if the array contains missing values. Use nan-safe functions instead.
import numpy as np
dirty_data = np.array([10, 20, np.nan, 40, 50])
# ⛔️ Incorrect: Result will be nan
print(f"Standard Mean: {np.mean(dirty_data)}")
# ✅ Correct: Ignores nans
print(f"Safe Mean: {np.nanmean(dirty_data)}")
Output:
Standard Mean: nan
Safe Mean: 30.0
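The same pattern extends to the other statistics: NumPy ships nan-safe counterparts such as np.nanmedian, np.nanstd, np.nanmin, and np.nanmax. Alternatively, you can drop the missing values explicitly with a boolean mask:
import numpy as np
dirty_data = np.array([10, 20, np.nan, 40, 50])
# Keep only the elements that are not NaN
clean_data = dirty_data[~np.isnan(dirty_data)]
print(f"Clean Data: {clean_data}")
print(f"Mean: {np.mean(clean_data)}")
Output:
Clean Data: [10. 20. 40. 50.]
Mean: 30.0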
Z-Score Normalization
Standardizing data (mean of 0, standard deviation of 1) is a common preprocessing step.
import numpy as np
data = np.array([10, 20, 30, 40, 50])
def normalize_z_score(arr):
    return (arr - np.mean(arr)) / np.std(arr)
normalized = normalize_z_score(data)
print(f"Normalized Data:\n{normalized}")
Output:
Normalized Data:
[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
If np.std(arr) is 0 (all values are identical), the division produces nan values, and NumPy emits a runtime warning. Always check the variance of your data before normalizing.
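A defensive version of the helper above might return zeros when the input has no variance (one reasonable convention among several; returning the array unchanged is another). A sketch:
import numpy as np
def normalize_z_score_safe(arr):
    std = np.std(arr)
    if std == 0:
        # All values are identical: return zeros instead of dividing by zero
        return np.zeros_like(arr, dtype=float)
    return (arr - np.mean(arr)) / std
print(normalize_z_score_safe(np.array([7, 7, 7])))
Output:
[0. 0. 0.]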
Conclusion
Calculating statistics in Python using NumPy is both concise and performant.
- Initialize your data using np.array.
- Analyze basics using np.mean, np.median, and np.std.
- Control Dimensions using the axis parameter for matrices.
- Handle Errors using nan-safe functions like np.nanmean to account for missing data.