Python NumPy: How to Calculate Array Statistics

Statistical analysis is the backbone of data science and scientific computing. In Python, while standard lists can store numbers, they lack the performance and built-in functions required for efficient statistical analysis. The NumPy library bridges this gap, providing high-performance array objects and a suite of statistical tools.

This guide explores how to create arrays, calculate core descriptive statistics (like mean and variance), handle multi-dimensional data, and perform advanced analysis like normalization.

Creating and Understanding NumPy Arrays

Before calculating statistics, you need to structure your data into NumPy arrays. These arrays are more memory-efficient and faster than Python lists for numerical operations.

Basic Array Creation

You can convert lists to arrays or generate data using built-in methods.

import numpy as np

# 1. From a Python List
data_list = [10, 20, 30, 40, 50]
arr = np.array(data_list)

# 2. Generating Data
# arange: Start at 0, stop before 10, step by 2
range_arr = np.arange(0, 10, 2)

# linspace: 5 evenly spaced numbers between 0 and 1
linear_arr = np.linspace(0, 1, 5)

print(f"Manual Array: {arr}")
print(f"Range Array: {range_arr}")
print(f"Linear Array: {linear_arr}")

Output:

Manual Array: [10 20 30 40 50]
Range Array: [0 2 4 6 8]
Linear Array: [0.   0.25 0.5  0.75 1.  ]
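Every array also carries metadata worth checking before you compute statistics, such as its shape and element type. A quick sketch (the exact integer dtype is platform-dependent):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Layout metadata attached to every array
print(arr.shape)  # (5,)
print(arr.ndim)   # 1
print(arr.dtype)  # an integer type, e.g. int64 (may differ by platform)
```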

Calculating Basic Descriptive Statistics

NumPy provides functions to calculate Central Tendency (where the data is centered) and Dispersion (how spread out the data is).

import numpy as np

data = np.array([15, 20, 35, 40, 50, 12, 45, 55])

# Central Tendency
mean_val = np.mean(data)
median_val = np.median(data)

# Dispersion
std_dev = np.std(data) # Standard Deviation
variance = np.var(data) # Variance
min_val = np.min(data)
max_val = np.max(data)

print(f"Mean: {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"Range: {min_val} - {max_val}")

Output:

Mean: 34.00
Median: 37.50
Std Dev: 15.39
Range: 12 - 55
Tip: Median vs. Mean: Use the median when your data contains outliers (extreme values), as the mean is highly sensitive to them.
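The effect is easy to demonstrate: a single extreme value drags the mean far from the bulk of the data, while the median barely moves. The salary figures below are made up for illustration:

```python
import numpy as np

# Five typical values plus one extreme outlier
salaries = np.array([30, 32, 35, 38, 40, 500])

print(np.mean(salaries))    # 112.5, pulled far up by the outlier
print(np.median(salaries))  # 36.5, essentially unaffected
```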

Working with Multi-Dimensional Arrays (Axes)

When working with matrices (2D arrays), you often need to calculate statistics for the entire dataset, or specifically for rows/columns. In NumPy, this is controlled by the axis parameter.

  • axis=0: Calculate down the columns (one result per column).
  • axis=1: Calculate across the rows (one result per row).

import numpy as np

# 3 rows, 3 columns
matrix = np.array([
    [10, 20, 30],
    [5, 15, 25],
    [2, 4, 6]
])

# 1. Global Mean (All elements)
print(f"Global Mean: {np.mean(matrix):.2f}")

# 2. Column Mean (axis=0) -> Collapses rows
print(f"Column Means: {np.mean(matrix, axis=0)}")

# 3. Row Mean (axis=1) -> Collapses columns
print(f"Row Means: {np.mean(matrix, axis=1)}")

Output:

Global Mean: 13.00
Column Means: [ 5.66666667 13.         20.33333333]
Row Means: [20. 15.  4.]
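The axis parameter is not specific to np.mean; it works the same way for other reducers such as np.max and np.sum. A quick sketch on the same matrix:

```python
import numpy as np

matrix = np.array([
    [10, 20, 30],
    [5, 15, 25],
    [2, 4, 6]
])

# Column maxima (collapses rows)
print(np.max(matrix, axis=0))  # [10 20 30]

# Row sums (collapses columns)
print(np.sum(matrix, axis=1))  # [60 45 12]
```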

Advanced Analysis: Percentiles and Correlation

Beyond averages, you often need to understand the distribution of data or how variables relate to one another.

Percentiles

Percentiles help identify where a value stands relative to the rest of the data.

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 25th, 50th (median), and 75th percentiles
q1 = np.percentile(data, 25)
q2 = np.percentile(data, 50)
q3 = np.percentile(data, 75)

print(f"25th Percentile: {q1}")
print(f"50th Percentile: {q2}")
print(f"75th Percentile: {q3}")

Output:

25th Percentile: 3.25
50th Percentile: 5.5
75th Percentile: 7.75
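Percentiles also underpin the common 1.5 × IQR rule for flagging outliers. This is a statistical convention rather than a dedicated NumPy function, so the bounds are computed by hand in this sketch:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Interquartile range from the 25th and 75th percentiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # 4.5

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # empty: this evenly spread sample has no outliers
```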

Correlation

Use np.corrcoef to measure whether two datasets move together. It returns a matrix of Pearson correlation coefficients, each ranging from -1 (perfect negative) to 1 (perfect positive).

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10]) # Perfect positive correlation

# Returns a matrix
corr_matrix = np.corrcoef(x, y)
print(f"Correlation Matrix:\n{corr_matrix}")

Output:

Correlation Matrix:
[[1. 1.]
 [1. 1.]]
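In practice you usually want the single coefficient rather than the whole matrix; it sits in the off-diagonal entries. A sketch, this time with a perfectly negatively correlated pair:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 8, 6, 4, 2])  # y decreases as x increases

corr_matrix = np.corrcoef(x, y)

# The off-diagonal entry [0, 1] is the coefficient between x and y
r = corr_matrix[0, 1]
print(r)  # approximately -1.0
```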

Handling Missing Data and Normalization

Real-world data is often messy or requires scaling before it can be used in Machine Learning algorithms.

Handling NaN (Not a Number)

Standard functions return nan if the array contains missing values. Use nan-safe functions instead.

import numpy as np

dirty_data = np.array([10, 20, np.nan, 40, 50])

# ⛔️ Incorrect: Result will be nan
print(f"Standard Mean: {np.mean(dirty_data)}")

# ✅ Correct: Ignores nans
print(f"Safe Mean: {np.nanmean(dirty_data)}")

Output:

Standard Mean: nan
Safe Mean: 30.0
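Before reaching for nan-safe functions, it is often worth counting and locating the missing values with np.isnan. Most other reducers also have nan-safe variants (np.nanstd, np.nanmax, and so on); a brief sketch:

```python
import numpy as np

dirty_data = np.array([10, 20, np.nan, 40, 50])

# Count and locate missing values before deciding how to handle them
mask = np.isnan(dirty_data)
print(f"Missing values: {mask.sum()}")     # 1
print(f"Clean subset: {dirty_data[~mask]}")

# Nan-safe variants of other reducers
print(f"Safe Std: {np.nanstd(dirty_data):.2f}")
print(f"Safe Max: {np.nanmax(dirty_data)}")  # 50.0
```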

Z-Score Normalization

Standardizing data (mean of 0, standard deviation of 1) is a common preprocessing step.

import numpy as np

data = np.array([10, 20, 30, 40, 50])

def normalize_z_score(arr):
    return (arr - np.mean(arr)) / np.std(arr)

normalized = normalize_z_score(data)
print(f"Normalized Data:\n{normalized}")

Output:

Normalized Data:
[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
Note: If np.std(arr) is 0 (all values are identical), a division by zero occurs. Validate the variance of your data before normalizing.
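One way to guard against this is a wrapper that falls back to zeros when the variance is zero. A sketch (normalize_z_score_safe is a name chosen here for illustration, not part of the code above):

```python
import numpy as np

def normalize_z_score_safe(arr):
    """Z-score normalize, returning zeros when the variance is zero."""
    std = np.std(arr)
    if std == 0:
        return np.zeros_like(arr, dtype=float)
    return (arr - np.mean(arr)) / std

print(normalize_z_score_safe(np.array([5, 5, 5])))    # [0. 0. 0.]
print(normalize_z_score_safe(np.array([10, 20, 30])))
```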

Conclusion

Calculating statistics in Python using NumPy is both concise and performant.

  1. Initialize your data using np.array.
  2. Analyze basics using np.mean, np.median, and np.std.
  3. Control Dimensions using the axis parameter for matrices.
  4. Handle Missing Data using nan-safe functions such as np.nanmean.
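These steps can be combined into a short end-to-end sketch; the sensor readings below are invented for illustration:

```python
import numpy as np

# 1. Initialize the data (nan marks a missing reading)
readings = np.array([12.0, 15.5, np.nan, 14.2, 13.8, 16.1])

# 2-4. Analyze with nan-safe functions so the gap is ignored
print(f"Mean:   {np.nanmean(readings):.2f}")
print(f"Median: {np.nanmedian(readings):.2f}")
print(f"Std:    {np.nanstd(readings):.2f}")
```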