
Python NumPy: How to Calculate a Covariance Matrix in NumPy

The covariance matrix measures how variables change together, revealing relationships between features in multivariate data. It's fundamental for Principal Component Analysis (PCA), portfolio optimization, and understanding feature dependencies.

Basic Covariance Calculation

Use np.cov() to compute the covariance between two variables:

import numpy as np

height = np.array([65, 70, 72, 68]) # inches
weight = np.array([120, 160, 180, 150]) # pounds

cov_matrix = np.cov(height, weight)
print(cov_matrix)

Output:

[[  8.91666667  74.16666667]
 [ 74.16666667 625.        ]]

Interpreting the Matrix

           height    weight
height [   8.92     74.17 ]
weight [  74.17    625.00 ]
  • Diagonal elements: Variance of each variable (8.92 for height, 625.00 for weight)
  • Off-diagonal elements: Covariance between variables (74.17)
  • Positive covariance: Variables increase together (taller people tend to weigh more)
  • Negative covariance: One increases as the other decreases
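These properties are easy to verify directly: the diagonal entries match each variable's sample variance from np.var with ddof=1, and the matrix is symmetric. A quick sanity check, reusing the height/weight arrays from above:

```python
import numpy as np

height = np.array([65, 70, 72, 68])      # inches
weight = np.array([120, 160, 180, 150])  # pounds

cov_matrix = np.cov(height, weight)

# Diagonal entries are the sample variances (ddof=1)
print(np.isclose(cov_matrix[0, 0], np.var(height, ddof=1)))  # True
print(np.isclose(cov_matrix[1, 1], np.var(weight, ddof=1)))  # True

# The matrix is symmetric: cov(height, weight) == cov(weight, height)
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))  # True
```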

The rowvar Parameter

By default, NumPy assumes each row is a variable and each column is an observation. For typical "tidy" data where rows are observations and columns are features, set rowvar=False:

import numpy as np

# Typical data format: rows = observations, columns = features
data = np.array([
    [1.2, 2.5, 3.1],  # Observation 1
    [2.1, 2.9, 3.5],  # Observation 2
    [3.3, 3.8, 4.2],  # Observation 3
    [4.0, 4.5, 5.0]   # Observation 4
])

# Features are in columns, so set rowvar=False
cov_matrix = np.cov(data, rowvar=False)
print(f"Shape: {cov_matrix.shape}") # (3, 3) (one per feature)
print(cov_matrix)

Output:

Shape: (3, 3)
[[1.55       1.10833333 1.02      ]
 [1.10833333 0.80916667 0.74833333]
 [1.02       0.74833333 0.69666667]]
Warning

Forgetting rowvar=False when your features are in columns produces incorrect results: you'll get covariances between observations instead of between features.

Visualizing rowvar Behavior

import numpy as np

# 4 observations, 2 features
data = np.array([
    [1, 10],
    [2, 20],
    [3, 30],
    [4, 40]
])

# Wrong: treats rows as variables (4x4 matrix)
wrong = np.cov(data)
print(f"rowvar=True shape: {wrong.shape}") # (4, 4)

# Correct: treats columns as variables (2x2 matrix)
correct = np.cov(data, rowvar=False)
print(f"rowvar=False shape: {correct.shape}") # (2, 2)
print(correct)

Output:

rowvar=True shape: (4, 4)
rowvar=False shape: (2, 2)
[[  1.66666667  16.66666667]
 [ 16.66666667 166.66666667]]

Covariance vs Correlation

Covariance depends on variable scales; correlation normalizes to [-1, 1]:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Covariance (scale-dependent)
covariance = np.cov(x, y)[0, 1]
print(f"Covariance: {covariance:.2f}") # 1.50

# Correlation (scale-independent, -1 to 1)
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.2f}") # 0.77

Output:

Covariance: 1.50
Correlation: 0.77
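The two metrics are linked by a simple normalization: correlation is the covariance divided by the product of the two standard deviations. This can be checked by hand, using sample standard deviations (ddof=1) to match np.cov:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

cov_xy = np.cov(x, y)[0, 1]

# corr = cov(x, y) / (std_x * std_y), with ddof=1 to match np.cov
manual_corr = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(f"Manual:      {manual_corr:.4f}")               # 0.7746
print(f"np.corrcoef: {np.corrcoef(x, y)[0, 1]:.4f}")   # 0.7746
```

The ddof choice cancels in the ratio, which is why np.corrcoef has no ddof parameter.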

When to Use Each

Metric        Range        Use when
Covariance    -∞ to +∞     You need actual variance units
Correlation   -1 to +1     Comparing relationship strength
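The difference matters in practice: rescaling a variable (for example, a unit conversion) scales its covariances by the same factor but leaves correlations untouched. A small illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Rescale x by 100 (e.g., meters -> centimeters)
x_cm = x * 100

print(f"cov(x, y):     {np.cov(x, y)[0, 1]:.2f}")       # 1.50
print(f"cov(100x, y):  {np.cov(x_cm, y)[0, 1]:.2f}")    # 150.00 -- scales with the data

print(f"corr(x, y):    {np.corrcoef(x, y)[0, 1]:.4f}")  # 0.7746
print(f"corr(100x, y): {np.corrcoef(x_cm, y)[0, 1]:.4f}")  # 0.7746 -- unchanged
```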

Extracting Specific Values

Navigate the covariance matrix to extract specific relationships:

import numpy as np

# Three features
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

cov_matrix = np.cov(data, rowvar=False)

# Variance of feature 0 (diagonal)
var_0 = cov_matrix[0, 0]
print(f"Variance of feature 0: {var_0:.2f}")

# Covariance between features 0 and 1
cov_01 = cov_matrix[0, 1]
print(f"Covariance (0, 1): {cov_01:.2f}")

# Extract all variances (diagonal)
variances = np.diag(cov_matrix)
print(f"All variances: {variances}")

Output:

Variance of feature 0: 15.00
Covariance (0, 1): 15.00
All variances: [15. 15. 15.]
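With more than a handful of features, the unique pairwise covariances (the strict upper triangle) can be pulled out in one step with np.triu_indices, a pattern worth knowing for larger matrices. A sketch using the same data as above:

```python
import numpy as np

data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

cov_matrix = np.cov(data, rowvar=False)

# Row/column indices of the strict upper triangle: pairs (0,1), (0,2), (1,2)
i, j = np.triu_indices(cov_matrix.shape[0], k=1)

for a, b, c in zip(i, j, cov_matrix[i, j]):
    print(f"cov(feature {a}, feature {b}) = {c:.2f}")  # each prints 15.00 here
```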

Sample vs Population Covariance

By default, NumPy calculates sample covariance (divides by N-1). Use ddof=0 for population covariance:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Sample covariance (default, ddof=1)
sample_cov = np.cov(x, y, ddof=1)
print(f"Sample covariance: {sample_cov[0, 1]:.4f}")

# Population covariance (ddof=0)
pop_cov = np.cov(x, y, ddof=0)
print(f"Population covariance: {pop_cov[0, 1]:.4f}")

Output:

Sample covariance: 1.5000
Population covariance: 1.2000
Tip

Use ddof=1 (default) for sample data when estimating population parameters. Use ddof=0 only when you have the entire population.
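The two estimates differ only by a constant factor of (N - 1)/N, which can be verified directly:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n = len(x)

sample_cov = np.cov(x, y, ddof=1)[0, 1]  # divides by N - 1 -> 1.5
pop_cov = np.cov(x, y, ddof=0)[0, 1]     # divides by N     -> 1.2

# Population covariance = sample covariance * (N - 1) / N
print(np.isclose(pop_cov, sample_cov * (n - 1) / n))  # True
```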

Practical Example: Feature Analysis

Analyze relationships between features in a dataset:

import numpy as np

np.random.seed(42)

# Simulated dataset: 100 observations, 3 features
# Feature 2 is correlated with Feature 0
n_samples = 100
feature_0 = np.random.randn(n_samples)
feature_1 = np.random.randn(n_samples) # Independent
feature_2 = feature_0 * 2 + np.random.randn(n_samples) * 0.5 # Correlated

data = np.column_stack([feature_0, feature_1, feature_2])

# Compute covariance matrix
cov_matrix = np.cov(data, rowvar=False)

# Display with labels
features = ['Feature 0', 'Feature 1', 'Feature 2']
print("Covariance Matrix:")
print(f"{'':>12}", end='')
for f in features:
    print(f"{f:>12}", end='')
print()

for i, row in enumerate(cov_matrix):
    print(f"{features[i]:>12}", end='')
    for val in row:
        print(f"{val:>12.4f}", end='')
    print()

Output:

Covariance Matrix:
               Feature 0   Feature 1   Feature 2
   Feature 0      0.8248     -0.1182      1.7435
   Feature 1     -0.1182      0.9095     -0.2552
   Feature 2      1.7435     -0.2552      3.9688

Handling Missing Data

Remove or impute missing values before computing covariance:

import numpy as np

# Data with missing values
data = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [3.0, np.nan],
    [4.0, 5.0]
])

# Option 1: Remove rows with any NaN
mask = ~np.isnan(data).any(axis=1)
clean_data = data[mask]
cov_clean = np.cov(clean_data, rowvar=False)
print("Covariance (complete cases):")
print(cov_clean)

# Option 2: Pairwise complete (more complex, use pandas)
import pandas as pd
df = pd.DataFrame(data, columns=['A', 'B'])
cov_pairwise = df.cov()
print("\nPairwise covariance:")
print(cov_pairwise)

Output:

Covariance (complete cases):
[[4.5 4.5]
[4.5 4.5]]

Pairwise covariance:
          A         B
A  2.333333  4.500000
B  4.500000  2.333333

Weighted Covariance

Apply weights to observations with aweights or fweights:

import numpy as np

data = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
    [7, 8]
])

# Observation weights (e.g., reliability scores)
weights = np.array([1, 2, 2, 1])

# Weighted covariance
weighted_cov = np.cov(data, rowvar=False, aweights=weights)
print("Weighted covariance:")
print(weighted_cov)

Output:

Weighted covariance:
[[5.07692308 5.07692308]
 [5.07692308 5.07692308]]
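fweights behaves differently from aweights: it treats each weight as an integer repeat count, so covariance with fweights equals the covariance of the data with rows physically duplicated. A quick demonstration with the same data:

```python
import numpy as np

data = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
    [7, 8]
])
counts = np.array([1, 2, 2, 1])  # each row observed this many times

# Frequency-weighted covariance
cov_fw = np.cov(data, rowvar=False, fweights=counts)

# Equivalent: physically repeat each row by its count
repeated = np.repeat(data, counts, axis=0)
cov_rep = np.cov(repeated, rowvar=False)

print(np.allclose(cov_fw, cov_rep))  # True
```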

Parameter Reference

Parameter   Default   Description
rowvar      True      If True, rows are variables; if False, columns are variables
ddof        1         Delta degrees of freedom (1 for sample, 0 for population)
aweights    None      Observation weights (reliability weights)
fweights    None      Frequency weights (integer counts)

Summary

Use np.cov(data, rowvar=False) when your data has observations as rows and features as columns (the typical format). The diagonal contains variances, and off-diagonal elements show how variables co-vary.

For standardized comparison, use np.corrcoef() to get correlation coefficients bounded between -1 and 1.
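As a closing example, the covariance matrix is the starting point for the PCA mentioned at the top: its eigenvectors give the principal directions and its eigenvalues the variance along each. A minimal sketch on synthetic data (np.linalg.eigh applies because covariance matrices are symmetric):

```python
import numpy as np

np.random.seed(0)

# 200 observations, 2 strongly correlated features
x = np.random.randn(200)
data = np.column_stack([x, 2 * x + 0.3 * np.random.randn(200)])

cov_matrix = np.cov(data, rowvar=False)

# eigh returns eigenvalues in ascending order for symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# The last eigenvector is the direction of maximum variance;
# its eigenvalue's share of the total is the variance explained
print(f"Variance explained by top component: "
      f"{eigenvalues[-1] / eigenvalues.sum():.1%}")
```

Because the second feature is almost a multiple of the first, nearly all the variance falls along a single principal direction.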