Python NumPy: How to Calculate a Covariance Matrix in NumPy
The covariance matrix measures how variables change together, revealing relationships between features in multivariate data. It's fundamental for Principal Component Analysis (PCA), portfolio optimization, and understanding feature dependencies.
Basic Covariance Calculation
Use np.cov() to compute the covariance between two variables:
import numpy as np
height = np.array([65, 70, 72, 68]) # inches
weight = np.array([120, 160, 180, 150]) # pounds
cov_matrix = np.cov(height, weight)
print(cov_matrix)
Output:
[[ 8.91666667 74.16666667]
[ 74.16666667 625. ]]
Interpreting the Matrix
height weight
height [ 8.92 74.17 ]
weight [ 74.17 625.00 ]
- Diagonal elements: Variance of each variable (8.92 for height, 625 for weight)
- Off-diagonal elements: Covariance between variables (74.17)
- Positive covariance: Variables increase together (taller people tend to weigh more)
- Negative covariance: One increases as the other decreases
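The numbers above can be reproduced by hand from the sample-covariance formula, the sum of products of deviations divided by N - 1 (a quick sanity-check sketch):

```python
import numpy as np

height = np.array([65, 70, 72, 68])
weight = np.array([120, 160, 180, 150])
n = len(height)

# Sample variance and covariance: divide by N - 1, matching np.cov's default
manual_var_h = np.sum((height - height.mean()) ** 2) / (n - 1)
manual_cov = np.sum((height - height.mean()) * (weight - weight.mean())) / (n - 1)

print(f"Manual height variance: {manual_var_h:.8f}")  # 8.91666667
print(f"Manual covariance: {manual_cov:.8f}")         # 74.16666667
```

Both values match the corresponding entries of the np.cov result above.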
The rowvar Parameter
By default, NumPy assumes each row is a variable and each column is an observation. For typical "tidy" data where rows are observations and columns are features, set rowvar=False:
import numpy as np
# Typical data format: rows = observations, columns = features
data = np.array([
[1.2, 2.5, 3.1], # Observation 1
[2.1, 2.9, 3.5], # Observation 2
[3.3, 3.8, 4.2], # Observation 3
[4.0, 4.5, 5.0] # Observation 4
])
# Features are in columns, so set rowvar=False
cov_matrix = np.cov(data, rowvar=False)
print(f"Shape: {cov_matrix.shape}") # (3, 3) (one per feature)
print(cov_matrix)
Output:
Shape: (3, 3)
[[1.55 1.10833333 1.02 ]
[1.10833333 0.80916667 0.74833333]
[1.02 0.74833333 0.69666667]]
Forgetting rowvar=False when your features are in columns produces incorrect results: you'll get covariance between observations instead of between features.
Visualizing rowvar Behavior
import numpy as np
# 4 observations, 2 features
data = np.array([
[1, 10],
[2, 20],
[3, 30],
[4, 40]
])
# Wrong: treats rows as variables (4x4 matrix)
wrong = np.cov(data)
print(f"rowvar=True shape: {wrong.shape}") # (4, 4)
# Correct: treats columns as variables (2x2 matrix)
correct = np.cov(data, rowvar=False)
print(f"rowvar=False shape: {correct.shape}") # (2, 2)
print(correct)
Output:
rowvar=True shape: (4, 4)
rowvar=False shape: (2, 2)
[[ 1.66666667 16.66666667]
[ 16.66666667 166.66666667]]
Covariance vs Correlation
Covariance depends on variable scales; correlation normalizes to [-1, 1]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Covariance (scale-dependent)
covariance = np.cov(x, y)[0, 1]
print(f"Covariance: {covariance:.2f}") # 1.50
# Correlation (scale-independent, -1 to 1)
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.2f}") # 0.77
Output:
Covariance: 1.50
Correlation: 0.77
When to Use Each
| Metric | Range | Use When |
|---|---|---|
| Covariance | -∞ to +∞ | You need actual variance units |
| Correlation | -1 to +1 | Comparing relationship strength |
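The scale dependence in the table is easy to demonstrate: rescaling one variable (say, a unit conversion) rescales the covariance but leaves the correlation untouched. A short sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Multiply x by 100, e.g. converting meters to centimeters
x_scaled = x * 100

cov_orig = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_orig = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"Covariance:  {cov_orig:.2f} -> {cov_scaled:.2f}")   # 1.50 -> 150.00
print(f"Correlation: {corr_orig:.2f} -> {corr_scaled:.2f}") # 0.77 -> 0.77
```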
Extracting Specific Values
Navigate the covariance matrix to extract specific relationships:
import numpy as np
# Three features
data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
cov_matrix = np.cov(data, rowvar=False)
# Variance of feature 0 (diagonal)
var_0 = cov_matrix[0, 0]
print(f"Variance of feature 0: {var_0:.2f}")
# Covariance between features 0 and 1
cov_01 = cov_matrix[0, 1]
print(f"Covariance (0, 1): {cov_01:.2f}")
# Extract all variances (diagonal)
variances = np.diag(cov_matrix)
print(f"All variances: {variances}")
Output:
Variance of feature 0: 15.00
Covariance (0, 1): 15.00
All variances: [15. 15. 15.]
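Beyond single lookups, the off-diagonal entries can be scanned programmatically, for example to find the most strongly co-varying feature pair with np.triu_indices (a sketch; in this toy dataset all off-diagonal covariances are equal, so the first pair is reported):

```python
import numpy as np

data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])
cov_matrix = np.cov(data, rowvar=False)

# Upper-triangle indices above the diagonal (k=1) cover each pair once
i, j = np.triu_indices(cov_matrix.shape[0], k=1)
strongest = np.argmax(np.abs(cov_matrix[i, j]))

print(f"Strongest pair: ({i[strongest]}, {j[strongest]}), "
      f"covariance {cov_matrix[i[strongest], j[strongest]]:.2f}")
```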
Sample vs Population Covariance
By default, NumPy calculates sample covariance (divides by N-1). Use ddof=0 for population covariance:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Sample covariance (default, ddof=1)
sample_cov = np.cov(x, y, ddof=1)
print(f"Sample covariance: {sample_cov[0, 1]:.4f}")
# Population covariance (ddof=0)
pop_cov = np.cov(x, y, ddof=0)
print(f"Population covariance: {pop_cov[0, 1]:.4f}")
Output:
Sample covariance: 1.5000
Population covariance: 1.2000
Use ddof=1 (default) for sample data when estimating population parameters. Use ddof=0 only when you have the entire population.
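The only difference between the two is the denominator, which a manual sketch makes explicit:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Sum of products of deviations is 6.0 for this data
prods = (x - x.mean()) * (y - y.mean())

print(prods.sum() / (len(x) - 1))  # 1.5 -> sample covariance (ddof=1)
print(prods.sum() / len(x))        # 1.2 -> population covariance (ddof=0)
```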
Practical Example: Feature Analysis
Analyze relationships between features in a dataset:
import numpy as np
np.random.seed(42)
# Simulated dataset: 100 observations, 3 features
# Feature 2 is correlated with Feature 0
n_samples = 100
feature_0 = np.random.randn(n_samples)
feature_1 = np.random.randn(n_samples) # Independent
feature_2 = feature_0 * 2 + np.random.randn(n_samples) * 0.5 # Correlated
data = np.column_stack([feature_0, feature_1, feature_2])
# Compute covariance matrix
cov_matrix = np.cov(data, rowvar=False)
# Display with labels
features = ['Feature 0', 'Feature 1', 'Feature 2']
print("Covariance Matrix:")
print(f"{'':>12}", end='')
for f in features:
print(f"{f:>12}", end='')
print()
for i, row in enumerate(cov_matrix):
print(f"{features[i]:>12}", end='')
for val in row:
print(f"{val:>12.4f}", end='')
print()
Output:
Covariance Matrix:
Feature 0 Feature 1 Feature 2
Feature 0 0.8248 -0.1182 1.7435
Feature 1 -0.1182 0.9095 -0.2552
Feature 2 1.7435 -0.2552 3.9688
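To judge the strength of these relationships on a common scale, the covariance matrix can be normalized into a correlation matrix by dividing each entry by the product of the two standard deviations. This is a sketch of the normalization; np.corrcoef(data, rowvar=False) produces the same result directly:

```python
import numpy as np

np.random.seed(42)
n_samples = 100
feature_0 = np.random.randn(n_samples)
feature_1 = np.random.randn(n_samples)
feature_2 = feature_0 * 2 + np.random.randn(n_samples) * 0.5

data = np.column_stack([feature_0, feature_1, feature_2])
cov_matrix = np.cov(data, rowvar=False)

# corr[i, j] = cov[i, j] / (std_i * std_j); stds come from the diagonal
stds = np.sqrt(np.diag(cov_matrix))
corr_matrix = cov_matrix / np.outer(stds, stds)

print(np.allclose(corr_matrix, np.corrcoef(data, rowvar=False)))  # True
```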
Handling Missing Data
Remove or impute missing values before computing covariance:
import numpy as np
# Data with missing values
data = np.array([
[1.0, 2.0],
[np.nan, 3.0],
[3.0, np.nan],
[4.0, 5.0]
])
# Option 1: Remove rows with any NaN
mask = ~np.isnan(data).any(axis=1)
clean_data = data[mask]
cov_clean = np.cov(clean_data, rowvar=False)
print("Covariance (complete cases):")
print(cov_clean)
# Option 2: Pairwise complete (more complex, use pandas)
import pandas as pd
df = pd.DataFrame(data, columns=['A', 'B'])
cov_pairwise = df.cov()
print("\nPairwise covariance:")
print(cov_pairwise)
Output:
Covariance (complete cases):
[[4.5 4.5]
[4.5 4.5]]
Pairwise covariance:
A B
A 2.333333 4.500000
B 4.500000 2.333333
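If pandas isn't available, pairwise-complete covariance can be sketched in plain NumPy by masking NaNs separately for each pair of columns. Note that pair_cov here is a hypothetical helper, not a NumPy function:

```python
import numpy as np

def pair_cov(a, b, ddof=1):
    """Covariance of two columns using only rows where both are present."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[mask], b[mask]
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - ddof)

data = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [3.0, np.nan],
    [4.0, 5.0]
])

n = data.shape[1]
pairwise = np.array([[pair_cov(data[:, i], data[:, j]) for j in range(n)]
                     for i in range(n)])
print(pairwise)
```

For this data the result matches the pandas output above: variances of 2.333 on the diagonal and a pairwise covariance of 4.5.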
Weighted Covariance
Apply weights to observations with aweights or fweights:
import numpy as np
data = np.array([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
])
# Observation weights (e.g., reliability scores)
weights = np.array([1, 2, 2, 1])
# Weighted covariance
weighted_cov = np.cov(data, rowvar=False, aweights=weights)
print("Weighted covariance:")
print(weighted_cov)
Output:
Weighted covariance:
[[5.07692308 5.07692308]
[5.07692308 5.07692308]]
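The normalization with aweights is not a simple N - 1: per the NumPy documentation, with the default ddof=1 the denominator is v1 - v2/v1, where v1 is the sum of the weights and v2 the sum of their squares. A sketch reproducing the top-left entry above:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
w = np.array([1, 2, 2, 1])

v1, v2 = w.sum(), (w ** 2).sum()      # 6, 10
mean0 = np.sum(w * data[:, 0]) / v1   # weighted mean of column 0: 4.0
dev = data[:, 0] - mean0

# Weighted sum of squared deviations over (v1 - v2/v1)
manual = np.sum(w * dev * dev) / (v1 - v2 / v1)
print(f"{manual:.8f}")  # 5.07692308, matching weighted_cov[0, 0]
```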
Parameter Reference
| Parameter | Default | Description |
|---|---|---|
| rowvar | True | If True, rows are variables; if False, columns are variables |
| ddof | 1 | Delta degrees of freedom (1 for sample, 0 for population) |
| aweights | None | Observation weights (reliability weights) |
| fweights | None | Frequency weights (integer counts) |
Summary
Use np.cov(data, rowvar=False) when your data has observations as rows and features as columns (the typical format). The diagonal contains variances, and off-diagonal elements show how variables co-vary.
For standardized comparison, use np.corrcoef() to get correlation coefficients bounded between -1 and 1.