Python NumPy: How to Calculate a Covariance Matrix in NumPy
The covariance matrix measures how variables change together, revealing relationships between features in multivariate data. It's fundamental for Principal Component Analysis (PCA), portfolio optimization, and understanding feature dependencies.
Basic Covariance Calculation
Use np.cov() to compute the covariance between two variables:
import numpy as np
height = np.array([65, 70, 72, 68]) # inches
weight = np.array([120, 160, 180, 150]) # pounds
cov_matrix = np.cov(height, weight)
print(cov_matrix)
Output:
[[ 8.91666667 74.16666667]
[ 74.16666667 625. ]]
Interpreting the Matrix
height weight
height [ 8.92 74.17 ]
weight [ 74.17 625.00 ]
- Diagonal elements: Variance of each variable (8.92 for height, 625 for weight)
- Off-diagonal elements: Covariance between variables (74.17)
- Positive covariance: Variables increase together (taller people tend to weigh more)
- Negative covariance: One increases as the other decreases
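The numbers above can be reproduced by hand from the sample-covariance formula, the sum of products of deviations divided by N - 1 (a quick sanity-check sketch):

```python
import numpy as np

height = np.array([65, 70, 72, 68])
weight = np.array([120, 160, 180, 150])
n = len(height)

# Sample variance and covariance: divide by N - 1, matching np.cov's default
manual_var_h = np.sum((height - height.mean()) ** 2) / (n - 1)
manual_cov = np.sum((height - height.mean()) * (weight - weight.mean())) / (n - 1)

print(f"Manual height variance: {manual_var_h:.8f}")  # 8.91666667
print(f"Manual covariance: {manual_cov:.8f}")         # 74.16666667
```

Both values match the corresponding entries of the np.cov result above.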
The rowvar Parameter
By default, NumPy assumes each row is a variable and each column is an observation. For typical "tidy" data where rows are observations and columns are features, set rowvar=False:
import numpy as np
# Typical data format: rows = observations, columns = features
data = np.array([
[1.2, 2.5, 3.1], # Observation 1
[2.1, 2.9, 3.5], # Observation 2
[3.3, 3.8, 4.2], # Observation 3
[4.0, 4.5, 5.0] # Observation 4
])
# Features are in columns, so set rowvar=False
cov_matrix = np.cov(data, rowvar=False)
print(f"Shape: {cov_matrix.shape}") # (3, 3) (one per feature)
print(cov_matrix)
Output:
Shape: (3, 3)
[[1.55 1.10833333 1.02 ]
[1.10833333 0.80916667 0.74833333]
[1.02 0.74833333 0.69666667]]
Forgetting rowvar=False when your features are in columns produces incorrect results: you'll get covariance between observations instead of between features.
Visualizing rowvar Behavior
import numpy as np
# 4 observations, 2 features
data = np.array([
[1, 10],
[2, 20],
[3, 30],
[4, 40]
])
# Wrong: treats rows as variables (4x4 matrix)
wrong = np.cov(data)
print(f"rowvar=True shape: {wrong.shape}") # (4, 4)
# Correct: treats columns as variables (2x2 matrix)
correct = np.cov(data, rowvar=False)
print(f"rowvar=False shape: {correct.shape}") # (2, 2)
print(correct)
Output:
rowvar=True shape: (4, 4)
rowvar=False shape: (2, 2)
[[ 1.66666667 16.66666667]
[ 16.66666667 166.66666667]]
Covariance vs Correlation
Covariance depends on variable scales; correlation normalizes to [-1, 1]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Covariance (scale-dependent)
covariance = np.cov(x, y)[0, 1]
print(f"Covariance: {covariance:.2f}") # 1.50
# Correlation (scale-independent, -1 to 1)
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.2f}") # 0.77
Output:
Covariance: 1.50
Correlation: 0.77
When to Use Each
| Metric | Range | Use When |
|---|---|---|
| Covariance | -∞ to +∞ | You need actual variance units |
| Correlation | -1 to +1 | Comparing relationship strength |
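The scale dependence in the table is easy to demonstrate: rescaling one variable (say, a unit conversion) rescales the covariance but leaves the correlation untouched. A short sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Multiply x by 100, e.g. converting meters to centimeters
x_scaled = x * 100

cov_orig = np.cov(x, y)[0, 1]
cov_scaled = np.cov(x_scaled, y)[0, 1]
corr_orig = np.corrcoef(x, y)[0, 1]
corr_scaled = np.corrcoef(x_scaled, y)[0, 1]

print(f"Covariance:  {cov_orig:.2f} -> {cov_scaled:.2f}")   # 1.50 -> 150.00
print(f"Correlation: {corr_orig:.2f} -> {corr_scaled:.2f}") # 0.77 -> 0.77
```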
Extracting Specific Values
Navigate the covariance matrix to extract specific relationships:
import numpy as np
# Three features
data = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
cov_matrix = np.cov(data, rowvar=False)
# Variance of feature 0 (diagonal)
var_0 = cov_matrix[0, 0]
print(f"Variance of feature 0: {var_0:.2f}")
# Covariance between features 0 and 1
cov_01 = cov_matrix[0, 1]
print(f"Covariance (0, 1): {cov_01:.2f}")
# Extract all variances (diagonal)
variances = np.diag(cov_matrix)
print(f"All variances: {variances}")
Output:
Variance of feature 0: 15.00
Covariance (0, 1): 15.00
All variances: [15. 15. 15.]
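Beyond single lookups, the off-diagonal entries can be scanned programmatically, for example to find the most strongly co-varying feature pair with np.triu_indices (a sketch; in this toy dataset all off-diagonal covariances are equal, so the first pair is reported):

```python
import numpy as np

data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])
cov_matrix = np.cov(data, rowvar=False)

# Upper-triangle indices above the diagonal (k=1) cover each pair once
i, j = np.triu_indices(cov_matrix.shape[0], k=1)
strongest = np.argmax(np.abs(cov_matrix[i, j]))

print(f"Strongest pair: ({i[strongest]}, {j[strongest]}), "
      f"covariance {cov_matrix[i[strongest], j[strongest]]:.2f}")
```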
Sample vs Population Covariance
By default, NumPy calculates sample covariance (divides by N-1). Use ddof=0 for population covariance:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Sample covariance (default, ddof=1)
sample_cov = np.cov(x, y, ddof=1)
print(f"Sample covariance: {sample_cov[0, 1]:.4f}")
# Population covariance (ddof=0)
pop_cov = np.cov(x, y, ddof=0)
print(f"Population covariance: {pop_cov[0, 1]:.4f}")
Output:
Sample covariance: 1.5000
Population covariance: 1.2000
Use ddof=1 (default) for sample data when estimating population parameters. Use ddof=0 only when you have the entire population.
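The only difference between the two is the denominator, which a manual sketch makes explicit:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Sum of products of deviations is 6.0 for this data
prods = (x - x.mean()) * (y - y.mean())

print(prods.sum() / (len(x) - 1))  # 1.5 -> sample covariance (ddof=1)
print(prods.sum() / len(x))        # 1.2 -> population covariance (ddof=0)
```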
Practical Example: Feature Analysis
Analyze relationships between features in a dataset:
import numpy as np
np.random.seed(42)
# Simulated dataset: 100 observations, 3 features
# Feature 2 is correlated with Feature 0
n_samples = 100
feature_0 = np.random.randn(n_samples)
feature_1 = np.random.randn(n_samples) # Independent
feature_2 = feature_0 * 2 + np.random.randn(n_samples) * 0.5 # Correlated
data = np.column_stack([feature_0, feature_1, feature_2])
# Compute covariance matrix
cov_matrix = np.cov(data, rowvar=False)
# Display with labels
features = ['Feature 0', 'Feature 1', 'Feature 2']
print("Covariance Matrix:")
print(f"{'':>12}", end='')
for f in features:
print(f"{f:>12}", end='')
print()
for i, row in enumerate(cov_matrix):
print(f"{features[i]:>12}", end='')
for val in row:
print(f"{val:>12.4f}", end='')
print()
Output:
Covariance Matrix:
Feature 0 Feature 1 Feature 2
Feature 0 0.8248 -0.1182 1.7435
Feature 1 -0.1182 0.9095 -0.2552
Feature 2 1.7435 -0.2552 3.9688
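To judge the strength of these relationships on a common scale, the covariance matrix can be normalized into a correlation matrix by dividing each entry by the product of the two standard deviations. This is a sketch of the normalization; np.corrcoef(data, rowvar=False) produces the same result directly:

```python
import numpy as np

np.random.seed(42)
n_samples = 100
feature_0 = np.random.randn(n_samples)
feature_1 = np.random.randn(n_samples)
feature_2 = feature_0 * 2 + np.random.randn(n_samples) * 0.5

data = np.column_stack([feature_0, feature_1, feature_2])
cov_matrix = np.cov(data, rowvar=False)

# corr[i, j] = cov[i, j] / (std_i * std_j); stds come from the diagonal
stds = np.sqrt(np.diag(cov_matrix))
corr_matrix = cov_matrix / np.outer(stds, stds)

print(np.allclose(corr_matrix, np.corrcoef(data, rowvar=False)))  # True
```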
Handling Missing Data
Remove or impute missing values before computing covariance:
import numpy as np
# Data with missing values
data = np.array([
[1.0, 2.0],
[np.nan, 3.0],
[3.0, np.nan],
[4.0, 5.0]
])
# Option 1: Remove rows with any NaN
mask = ~np.isnan(data).any(axis=1)
clean_data = data[mask]
cov_clean = np.cov(clean_data, rowvar=False)
print("Covariance (complete cases):")
print(cov_clean)
# Option 2: Pairwise complete (more complex, use pandas)
import pandas as pd
df = pd.DataFrame(data, columns=['A', 'B'])
cov_pairwise = df.cov()
print("\nPairwise covariance:")
print(cov_pairwise)
Output:
Covariance (complete cases):
[[4.5 4.5]
[4.5 4.5]]
Pairwise covariance:
A B
A 2.333333 4.500000
B 4.500000 2.333333
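If pandas isn't available, pairwise-complete covariance can be sketched in plain NumPy by masking NaNs separately for each pair of columns. Note that pair_cov here is a hypothetical helper, not a NumPy function:

```python
import numpy as np

def pair_cov(a, b, ddof=1):
    """Covariance of two columns using only rows where both are present."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    a, b = a[mask], b[mask]
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - ddof)

data = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [3.0, np.nan],
    [4.0, 5.0]
])

n = data.shape[1]
pairwise = np.array([[pair_cov(data[:, i], data[:, j]) for j in range(n)]
                     for i in range(n)])
print(pairwise)
```

For this data the result matches the pandas output above: variances of 2.333 on the diagonal and a pairwise covariance of 4.5.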
Weighted Covariance
Apply weights to observations with aweights or fweights:
import numpy as np
data = np.array([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
])
# Observation weights (e.g., reliability scores)
weights = np.array([1, 2, 2, 1])
# Weighted covariance
weighted_cov = np.cov(data, rowvar=False, aweights=weights)
print("Weighted covariance:")
print(weighted_cov)
Output:
Weighted covariance:
[[5.07692308 5.07692308]
[5.07692308 5.07692308]]
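The normalization with aweights is not a simple N - 1: per the NumPy documentation, with the default ddof=1 the denominator is v1 - v2/v1, where v1 is the sum of the weights and v2 the sum of their squares. A sketch reproducing the top-left entry above:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
w = np.array([1, 2, 2, 1])

v1, v2 = w.sum(), (w ** 2).sum()      # 6, 10
mean0 = np.sum(w * data[:, 0]) / v1   # weighted mean of column 0: 4.0
dev = data[:, 0] - mean0

# Weighted sum of squared deviations over (v1 - v2/v1)
manual = np.sum(w * dev * dev) / (v1 - v2 / v1)
print(f"{manual:.8f}")  # 5.07692308, matching weighted_cov[0, 0]
```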
Parameter Reference
| Parameter | Default | Description |
|---|---|---|
| rowvar | True | If True, rows are variables; if False, columns are variables |
| ddof | 1 | Delta degrees of freedom (1 for sample, 0 for population) |
| aweights | None | Observation weights (reliability weights) |
| fweights | None | Frequency weights (integer counts) |
Summary
Use np.cov(data, rowvar=False) when your data has observations as rows and features as columns (the typical format). The diagonal contains variances, and off-diagonal elements show how variables co-vary.
For standardized comparison, use np.corrcoef() to get correlation coefficients bounded between -1 and 1.