Python Pandas: How to Create a Correlation Matrix Using Pandas

A correlation matrix is a table that shows the correlation coefficients between multiple variables in a dataset. Each cell represents the strength and direction of the relationship between two variables, with values ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation); a value of 0 indicates no linear relationship.

Correlation matrices are essential in data analysis and machine learning for identifying relationships between features, detecting multicollinearity, selecting relevant variables, and understanding data patterns. In this guide, you'll learn multiple methods to create correlation matrices using pandas, NumPy, and SciPy.

Using DataFrame.corr() (Pearson - Default)

The simplest way to create a correlation matrix is using pandas' built-in corr() method, which computes the Pearson correlation coefficient by default:

import pandas as pd

df = pd.DataFrame({
    'Temperature': [30, 32, 35, 28, 40],
    'Ice_Cream_Sales': [200, 220, 300, 180, 350],
    'Heating_Cost': [150, 140, 100, 170, 80]
})

correlation_matrix = df.corr()
print(correlation_matrix)

Output:

                 Temperature  Ice_Cream_Sales  Heating_Cost
Temperature         1.000000         0.983057     -0.979213
Ice_Cream_Sales     0.983057         1.000000     -0.992851
Heating_Cost       -0.979213        -0.992851      1.000000

How to interpret this:

  • Temperature ↔ Ice_Cream_Sales: 0.98 - Strong positive correlation. As temperature rises, ice cream sales increase.
  • Temperature ↔ Heating_Cost: -0.98 - Strong negative correlation. As temperature rises, heating costs decrease.
  • Diagonal values are always 1.0 - Every variable perfectly correlates with itself.

Correlation Strength Guide

Range         Interpretation
0.8 to 1.0    Very strong
0.6 to 0.8    Strong
0.4 to 0.6    Moderate
0.2 to 0.4    Weak
0.0 to 0.2    Very weak / none

The same ranges apply for negative correlations (just with a minus sign).
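These ranges translate directly into a small helper for labeling coefficients. The thresholds below are the conventions from the table above, not a pandas feature, so treat this as an illustrative sketch:

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the conventional ranges."""
    strength = abs(r)
    if strength >= 0.8:
        label = "very strong"
    elif strength >= 0.6:
        label = "strong"
    elif strength >= 0.4:
        label = "moderate"
    elif strength >= 0.2:
        label = "weak"
    else:
        return "very weak / none"  # sign is meaningless at this level
    sign = "negative" if r < 0 else "positive"
    return f"{label} {sign}"

print(correlation_strength(0.983))   # very strong positive
print(correlation_strength(-0.45))   # moderate negative
```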

Using Spearman and Kendall Correlation

Pearson measures linear relationships. For monotonic (consistently increasing or decreasing but not necessarily linear) relationships, use Spearman or Kendall:

Spearman Correlation

Spearman works on rank-ordered data and detects monotonic relationships:

import pandas as pd

df = pd.DataFrame({
    'Study_Hours': [1, 2, 3, 4, 5],
    'Exam_Score': [50, 55, 70, 80, 95],
    'Stress_Level': [8, 7, 5, 4, 2]
})

spearman_corr = df.corr(method='spearman')
print("Spearman Correlation:")
print(spearman_corr)

Output:

Spearman Correlation:
              Study_Hours  Exam_Score  Stress_Level
Study_Hours           1.0         1.0          -1.0
Exam_Score            1.0         1.0          -1.0
Stress_Level         -1.0        -1.0           1.0

Kendall Correlation

Kendall is more robust with small sample sizes and outliers:

import pandas as pd

df = pd.DataFrame({
    'Study_Hours': [1, 2, 3, 4, 5],
    'Exam_Score': [50, 55, 70, 80, 95],
    'Stress_Level': [8, 7, 5, 4, 2]
})

kendall_corr = df.corr(method='kendall')
print("Kendall Correlation:")
print(kendall_corr)

Output:

Kendall Correlation:
              Study_Hours  Exam_Score  Stress_Level
Study_Hours           1.0         1.0          -1.0
Exam_Score            1.0         1.0          -1.0
Stress_Level         -1.0        -1.0           1.0

When to Use Each Method
  • Pearson: When you expect linear relationships and data is normally distributed.
  • Spearman: When relationships are monotonic but not necessarily linear, or data has outliers.
  • Kendall: When dealing with small datasets or ordinal data, and you want more robust results.
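
The difference between the methods is easy to see on monotonic but non-linear data. This small illustrative dataset (not from the examples above) grows exponentially:

```python
import pandas as pd

# y doubles at every step: perfectly monotonic, but not linear
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 8, 16, 32]
})

print(df.corr().loc['x', 'y'])                    # Pearson: ≈ 0.93
print(df.corr(method='spearman').loc['x', 'y'])   # Spearman: exactly 1.0
```

Pearson is pulled below 1.0 because the relationship is curved, while Spearman, which works on ranks, reports a perfect monotonic relationship.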

Visualizing the Correlation Matrix with a Heatmap

Numbers alone can be hard to interpret. A heatmap makes patterns immediately visible:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Temperature': [30, 32, 35, 28, 40, 38, 25, 33],
    'Humidity': [80, 75, 60, 85, 50, 55, 90, 70],
    'Wind_Speed': [10, 12, 15, 8, 20, 18, 5, 11],
    'Rainfall': [5, 4, 1, 6, 0, 1, 8, 3]
})

corr_matrix = df.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,       # Show correlation values
    fmt='.2f',        # Two decimal places
    cmap='coolwarm',  # Blue (negative) to red (positive)
    center=0,         # Center the color scale at 0
    square=True,      # Square cells
    linewidths=0.5    # Grid lines between cells
)
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
plt.show()

The heatmap colors immediately reveal which variables are strongly correlated (dark red or dark blue) and which have no relationship (white/light colors).
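
Because the matrix is symmetric, the upper triangle duplicates the lower one. A common refinement, sketched here with the same data, is to hide the redundant half using seaborn's mask parameter:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
    'Temperature': [30, 32, 35, 28, 40, 38, 25, 33],
    'Humidity': [80, 75, 60, 85, 50, 55, 90, 70],
    'Wind_Speed': [10, 12, 15, 8, 20, 18, 5, 11],
    'Rainfall': [5, 4, 1, 6, 0, 1, 8, 3]
})
corr_matrix = df.corr()

# Boolean mask: True cells are hidden (here, the diagonal and above)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', center=0, square=True, linewidths=0.5)
plt.title('Lower-Triangle Correlation Heatmap')
plt.tight_layout()
plt.show()
```

Note that `np.triu` with the default `k=0` also hides the diagonal of 1.0s; use `np.triu(..., k=1)` if you want to keep it.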

Using NumPy for Correlation

numpy.corrcoef() computes the Pearson correlation matrix directly on arrays. Wrap the result in a DataFrame for readability:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature_1': [7, 9, 5, 8, 6],
    'Feature_2': [1, 2, 3, 4, 5],
    'Feature_3': [10, 9, 8, 7, 6]
})

# Compute correlation (transpose so columns become rows for corrcoef)
corr_array = np.corrcoef(df.values.T)

# Convert back to labeled DataFrame
corr_df = pd.DataFrame(corr_array, index=df.columns, columns=df.columns)
print(corr_df.round(3))

Output:

           Feature_1  Feature_2  Feature_3
Feature_1        1.0       -0.3        0.3
Feature_2       -0.3        1.0       -1.0
Feature_3        0.3       -1.0        1.0
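
By default np.corrcoef treats each row as a variable, which is why the example above transposes. Equivalently, you can pass rowvar=False (a standard NumPy parameter) and skip the transpose:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Feature_1': [7, 9, 5, 8, 6],
    'Feature_2': [1, 2, 3, 4, 5],
    'Feature_3': [10, 9, 8, 7, 6]
})

# rowvar=False tells corrcoef that each COLUMN is a variable
corr_array = np.corrcoef(df.values, rowvar=False)
corr_df = pd.DataFrame(corr_array, index=df.columns, columns=df.columns)
print(corr_df.round(3))
```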

Using SciPy for Correlation with P-Values

When you need statistical significance (p-values) alongside correlation coefficients, use scipy.stats.pearsonr:

import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 2, 3, 2]
})

corr_matrix = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
pval_matrix = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)

for col1 in df.columns:
    for col2 in df.columns:
        corr, pval = pearsonr(df[col1], df[col2])
        corr_matrix.loc[col1, col2] = round(corr, 3)
        pval_matrix.loc[col1, col2] = round(pval, 4)
print("Correlation Coefficients:")
print(corr_matrix)
print("\nP-Values:")
print(pval_matrix)

Output:

Correlation Coefficients:
     A    B    C
A  1.0 -1.0  0.0
B -1.0  1.0  0.0
C  0.0  0.0  1.0

P-Values:
     A    B    C
A  0.0  0.0  1.0
B  0.0  0.0  1.0
C  1.0  1.0  0.0

Interpreting P-Values

A p-value below 0.05 generally indicates that the correlation is statistically significant. In this example:

  • A ↔ B: p ≈ 0 - statistically significant; the correlation of -1.0 is very unlikely to be due to chance.
  • A ↔ C: p = 1.0 - not significant; the correlation of 0.0 indicates no linear relationship.
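
A practical use of the p-value matrix is to blank out coefficients that fail the significance test. This sketch rebuilds both matrices from the example above and keeps only entries with p < 0.05, using DataFrame.where:

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 2, 3, 2]
})

corr = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
pval = pd.DataFrame(index=df.columns, columns=df.columns, dtype=float)
for c1 in df.columns:
    for c2 in df.columns:
        r, p = pearsonr(df[c1], df[c2])
        corr.loc[c1, c2], pval.loc[c1, c2] = r, p

# Replace non-significant coefficients with NaN at the 5% level
significant = corr.where(pval < 0.05)
print(significant)
```

Insignificant cells become NaN, so downstream code (or a heatmap) naturally skips them.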

Filtering Strong Correlations

To extract only the pairs with strong correlations:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Height': [170, 165, 180, 175, 160],
    'Weight': [70, 60, 85, 75, 55],
    'Age': [25, 30, 35, 28, 45],
    'Income': [50000, 55000, 60000, 52000, 48000]
})

corr_matrix = df.corr()

# Get upper triangle (avoid duplicates)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find pairs with correlation above threshold
threshold = 0.7
strong_pairs = [
    (col, upper.index[row], upper.iloc[row, col_idx])
    for col_idx, col in enumerate(upper.columns)
    for row in range(len(upper))
    if abs(upper.iloc[row, col_idx]) > threshold
]

print(f"Strongly correlated pairs (|r| > {threshold}):")
for col1, col2, corr in strong_pairs:
    print(f"  {col1} ↔ {col2}: {corr:.3f}")

Output:

Strongly correlated pairs (|r| > 0.7):
Weight ↔ Height: 0.993
Income ↔ Height: 0.708
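
The nested list comprehension can be written more compactly as a pandas pipeline: take the upper triangle, flatten it into labeled pairs with stack(), then filter. A sketch using the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Height': [170, 165, 180, 175, 160],
    'Weight': [70, 60, 85, 75, 55],
    'Age': [25, 30, 35, 28, 45],
    'Income': [50000, 55000, 60000, 52000, 48000]
})

corr = df.corr()
# Keep only the upper triangle (k=1 excludes the diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# stack() drops the NaN cells and yields a Series indexed by (var1, var2)
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.7].sort_values(ascending=False)
print(strong)
```

The result is a Series keyed by variable pairs, which is convenient for further processing or export.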

Quick Comparison of Methods

Method                      Type        P-Values  Non-Linear  Best For
df.corr() (Pearson)         Linear      No        No          Standard linear relationships
df.corr(method='spearman')  Rank-based  No        Monotonic   Non-linear monotonic data
df.corr(method='kendall')   Rank-based  No        Monotonic   Small samples, ordinal data
np.corrcoef()               Linear      No        No          Fast computation on arrays
scipy.stats.pearsonr        Linear      Yes       No          Statistical significance testing

Conclusion

Creating a correlation matrix in pandas is straightforward and provides valuable insights into your data:

  • Use df.corr() for a quick Pearson correlation matrix - the most common and simplest approach.
  • Use Spearman or Kendall methods when dealing with non-linear relationships, outliers, or ordinal data.
  • Visualize with seaborn heatmaps to make patterns immediately apparent.
  • Use SciPy when you need p-values to assess statistical significance.
  • Filter strong correlations to focus on the most meaningful relationships in high-dimensional datasets.

Understanding correlations is a critical first step in feature selection, multicollinearity detection, and exploratory data analysis.