How to Calculate Correlation Between Two Columns in Pandas

Correlation measures the strength and direction of the linear relationship between two numerical variables. It produces a value between -1 and +1, where:

  • +1 indicates a perfect positive correlation (as one increases, the other increases)
  • -1 indicates a perfect negative correlation (as one increases, the other decreases)
  • 0 indicates no linear correlation

Calculating correlation is a fundamental step in exploratory data analysis, feature selection, and understanding relationships within your data. This guide covers multiple methods to compute correlation in Python using Pandas, NumPy, and SciPy.

Sample Data

All examples in this guide use the following DataFrame:

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

print(data)

Output:

   math_score  science_score  english_score  history_score
0          85             89             78             70
1          78             81             75             68
2          92             94             85             80
3          88             90             80             72
4          76             80             72             65

Using Series.corr() - Two Columns

The Series.corr() method calculates the Pearson correlation coefficient between two individual columns. This is the simplest approach when you only need the correlation between a specific pair:

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr = data['math_score'].corr(data['science_score'])
print(f"Correlation between math and science: {corr:.4f}")

Output:

Correlation between math and science: 0.9932

A value of 0.99 indicates a very strong positive correlation: students who score higher in math also tend to score higher in science.
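
By default, corr() handles missing data by dropping any row where either value is NaN before computing the coefficient. A small sketch with illustrative values (not the grade data above):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([85, 78, np.nan, 88, 76])
s2 = pd.Series([89, 81, 94, 90, 80])

# The pair containing NaN is excluded pairwise; no manual dropna() is needed
corr = s1.corr(s2)
print(f"Correlation ignoring the missing pair: {corr:.4f}")
```

This pairwise NaN handling is convenient for real-world data, but check how much data is being dropped if many values are missing.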

Specifying the Correlation Method

By default, corr() calculates Pearson correlation. You can also compute Spearman or Kendall correlation:

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Pearson (default): measures linear relationship
pearson = data['math_score'].corr(data['science_score'], method='pearson')

# Spearman: measures monotonic relationship (rank-based)
spearman = data['math_score'].corr(data['science_score'], method='spearman')

# Kendall: measures ordinal association
kendall = data['math_score'].corr(data['science_score'], method='kendall')

print(f"Pearson: {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
print(f"Kendall: {kendall:.4f}")

Output:

Pearson: 0.9932
Spearman: 1.0000
Kendall: 1.0000

When to use which method

Method     Best For
Pearson    Linear relationships with normally distributed data
Spearman   Monotonic relationships, ordinal data, or data with outliers
Kendall    Small sample sizes or ordinal data
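
To see the difference in practice, compare Pearson and Spearman on data that is monotonic but not linear (a made-up series, separate from the grade data):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # strictly increasing, but curved rather than linear

# Pearson is pulled below 1 by the curvature; Spearman only sees the ranks
print(f"Pearson:  {x.corr(y):.4f}")                     # 0.9431
print(f"Spearman: {x.corr(y, method='spearman'):.4f}")  # 1.0000
```

Spearman reports a perfect monotonic relationship even though the points do not fall on a straight line, which is why it is the better choice for curved but consistently increasing (or decreasing) data.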

Using DataFrame.corr() - All Columns at Once

The DataFrame.corr() method computes a correlation matrix showing pairwise correlations between all numeric columns:

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

correlation_matrix = data.corr()
print(correlation_matrix.round(4))

Output:

               math_score  science_score  english_score  history_score
math_score         1.0000         0.9932         0.9766         0.9269
science_score      0.9932         1.0000         0.9588         0.9046
english_score      0.9766         0.9588         1.0000         0.9821
history_score      0.9269         0.9046         0.9821         1.0000

Each cell shows the correlation between the row variable and the column variable. The diagonal is always 1.0 (each variable is perfectly correlated with itself).
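
If the DataFrame also contains non-numeric columns, recent pandas versions (1.5 and later) require you to exclude them explicitly; the sketch below assumes such a version:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['A', 'B', 'C', 'D', 'E'],  # non-numeric column
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80]
})

# numeric_only=True skips the string column instead of raising an error
print(df.corr(numeric_only=True).round(4))
```

Older pandas versions dropped non-numeric columns silently, so adding numeric_only=True keeps the behavior explicit and version-proof.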

Selecting Specific Columns

If you only want correlations for certain columns:

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Correlation between specific columns only
subset_corr = data[['math_score', 'science_score', 'english_score']].corr()
print(subset_corr.round(4))

Output:

               math_score  science_score  english_score
math_score         1.0000         0.9932         0.9766
science_score      0.9932         1.0000         0.9588
english_score      0.9766         0.9588         1.0000
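
To correlate one column against all of the others without building the full matrix, DataFrame.corrwith() does the pairwise work in a single call:

```python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Correlation of every column with math_score, returned as a Series
print(data.corrwith(data['math_score']).round(4))
```

The result is one row of the correlation matrix, which is often all you need when evaluating candidate features against a single target.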

Extracting a Specific Value from the Matrix

import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

correlation_matrix = data.corr()

# Get correlation between math and history from the matrix
corr_value = correlation_matrix.loc['math_score', 'history_score']
print(f"Math vs History: {corr_value:.4f}")

Output:

Math vs History: 0.9269
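
On wider DataFrames it can also be useful to rank every pair by strength. One sketch using where() and unstack() (note that each pair appears twice, once in each order):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr = data.corr()

# Blank out the diagonal, flatten to a Series, and sort by strength
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
pairs = off_diagonal.unstack().dropna().sort_values(ascending=False)
print(pairs.head(2))
```

The top entries identify the most strongly correlated pair (math and science in this data), which is handy for spotting redundant features.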

Using numpy.corrcoef()

NumPy's corrcoef() computes the Pearson correlation coefficient matrix. It's useful when you're already working with NumPy arrays:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr_matrix = np.corrcoef(data['math_score'], data['english_score'])
corr_value = corr_matrix[0, 1]

print(f"Correlation (NumPy): {corr_value:.4f}")

Output:

Correlation (NumPy): 0.9766
note

When given two 1-D inputs, np.corrcoef() returns a 2×2 matrix, so the correlation between them is at position [0, 1] (or, symmetrically, [1, 0]).
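
np.corrcoef() can also produce the full correlation matrix in one call. By default it treats each row as a variable, so pass rowvar=False when your variables are in columns, as they are in a DataFrame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# rowvar=False: each column is a variable, each row an observation
full_matrix = np.corrcoef(data.to_numpy(), rowvar=False)
print(np.round(full_matrix, 4))
```

The values match DataFrame.corr(), but the result is a plain ndarray without row/column labels, so you index it by position rather than by name.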

Using scipy.stats.pearsonr() - With Statistical Significance

SciPy's pearsonr() returns both the correlation coefficient and the p-value, which tells you whether the correlation is statistically significant:

import pandas as pd
from scipy.stats import pearsonr

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr, p_value = pearsonr(data['science_score'], data['history_score'])

print(f"Correlation: {corr:.4f}")
print(f"P-value: {p_value:.4f}")

Output:

Correlation: 0.9046
P-value: 0.0349

Interpreting the P-value

P-value   Interpretation
< 0.01    Very strong evidence of correlation
< 0.05    Strong evidence of correlation
< 0.10    Weak evidence of correlation
≥ 0.10    No significant evidence of correlation

A p-value of 0.035 (< 0.05) indicates the correlation between science and history scores is statistically significant.
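
Beyond the p-value, the result object returned by pearsonr can also report a confidence interval for the coefficient. This sketch assumes SciPy 1.10 or newer, where the statistic/pvalue attributes and confidence_interval() method are available:

```python
import pandas as pd
from scipy.stats import pearsonr

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

result = pearsonr(data['science_score'], data['history_score'])

# The attributes mirror the tuple unpacking shown above
print(f"Correlation: {result.statistic:.4f}")
print(f"P-value: {result.pvalue:.4f}")

ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI: [{ci.low:.4f}, {ci.high:.4f}]")
```

With only 5 observations the interval is very wide, a useful reminder that a strong point estimate from a tiny sample carries a lot of uncertainty.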

note

SciPy also provides spearmanr() and kendalltau() for non-parametric correlation with p-values:

import pandas as pd
from scipy.stats import spearmanr, kendalltau

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

corr, p = spearmanr(data['math_score'], data['science_score'])
print(f"Spearman: {corr:.4f}, p-value: {p:.4f}")

corr, p = kendalltau(data['math_score'], data['science_score'])
print(f"Kendall: {corr:.4f}, p-value: {p:.4f}")

Output:

Spearman: 1.0000, p-value: 0.0000
Kendall: 1.0000, p-value: 0.0167

Visualizing Correlation with a Heatmap

A visual correlation matrix makes patterns immediately apparent:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm',
            vmin=-1, vmax=1, fmt='.3f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

This produces a color-coded matrix where:

  • Red indicates strong positive correlation
  • Blue indicates strong negative correlation
  • White/pale indicates weak or no correlation

Interpreting Correlation Values

Range             Strength
0.90 to 1.00      Very strong positive
0.70 to 0.89      Strong positive
0.40 to 0.69      Moderate positive
0.10 to 0.39      Weak positive
-0.10 to 0.10     Negligible
-0.39 to -0.10    Weak negative
-0.69 to -0.40    Moderate negative
-0.89 to -0.70    Strong negative
-1.00 to -0.90    Very strong negative

Correlation ≠ Causation

A high correlation between two variables does not mean one causes the other. For example, ice cream sales and drowning incidents are correlated (both increase in summer), but ice cream doesn't cause drowning; the hidden variable is the warm weather that drives both.

Comparison of Methods

Method              Returns P-value   Scope         Best For
Series.corr()       No                Two columns   Quick pairwise check
DataFrame.corr()    No                All columns   Overview of all relationships
numpy.corrcoef()    No                Two arrays    NumPy-based workflows
scipy.pearsonr()    Yes               Two columns   Statistical significance testing

Conclusion

  • For a quick correlation check between two columns, Series.corr() is the simplest choice.
  • To see relationships across an entire dataset, DataFrame.corr() provides a complete correlation matrix.
  • When you need statistical significance (p-values) to validate your findings, use scipy.stats.pearsonr().
  • For non-linear or rank-based relationships, switch to Spearman or Kendall methods.

Combining these tools with a heatmap visualization gives you a comprehensive understanding of the relationships in your data.