Python Pandas: How to Calculate the Correlation Between Two Columns
Correlation measures the strength and direction of the linear relationship between two numerical variables. It produces a value between -1 and +1, where:
- +1 indicates a perfect positive correlation (as one increases, the other increases)
- -1 indicates a perfect negative correlation (as one increases, the other decreases)
- 0 indicates no linear correlation
Calculating correlation is a fundamental step in exploratory data analysis, feature selection, and understanding relationships within your data. This guide covers multiple methods to compute correlation in Python using Pandas, NumPy, and SciPy.
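Before reaching for a library call, it helps to see what the Pearson coefficient actually computes: the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with NumPy and illustrative values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 4))  # 0.7746
```

The library methods below compute exactly this quantity, but handle alignment, missing values, and multiple columns for you.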
Sample Data
All examples in this guide use the following DataFrame:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
print(data)
Output:
math_score science_score english_score history_score
0 85 89 78 70
1 78 81 75 68
2 92 94 85 80
3 88 90 80 72
4 76 80 72 65
Using Series.corr() - Two Columns
The Series.corr() method calculates the Pearson correlation coefficient between two individual columns. This is the simplest approach when you only need the correlation between a specific pair:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr = data['math_score'].corr(data['science_score'])
print(f"Correlation between math and science: {corr:.4f}")
Output:
Correlation between math and science: 0.9932
A value of 0.99 indicates a very strong positive correlation: students who score higher in math also tend to score higher in science.
Specifying the Correlation Method
By default, corr() calculates Pearson correlation. You can also compute Spearman or Kendall correlation:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
# Pearson (default): measures linear relationship
pearson = data['math_score'].corr(data['science_score'], method='pearson')
# Spearman: measures monotonic relationship (rank-based)
spearman = data['math_score'].corr(data['science_score'], method='spearman')
# Kendall: measures ordinal association
kendall = data['math_score'].corr(data['science_score'], method='kendall')
print(f"Pearson: {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
print(f"Kendall: {kendall:.4f}")
Output:
Pearson: 0.9932
Spearman: 1.0000
Kendall: 1.0000
| Method | Best For |
|---|---|
| Pearson | Linear relationships with normally distributed data |
| Spearman | Monotonic relationships, ordinal data, or data with outliers |
| Kendall | Small sample sizes or ordinal data |
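To see where the methods diverge, consider values with a perfectly monotonic but non-linear relationship (a sketch with made-up data, not the sample DataFrame): Spearman reports a perfect 1.0 because the ranks agree, while Pearson falls short because the relationship is not a straight line.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
t = s ** 3  # monotonic but clearly non-linear

print(f"Pearson:  {s.corr(t):.4f}")                    # below 1.0
print(f"Spearman: {s.corr(t, method='spearman'):.4f}")  # exactly 1.0
```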
Using DataFrame.corr() - All Columns at Once
The DataFrame.corr() method computes a correlation matrix showing pairwise correlations between all numeric columns:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
correlation_matrix = data.corr()
print(correlation_matrix.round(4))
Output:
math_score science_score english_score history_score
math_score 1.0000 0.9932 0.9766 0.9269
science_score 0.9932 1.0000 0.9588 0.9046
english_score 0.9766 0.9588 1.0000 0.9821
history_score 0.9269 0.9046 0.9821 1.0000
Each cell shows the correlation between the row variable and the column variable. The diagonal is always 1.0 (each variable is perfectly correlated with itself).
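For wider DataFrames, it can be handy to flatten the matrix and rank the pairs. One way to sketch this with unstack(), dropping the self-correlations on the diagonal:

```python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Flatten the matrix into a Series of (row, column) pairs
pairs = data.corr().unstack()
# Drop self-correlations (where row label == column label)
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.sort_values(ascending=False).head(4))
```

Note that each pair appears twice (once in each direction), since the matrix is symmetric.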
Selecting Specific Columns
If you only want correlations for certain columns:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
# Correlation between specific columns only
subset_corr = data[['math_score', 'science_score', 'english_score']].corr()
print(subset_corr.round(4))
Output:
math_score science_score english_score
math_score 1.0000 0.9932 0.9766
science_score 0.9932 1.0000 0.9588
english_score 0.9766 0.9588 1.0000
Extracting a Specific Value from the Matrix
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
correlation_matrix = data.corr()
# Get correlation between math and history from the matrix
corr_value = correlation_matrix.loc['math_score', 'history_score']
print(f"Math vs History: {corr_value:.4f}")
Output:
Math vs History: 0.9269
Using numpy.corrcoef()
NumPy's corrcoef() computes the Pearson correlation coefficient matrix. It's useful when you're already working with NumPy arrays:
import numpy as np
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr_matrix = np.corrcoef(data['math_score'], data['english_score'])
corr_value = corr_matrix[0, 1]
print(f"Correlation (NumPy): {corr_value:.4f}")
Output:
Correlation (NumPy): 0.9766
np.corrcoef() returns a 2×2 matrix: the correlation between the two inputs sits at position [0, 1] (equivalently [1, 0]), and the diagonal entries are always 1.0.
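np.corrcoef() can also build the full matrix in one call. It treats each row of its input as one variable, so the DataFrame needs to be transposed first:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# corrcoef expects variables as rows, so transpose the (5, 4) array to (4, 5)
full_matrix = np.corrcoef(data.to_numpy().T)
print(np.round(full_matrix, 4))
```

The result matches DataFrame.corr(), except that it comes back as a plain NumPy array without row and column labels.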
Using scipy.stats.pearsonr() - With Statistical Significance
SciPy's pearsonr() returns both the correlation coefficient and the p-value, which tells you whether the correlation is statistically significant:
import pandas as pd
from scipy.stats import pearsonr
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr, p_value = pearsonr(data['science_score'], data['history_score'])
print(f"Correlation: {corr:.4f}")
print(f"P-value: {p_value:.4f}")
Output:
Correlation: 0.9046
P-value: 0.0349
Interpreting the P-value
| P-value | Interpretation |
|---|---|
| < 0.01 | Very strong evidence of correlation |
| < 0.05 | Strong evidence of correlation |
| < 0.10 | Weak evidence of correlation |
| ≥ 0.10 | No significant evidence of correlation |
A p-value of 0.035 (< 0.05) indicates the correlation between science and history scores is statistically significant.
SciPy also provides spearmanr() and kendalltau() for non-parametric correlation with p-values:
import pandas as pd
from scipy.stats import spearmanr, kendalltau
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr, p = spearmanr(data['math_score'], data['science_score'])
print(f"Spearman: {corr:.4f}, p-value: {p:.4f}")
corr, p = kendalltau(data['math_score'], data['science_score'])
print(f"Kendall: {corr:.4f}, p-value: {p:.4f}")
Output:
Spearman: 1.0000, p-value: 0.0000
Kendall: 1.0000, p-value: 0.0167
Visualizing Correlation with a Heatmap
A visual correlation matrix makes patterns immediately apparent:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm',
vmin=-1, vmax=1, fmt='.3f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
This produces a color-coded matrix where:
- Red indicates strong positive correlation
- Blue indicates strong negative correlation
- White/pale indicates weak or no correlation
Interpreting Correlation Values
| Range | Strength |
|---|---|
| 0.90 to 1.00 | Very strong positive |
| 0.70 to 0.89 | Strong positive |
| 0.40 to 0.69 | Moderate positive |
| 0.10 to 0.39 | Weak positive |
| -0.10 to 0.10 | Negligible |
| -0.39 to -0.10 | Weak negative |
| -0.69 to -0.40 | Moderate negative |
| -0.89 to -0.70 | Strong negative |
| -1.00 to -0.90 | Very strong negative |
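The table above translates directly into a small helper for labelling coefficients (the thresholds follow the table; the function name is illustrative):

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to a verbal strength label."""
    magnitude = abs(r)
    if magnitude < 0.10:
        return "negligible"
    sign = "positive" if r > 0 else "negative"
    if magnitude >= 0.90:
        return f"very strong {sign}"
    if magnitude >= 0.70:
        return f"strong {sign}"
    if magnitude >= 0.40:
        return f"moderate {sign}"
    return f"weak {sign}"

print(correlation_strength(0.9932))  # very strong positive
print(correlation_strength(-0.55))   # moderate negative
```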
A high correlation between two variables does not mean one causes the other. For example, ice cream sales and drowning incidents are correlated (both increase in summer), but ice cream doesn't cause drowning - the hidden variable is temperature.
Comparison of Methods
| Method | Returns P-value | Scope | Best For |
|---|---|---|---|
| Series.corr() | ❌ | Two columns | Quick pairwise check |
| DataFrame.corr() | ❌ | All columns | Overview of all relationships |
| numpy.corrcoef() | ❌ | Two arrays | NumPy-based workflows |
| scipy.stats.pearsonr() | ✅ | Two columns | Statistical significance testing |
Conclusion
- For a quick correlation check between two columns, Series.corr() is the simplest choice.
- To see relationships across an entire dataset, DataFrame.corr() provides a complete correlation matrix.
- When you need statistical significance (p-values) to validate your findings, use scipy.stats.pearsonr().
- For non-linear or rank-based relationships, switch to the Spearman or Kendall methods.
Combining these tools with a heatmap visualization gives you a comprehensive understanding of the relationships in your data.