Python Pandas: How to Calculate the Correlation Between Two Columns
Correlation measures the strength and direction of the linear relationship between two numerical variables. It produces a value between -1 and +1, where:
- +1 indicates a perfect positive correlation (as one increases, the other increases)
- -1 indicates a perfect negative correlation (as one increases, the other decreases)
- 0 indicates no linear correlation
Calculating correlation is a fundamental step in exploratory data analysis, feature selection, and understanding relationships within your data. This guide covers multiple methods to compute correlation in Python using Pandas, NumPy, and SciPy.
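Before reaching for a library call, it helps to see what the Pearson coefficient actually computes: the covariance of the two variables divided by the product of their standard deviations. A minimal sketch with NumPy and illustrative values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 4))  # 0.7746
```

The library methods below compute exactly this quantity, but handle alignment, missing values, and multiple columns for you.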
Sample Data
All examples in this guide use the following DataFrame:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
print(data)
Output:
math_score science_score english_score history_score
0 85 89 78 70
1 78 81 75 68
2 92 94 85 80
3 88 90 80 72
4 76 80 72 65
Using Series.corr() - Two Columns
The Series.corr() method calculates the Pearson correlation coefficient between two individual columns. This is the simplest approach when you only need the correlation between a specific pair:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr = data['math_score'].corr(data['science_score'])
print(f"Correlation between math and science: {corr:.4f}")
Output:
Correlation between math and science: 0.9932
A value of 0.99 indicates a very strong positive correlation: students who score higher in math also tend to score higher in science.
Specifying the Correlation Method
By default, corr() calculates Pearson correlation. You can also compute Spearman or Kendall correlation:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
# Pearson (default): measures linear relationship
pearson = data['math_score'].corr(data['science_score'], method='pearson')
# Spearman: measures monotonic relationship (rank-based)
spearman = data['math_score'].corr(data['science_score'], method='spearman')
# Kendall: measures ordinal association
kendall = data['math_score'].corr(data['science_score'], method='kendall')
print(f"Pearson: {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
print(f"Kendall: {kendall:.4f}")
Output:
Pearson: 0.9932
Spearman: 1.0000
Kendall: 1.0000
| Method | Best For |
|---|---|
| Pearson | Linear relationships with normally distributed data |
| Spearman | Monotonic relationships, ordinal data, or data with outliers |
| Kendall | Small sample sizes or ordinal data |
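To see where the methods diverge, consider values with a perfectly monotonic but non-linear relationship (a sketch with made-up data, not the sample DataFrame): Spearman reports a perfect 1.0 because the ranks agree, while Pearson falls short because the relationship is not a straight line.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
t = s ** 3  # monotonic but clearly non-linear

print(f"Pearson:  {s.corr(t):.4f}")                    # below 1.0
print(f"Spearman: {s.corr(t, method='spearman'):.4f}")  # exactly 1.0
```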
Using DataFrame.corr() - All Columns at Once
The DataFrame.corr() method computes a correlation matrix showing pairwise correlations between all numeric columns:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
correlation_matrix = data.corr()
print(correlation_matrix.round(4))
Output:
math_score science_score english_score history_score
math_score 1.0000 0.9932 0.9766 0.9269
science_score 0.9932 1.0000 0.9588 0.9046
english_score 0.9766 0.9588 1.0000 0.9821
history_score 0.9269 0.9046 0.9821 1.0000
Each cell shows the correlation between the row variable and the column variable. The diagonal is always 1.0 (each variable is perfectly correlated with itself).
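For wider DataFrames, it can be handy to flatten the matrix and rank the pairs. One way to sketch this with unstack(), dropping the self-correlations on the diagonal:

```python
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# Flatten the matrix into a Series of (row, column) pairs
pairs = data.corr().unstack()
# Drop self-correlations (where row label == column label)
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.sort_values(ascending=False).head(4))
```

Note that each pair appears twice (once in each direction), since the matrix is symmetric.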
Selecting Specific Columns
If you only want correlations for certain columns:
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
# Correlation between specific columns only
subset_corr = data[['math_score', 'science_score', 'english_score']].corr()
print(subset_corr.round(4))
Output:
math_score science_score english_score
math_score 1.0000 0.9932 0.9766
science_score 0.9932 1.0000 0.9588
english_score 0.9766 0.9588 1.0000
Extracting a Specific Value from the Matrix
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
correlation_matrix = data.corr()
# Get correlation between math and history from the matrix
corr_value = correlation_matrix.loc['math_score', 'history_score']
print(f"Math vs History: {corr_value:.4f}")
Output:
Math vs History: 0.9269
Using numpy.corrcoef()
NumPy's corrcoef() computes the Pearson correlation coefficient matrix. It's useful when you're already working with NumPy arrays:
import numpy as np
import pandas as pd
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr_matrix = np.corrcoef(data['math_score'], data['english_score'])
corr_value = corr_matrix[0, 1]
print(f"Correlation (NumPy): {corr_value:.4f}")
Output:
Correlation (NumPy): 0.9766
np.corrcoef() returns a 2×2 matrix: the correlation between the two inputs sits at position [0, 1] (equivalently [1, 0]), and the diagonal entries are always 1.0.
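np.corrcoef() can also build the full matrix in one call. It treats each row of its input as one variable, so the DataFrame needs to be transposed first:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76],
    'science_score': [89, 81, 94, 90, 80],
    'english_score': [78, 75, 85, 80, 72],
    'history_score': [70, 68, 80, 72, 65]
})

# corrcoef expects variables as rows, so transpose the (5, 4) array to (4, 5)
full_matrix = np.corrcoef(data.to_numpy().T)
print(np.round(full_matrix, 4))
```

The result matches DataFrame.corr(), except that it comes back as a plain NumPy array without row and column labels.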
Using scipy.stats.pearsonr() - With Statistical Significance
SciPy's pearsonr() returns both the correlation coefficient and the p-value, which tells you whether the correlation is statistically significant:
import pandas as pd
from scipy.stats import pearsonr
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr, p_value = pearsonr(data['science_score'], data['history_score'])
print(f"Correlation: {corr:.4f}")
print(f"P-value: {p_value:.4f}")
Output:
Correlation: 0.9046
P-value: 0.0349
Interpreting the P-value
| P-value | Interpretation |
|---|---|
| < 0.01 | Very strong evidence of correlation |
| < 0.05 | Strong evidence of correlation |
| < 0.10 | Weak evidence of correlation |
| ≥ 0.10 | No significant evidence of correlation |
A p-value of 0.035 (< 0.05) indicates the correlation between science and history scores is statistically significant.
SciPy also provides spearmanr() and kendalltau() for non-parametric correlation with p-values:
import pandas as pd
from scipy.stats import spearmanr, kendalltau
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
corr, p = spearmanr(data['math_score'], data['science_score'])
print(f"Spearman: {corr:.4f}, p-value: {p:.4f}")
corr, p = kendalltau(data['math_score'], data['science_score'])
print(f"Kendall: {corr:.4f}, p-value: {p:.4f}")
Output:
Spearman: 1.0000, p-value: 0.0000
Kendall: 1.0000, p-value: 0.0167
Visualizing Correlation with a Heatmap
A visual correlation matrix makes patterns immediately apparent:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.DataFrame({
'math_score': [85, 78, 92, 88, 76],
'science_score': [89, 81, 94, 90, 80],
'english_score': [78, 75, 85, 80, 72],
'history_score': [70, 68, 80, 72, 65]
})
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm',
vmin=-1, vmax=1, fmt='.3f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()
This produces a color-coded matrix where:
- Red indicates strong positive correlation
- Blue indicates strong negative correlation
- White/pale indicates weak or no correlation
Interpreting Correlation Values
| Range | Strength |
|---|---|
| 0.90 to 1.00 | Very strong positive |
| 0.70 to 0.89 | Strong positive |
| 0.40 to 0.69 | Moderate positive |
| 0.10 to 0.39 | Weak positive |
| -0.10 to 0.10 | Negligible |
| -0.39 to -0.10 | Weak negative |
| -0.69 to -0.40 | Moderate negative |
| -0.89 to -0.70 | Strong negative |
| -1.00 to -0.90 | Very strong negative |
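The table above translates directly into a small helper for labelling coefficients (the thresholds follow the table; the function name is illustrative):

```python
def correlation_strength(r: float) -> str:
    """Map a correlation coefficient to a verbal strength label."""
    magnitude = abs(r)
    if magnitude < 0.10:
        return "negligible"
    sign = "positive" if r > 0 else "negative"
    if magnitude >= 0.90:
        return f"very strong {sign}"
    if magnitude >= 0.70:
        return f"strong {sign}"
    if magnitude >= 0.40:
        return f"moderate {sign}"
    return f"weak {sign}"

print(correlation_strength(0.9932))  # very strong positive
print(correlation_strength(-0.55))   # moderate negative
```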
A high correlation between two variables does not mean one causes the other. For example, ice cream sales and drowning incidents are correlated (both increase in summer), but ice cream doesn't cause drowning - the hidden variable is temperature.
Comparison of Methods
| Method | Returns P-value | Scope | Best For |
|---|---|---|---|
| Series.corr() | ❌ | Two columns | Quick pairwise check |
| DataFrame.corr() | ❌ | All columns | Overview of all relationships |
| numpy.corrcoef() | ❌ | Two arrays | NumPy-based workflows |
| scipy.stats.pearsonr() | ✅ | Two columns | Statistical significance testing |
Conclusion
- For a quick correlation check between two columns, Series.corr() is the simplest choice.
- To see relationships across an entire dataset, DataFrame.corr() provides a complete correlation matrix.
- When you need statistical significance (p-values) to validate your findings, use scipy.stats.pearsonr().
- For non-linear or rank-based relationships, switch to the Spearman or Kendall methods.
Combining these tools with a heatmap visualization gives you a comprehensive understanding of the relationships in your data.