
How to Perform Correspondence Analysis in Python

Correspondence Analysis (CA) is a dimensionality reduction technique designed for categorical data organized in contingency tables. Where PCA plays this role for numerical data, CA visualizes the relationships between row categories (e.g., regions) and column categories (e.g., products) on a two-dimensional map.
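
CA decomposes the chi-square statistic of independence for the table, so a preliminary chi-square test tells you whether there is any association worth mapping in the first place. A quick sketch with scipy, using a hypothetical 3x3 table of counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = regions, columns = beverages
table = np.array([
    [50, 30, 10],
    [30, 60, 20],
    [20, 10, 80],
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}, dof = {dof}")
```

A small p-value indicates that rows and columns are associated, which is exactly the structure CA will then decompose into dimensions.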

Setting Up the Environment

The prince library provides a widely used implementation of both CA and MCA in Python.

pip install prince pandas matplotlib

Preparing the Contingency Table

CA requires a cross-tabulation table where rows and columns represent categorical variables and cells contain frequency counts.

import pandas as pd

# Rows: Regions, Columns: Beverage preferences
data = {
    'Coffee': [50, 30, 20],
    'Tea': [30, 60, 10],
    'Soda': [10, 20, 80]
}
df = pd.DataFrame(data, index=['North', 'South', 'East'])
print(df)

Output:

       Coffee  Tea  Soda
North      50   30    10
South      30   60    20
East       20   10    80
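
If you start from raw observations (one row per respondent) rather than pre-computed counts, pandas' crosstab builds the contingency table for you. A small sketch with hypothetical survey rows:

```python
import pandas as pd

# Hypothetical raw survey data: one row per respondent
raw = pd.DataFrame({
    'region': ['North', 'North', 'South', 'East', 'East'],
    'beverage': ['Coffee', 'Tea', 'Tea', 'Soda', 'Soda']
})

# Cross-tabulate into frequency counts
table = pd.crosstab(raw['region'], raw['beverage'])
print(table)
```

The resulting table has one row per region, one column per beverage, and counts in the cells, which is exactly the input CA expects.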

Fitting the CA Model

Initialize and fit the model by specifying the number of components to extract:

import prince

# Initialize CA with 2 dimensions
ca = prince.CA(n_components=2)
ca = ca.fit(df)

To examine the explained variance for each dimension:

print(ca.eigenvalues_summary)
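
The "variance" in CA is inertia, and total inertia equals the table's chi-square statistic divided by the grand total. That identity can be checked with plain scipy, independently of prince (using the example table from above):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [50, 30, 10],
    [30, 60, 20],
    [20, 10, 80],
])

# Total inertia = chi-square statistic / grand total
chi2 = chi2_contingency(table)[0]
n = table.sum()
total_inertia = chi2 / n
print(f"Total inertia: {total_inertia:.4f}")
```

Each CA dimension explains a share of this total, which is what the eigenvalue summary reports.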

Creating the Biplot Visualization

The biplot displays both row and column categories in the same coordinate space, revealing their associations.

import matplotlib.pyplot as plt

# Extract coordinates for custom plotting
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)

print("Row coordinates:\n", row_coords)
print("\nColumn coordinates:\n", col_coords)

Using the built-in plotting method:

Note that in recent versions of prince (0.10 and later), plot() returns an Altair chart rather than a matplotlib Axes, so matplotlib calls such as plt.title() do not affect it, and the parameters are spelled show_column_markers / show_column_labels:

chart = ca.plot(
    df,
    x_component=0,
    y_component=1,
    show_row_markers=True,
    show_column_markers=True,
    show_row_labels=True,
    show_column_labels=True
)
chart  # displays inline in a notebook
chart.save("ca_biplot.png")  # saving to PNG requires the vl-convert-python package

For more control over the visualization:

fig, ax = plt.subplots(figsize=(10, 8))

# Plot row points (regions)
ax.scatter(row_coords[0], row_coords[1], c='blue', s=100, label='Regions')
for idx, row in row_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='blue')

# Plot column points (beverages)
ax.scatter(col_coords[0], col_coords[1], c='red', s=100, marker='^', label='Beverages')
for idx, row in col_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='red')

ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.5)
ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.legend()
ax.set_title('CA Biplot: Regional Beverage Preferences')
plt.savefig("ca_biplot_custom.png", dpi=150, bbox_inches='tight')
plt.show()

Interpreting the Results

Understanding the biplot requires attention to several key aspects:

  • Proximity: categories plotted close together have strong positive associations
  • Distance from origin: points far from the center (0, 0) contribute most to the overall variance
  • Opposite positions: categories on opposite sides of an axis are negatively correlated
  • Axis alignment: points aligned along an axis share similar profiles on that dimension
Tip: In our example, if "East" appears close to "Soda" on the biplot, this indicates that the East region has a notably higher preference for soda compared to other regions.
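
That proximity reading can be verified numerically in the row profiles: each row of the table divided by its total, giving every region's preference distribution. A sketch using the example table from earlier:

```python
import pandas as pd

# Example contingency table from earlier in the article
df = pd.DataFrame({
    'Coffee': [50, 30, 20],
    'Tea': [30, 60, 10],
    'Soda': [10, 20, 80]
}, index=['North', 'South', 'East'])

# Row profiles: each region's preference distribution (rows sum to 1)
profiles = df.div(df.sum(axis=1), axis=0)
print(profiles.round(2))
```

East's soda share (80/110, about 0.73) is far higher than any other region's, which is what its position near "Soda" on the biplot reflects.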

Extracting Additional Statistics

# Total inertia (the CA analogue of total variance)
print(f"Total inertia: {ca.total_inertia_:.4f}")

# Share of inertia explained by each dimension
# (prince >= 0.10 exposes percentage_of_variance_;
#  older versions used explained_inertia_ instead)
print(f"Percentage of variance: {ca.percentage_of_variance_}")

# Row and column contributions to each dimension
row_contrib = ca.row_contributions_
col_contrib = ca.column_contributions_
print("\nRow contributions:\n", row_contrib)
print("\nColumn contributions:\n", col_contrib)

When to Use CA vs MCA

Note:
  • Use CA when you have a pre-computed contingency table (frequency counts between two categorical variables)
  • Use MCA (Multiple Correspondence Analysis) when working with raw survey data containing multiple categorical variables that need to be analyzed simultaneously

Summary

Correspondence Analysis transforms complex categorical relationships into interpretable visual maps. By projecting row and column categories onto shared dimensions, CA reveals association patterns that would otherwise remain hidden in large contingency tables.