How to Perform Correspondence Analysis in Python
Correspondence Analysis (CA) is a dimensionality-reduction technique designed for categorical data organized in contingency tables. Much as PCA does for numerical data, CA maps relationships between row categories (e.g., regions) and column categories (e.g., products) onto a two-dimensional plot.
Setting Up the Environment
The prince library provides the standard implementation for CA and MCA in Python.
```bash
pip install prince pandas matplotlib
```
Preparing the Contingency Table
CA requires a cross-tabulation table where rows and columns represent categorical variables and cells contain frequency counts.
```python
import pandas as pd

# Rows: regions; columns: beverage preferences
data = {
    'Coffee': [50, 30, 20],
    'Tea': [30, 60, 10],
    'Soda': [10, 20, 80]
}
df = pd.DataFrame(data, index=['North', 'South', 'East'])
print(df)
```
Output:
```
       Coffee  Tea  Soda
North      50   30    10
South      30   60    20
East       20   10    80
```
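If you start from raw observations (one row per respondent) rather than pre-counted frequencies, `pd.crosstab` produces the same kind of table. The survey records below are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw survey responses: one row per respondent
survey = pd.DataFrame({
    "region":   ["North", "North", "South", "East", "East", "South"],
    "beverage": ["Coffee", "Tea", "Tea", "Soda", "Soda", "Coffee"],
})

# Cross-tabulate into a frequency table suitable for CA
table = pd.crosstab(survey["region"], survey["beverage"])
print(table)
```

Each cell holds the count of respondents with that (region, beverage) combination, exactly the input CA expects.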
Fitting the CA Model
Initialize and fit the model by specifying the number of components to extract:
```python
import prince

# Initialize CA with 2 dimensions
ca = prince.CA(n_components=2)
ca = ca.fit(df)
```
To examine the explained variance for each dimension:
```python
print(ca.eigenvalues_summary)
```
Creating the Biplot Visualization
The biplot displays both row and column categories in the same coordinate space, revealing their associations.
```python
import matplotlib.pyplot as plt

# Extract coordinates for custom plotting
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)
print("Row coordinates:\n", row_coords)
print("\nColumn coordinates:\n", col_coords)
```
prince also provides a built-in plotting method. Recent releases (0.8 and later) render it with Altair rather than matplotlib, so `ca.plot` returns a chart object, not a matplotlib Axes, and calls such as `plt.title` will not decorate it. The available keyword arguments (for toggling row/column markers and labels) vary between versions, so check the documentation for your installed release:

```python
# Returns an Altair chart in recent prince versions
chart = ca.plot(df)
chart.save("ca_biplot.html")  # or display `chart` directly in a notebook
```
For more control over the visualization:
```python
fig, ax = plt.subplots(figsize=(10, 8))

# Plot row points (regions)
ax.scatter(row_coords[0], row_coords[1], c='blue', s=100, label='Regions')
for idx, row in row_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='blue')

# Plot column points (beverages)
ax.scatter(col_coords[0], col_coords[1], c='red', s=100, marker='^', label='Beverages')
for idx, row in col_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='red')

ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.5)
ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.legend()
ax.set_title('CA Biplot: Regional Beverage Preferences')
plt.savefig("ca_biplot_custom.png", dpi=150, bbox_inches='tight')
plt.show()
```
Interpreting the Results
Understanding the biplot requires attention to several key aspects:
| Visual Cue | Interpretation |
|---|---|
| Proximity | Categories plotted close together have strong positive associations |
| Distance from origin | Points far from the center (0,0) have the most distinctive profiles and contribute most to the total inertia |
| Opposite positions | Categories on opposite sides of an axis are negatively correlated |
| Axis alignment | Points aligned along an axis share similar profiles on that dimension |
In our example, if "East" appears close to "Soda" on the biplot, this indicates that the East region has a notably higher preference for soda compared to other regions.
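That association can be checked numerically, without the plot, by comparing row profiles (each row's conditional distribution across beverages):

```python
import pandas as pd

df = pd.DataFrame(
    {"Coffee": [50, 30, 20], "Tea": [30, 60, 10], "Soda": [10, 20, 80]},
    index=["North", "South", "East"],
)

# Row profiles: each row divided by its total, so rows sum to 1
profiles = df.div(df.sum(axis=1), axis=0)
print(profiles.round(2))
```

East's profile is dominated by Soda (80 of 110 responses, roughly 73%), which is exactly why "East" and "Soda" land close together on the biplot.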
Extracting Additional Statistics
```python
# Total inertia (the CA analogue of total variance)
print(f"Total inertia: {ca.total_inertia_:.4f}")

# Share of inertia explained by each dimension
# (recent prince releases; older releases expose `explained_inertia_` instead)
print(f"Explained inertia: {ca.percentage_of_variance_}")

# Row and column contributions to each dimension
row_contrib = ca.row_contributions_
col_contrib = ca.column_contributions_
print("\nRow contributions:\n", row_contrib)
print("\nColumn contributions:\n", col_contrib)
```
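As a sanity check: total inertia equals the table's chi-square statistic divided by the grand total of counts, so it can be verified independently with scipy (assuming scipy is installed):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same contingency table as above
table = np.array([[50, 30, 10],
                  [30, 60, 20],
                  [20, 10, 80]])

chi2, p, dof, expected = chi2_contingency(table)
total_inertia = chi2 / table.sum()  # should match ca.total_inertia_
print(f"Chi-square: {chi2:.2f}, total inertia: {total_inertia:.4f}")
```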
When to Use CA vs MCA
- Use CA when you have a pre-computed contingency table (frequency counts between two categorical variables)
- Use MCA (Multiple Correspondence Analysis) when working with raw survey data containing multiple categorical variables that need to be analyzed simultaneously
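The connection between the two: MCA is, in essence, CA applied to the one-hot (indicator) encoding of the raw table. A minimal sketch of that encoding step, again with hypothetical survey data:

```python
import pandas as pd

# Hypothetical raw survey: one row per respondent, several categorical variables
survey = pd.DataFrame({
    "region":   ["North", "South", "East", "East", "North"],
    "beverage": ["Coffee", "Tea", "Soda", "Soda", "Tea"],
})

# One-hot indicator matrix: MCA operates on (a scaled version of) this table
indicator = pd.get_dummies(survey)
print(indicator.astype(int))
```

Each respondent becomes a row of 0s and 1s, with exactly one 1 per original variable; libraries such as prince handle this encoding internally when you call `prince.MCA(...).fit(survey)`.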
Summary
Correspondence Analysis transforms complex categorical relationships into interpretable visual maps. By projecting row and column categories onto shared dimensions, CA reveals association patterns that would otherwise remain hidden in large contingency tables.