How to Perform Correspondence Analysis in Python
Correspondence Analysis (CA) is a dimensionality-reduction technique designed for categorical data organized in contingency tables. Much as PCA does for numerical data, CA maps relationships between row categories (e.g., regions) and column categories (e.g., products) onto a two-dimensional plot.
Setting Up the Environment
The prince library provides the standard implementation for CA and MCA in Python.
```bash
pip install prince pandas matplotlib
```
Preparing the Contingency Table
CA requires a cross-tabulation table where rows and columns represent categorical variables and cells contain frequency counts.
```python
import pandas as pd

# Rows: regions; columns: beverage preferences
data = {
    'Coffee': [50, 30, 20],
    'Tea': [30, 60, 10],
    'Soda': [10, 20, 80]
}
df = pd.DataFrame(data, index=['North', 'South', 'East'])
print(df)
```
Output:
```
       Coffee  Tea  Soda
North      50   30    10
South      30   60    20
East       20   10    80
```
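If you start from raw observations (one row per respondent) rather than pre-counted frequencies, `pd.crosstab` produces the same kind of table. The survey records below are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw survey responses: one row per respondent
survey = pd.DataFrame({
    "region":   ["North", "North", "South", "East", "East", "South"],
    "beverage": ["Coffee", "Tea", "Tea", "Soda", "Soda", "Coffee"],
})

# Cross-tabulate into a frequency table suitable for CA
table = pd.crosstab(survey["region"], survey["beverage"])
print(table)
```

Each cell holds the count of respondents with that (region, beverage) combination, exactly the input CA expects.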
Fitting the CA Model
Initialize and fit the model by specifying the number of components to extract:
```python
import prince

# Initialize CA with 2 dimensions
ca = prince.CA(n_components=2)
ca = ca.fit(df)
```
To examine the explained variance for each dimension:
```python
print(ca.eigenvalues_summary)
```
Creating the Biplot Visualization
The biplot displays both row and column categories in the same coordinate space, revealing their associations.
```python
import matplotlib.pyplot as plt

# Extract coordinates for custom plotting
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)
print("Row coordinates:\n", row_coords)
print("\nColumn coordinates:\n", col_coords)
```
prince also provides a built-in plotting method. Recent releases (0.8 and later) render it with Altair rather than matplotlib, so `ca.plot` returns a chart object, not a matplotlib Axes, and calls such as `plt.title` will not decorate it. The available keyword arguments (for toggling row/column markers and labels) vary between versions, so check the documentation for your installed release:

```python
# Returns an Altair chart in recent prince versions
chart = ca.plot(df)
chart.save("ca_biplot.html")  # or display `chart` directly in a notebook
```
For more control over the visualization:
```python
fig, ax = plt.subplots(figsize=(10, 8))

# Plot row points (regions)
ax.scatter(row_coords[0], row_coords[1], c='blue', s=100, label='Regions')
for idx, row in row_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='blue')

# Plot column points (beverages)
ax.scatter(col_coords[0], col_coords[1], c='red', s=100, marker='^', label='Beverages')
for idx, row in col_coords.iterrows():
    ax.annotate(idx, (row[0], row[1]), fontsize=12, color='red')

ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.5)
ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.legend()
ax.set_title('CA Biplot: Regional Beverage Preferences')
plt.savefig("ca_biplot_custom.png", dpi=150, bbox_inches='tight')
plt.show()
```
Interpreting the Results
Understanding the biplot requires attention to several key aspects:
| Visual Cue | Interpretation |
|---|---|
| Proximity | Categories plotted close together have strong positive associations |
| Distance from origin | Points far from the center (0,0) have the most distinctive profiles and contribute most to the total inertia |
| Opposite positions | Categories on opposite sides of an axis are negatively correlated |
| Axis alignment | Points aligned along an axis share similar profiles on that dimension |
In our example, if "East" appears close to "Soda" on the biplot, this indicates that the East region has a notably higher preference for soda compared to other regions.
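That association can be checked numerically, without the plot, by comparing row profiles (each row's conditional distribution across beverages):

```python
import pandas as pd

df = pd.DataFrame(
    {"Coffee": [50, 30, 20], "Tea": [30, 60, 10], "Soda": [10, 20, 80]},
    index=["North", "South", "East"],
)

# Row profiles: each row divided by its total, so rows sum to 1
profiles = df.div(df.sum(axis=1), axis=0)
print(profiles.round(2))
```

East's profile is dominated by Soda (80 of 110 responses, roughly 73%), which is exactly why "East" and "Soda" land close together on the biplot.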
Extracting Additional Statistics
```python
# Total inertia (the CA analogue of total variance)
print(f"Total inertia: {ca.total_inertia_:.4f}")

# Share of inertia explained by each dimension
# (recent prince releases; older releases expose `explained_inertia_` instead)
print(f"Explained inertia: {ca.percentage_of_variance_}")

# Row and column contributions to each dimension
row_contrib = ca.row_contributions_
col_contrib = ca.column_contributions_
print("\nRow contributions:\n", row_contrib)
print("\nColumn contributions:\n", col_contrib)
```
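As a sanity check: total inertia equals the table's chi-square statistic divided by the grand total of counts, so it can be verified independently with scipy (assuming scipy is installed):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same contingency table as above
table = np.array([[50, 30, 10],
                  [30, 60, 20],
                  [20, 10, 80]])

chi2, p, dof, expected = chi2_contingency(table)
total_inertia = chi2 / table.sum()  # should match ca.total_inertia_
print(f"Chi-square: {chi2:.2f}, total inertia: {total_inertia:.4f}")
```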
When to Use CA vs MCA
- Use CA when you have a pre-computed contingency table (frequency counts between two categorical variables)
- Use MCA (Multiple Correspondence Analysis) when working with raw survey data containing multiple categorical variables that need to be analyzed simultaneously
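The connection between the two: MCA is, in essence, CA applied to the one-hot (indicator) encoding of the raw table. A minimal sketch of that encoding step, again with hypothetical survey data:

```python
import pandas as pd

# Hypothetical raw survey: one row per respondent, several categorical variables
survey = pd.DataFrame({
    "region":   ["North", "South", "East", "East", "North"],
    "beverage": ["Coffee", "Tea", "Soda", "Soda", "Tea"],
})

# One-hot indicator matrix: MCA operates on (a scaled version of) this table
indicator = pd.get_dummies(survey)
print(indicator.astype(int))
```

Each respondent becomes a row of 0s and 1s, with exactly one 1 per original variable; libraries such as prince handle this encoding internally when you call `prince.MCA(...).fit(survey)`.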
Summary
Correspondence Analysis transforms complex categorical relationships into interpretable visual maps. By projecting row and column categories onto shared dimensions, CA reveals association patterns that would otherwise remain hidden in large contingency tables.