Relationship Exploration in EDA: Correlation, Independence, and the Right Test
Choosing the wrong correlation or independence test produces misleading results. The right choice depends on distribution shape, modality, skewness, and cardinality of the columns involved.
Relationship exploration is one of the more under-specified steps in EDA. Most tutorials show a correlation heatmap, declare features above 0.8 as redundant, and move on. That approach misses the point. The correlation coefficient is only one of several possible measures of association, and it is only the right one under specific conditions. Using it indiscriminately produces numbers that appear meaningful but can be close to zero for variables that are strongly related in a non-linear way.
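The near-zero case is easy to reproduce. The sketch below uses synthetic data (the variable names, seed, and noise level are illustrative assumptions, not taken from any real dataset) in which y is almost fully determined by x, yet the linear coefficient is small because the dependence is U-shaped.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: y depends strongly on x, but through a symmetric U-shape
x = rng.uniform(-3, 3, 500)
y = x ** 2 + rng.normal(0, 0.3, 500)

# Linear association between x and y: the rising and falling halves cancel out
r_linear, _ = stats.pearsonr(x, y)

# Association once the curvature is modelled explicitly
r_squared_term, _ = stats.pearsonr(x ** 2, y)

print(f"Pearson(x, y):    {r_linear:.3f}")
print(f"Pearson(x**2, y): {r_squared_term:.3f}")
```

A heatmap built on the first number alone would suggest these two columns are unrelated.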
The purpose of relationship exploration is four-fold: finding redundant features, understanding how features interact with each other and with domain knowledge, informing what feature engineering is worth doing, and verifying the assumptions that the chosen model requires.
Correlation: Pearson and Spearman
Pearson correlation measures the strength of the linear relationship between two continuous variables. It assumes that both variables are approximately normally distributed and that the relationship between them is linear. It is sensitive to outliers, because a single extreme value can produce a large Pearson coefficient between variables that are otherwise unrelated.
Spearman correlation measures the monotonic relationship between two variables. It works by ranking each variable and computing the Pearson correlation on the ranks. This means it does not assume normality, handles non-linear but monotonic relationships correctly, and is resistant to outliers. If one variable consistently increases (or consistently decreases) as the other increases, Spearman captures that, even if the relationship is curved rather than straight.
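The "Pearson on the ranks" description can be checked directly. This is a small sketch on synthetic data (the distributions and seed are arbitrary assumptions) comparing `scipy.stats.spearmanr` with a Pearson coefficient computed on `scipy.stats.rankdata` output.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical skewed data with a curved, increasing relationship
x = rng.exponential(scale=2.0, size=200)
y = np.exp(x) + rng.normal(0, 1.0, size=200)

# Spearman as computed by SciPy
rho, _ = stats.spearmanr(x, y)

# Pearson applied to the ranks of each variable
r_on_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(f"Spearman rho:     {rho:.6f}")
print(f"Pearson on ranks: {r_on_ranks:.6f}")
```

The two values agree up to floating-point error. Because only rank order enters the calculation, the size of an extreme value has no influence on the result.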
The prerequisite for choosing between them is knowing the distribution of both columns. This is why relationship exploration belongs after individual column analysis, not before. If either column is skewed, non-normal, or contains outliers, Spearman is the safer choice. If both columns are approximately normal and the relationship is expected to be linear, Pearson is appropriate.
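To see why outliers tip the choice toward Spearman, the following sketch generates two independent columns and then appends a single extreme point. The data, seed, and the outlier value of 50 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two hypothetical columns generated independently of each other
x = rng.normal(0, 1, 100)
y = rng.normal(0, 1, 100)

# The same columns with one extreme observation appended to both
x_out = np.append(x, 50.0)
y_out = np.append(y, 50.0)

pearson_before, _ = stats.pearsonr(x, y)
pearson_after, _ = stats.pearsonr(x_out, y_out)
spearman_before, _ = stats.spearmanr(x, y)
spearman_after, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson:  before={pearson_before:.3f}, after={pearson_after:.3f}")
print(f"Spearman: before={spearman_before:.3f}, after={spearman_after:.3f}")
```

One point out of 101 is enough to push Pearson close to 1, while Spearman barely moves because the outlier contributes only one pair of ranks.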
Partial correlation extends this to a three-variable question: what is the correlation between A and B after removing the effect of C? If age and income are both correlated with credit score, the correlation between age and income may be partially driven by their shared relationship with credit score. Partial correlation isolates the direct relationship.
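One common way to compute a partial correlation is to regress C out of both A and B and correlate the residuals. The sketch below takes that approach; `partial_corr` is a hypothetical helper written for this example, not a SciPy function, and the synthetic data are constructed so that A and B are related only through C.

```python
import numpy as np
from scipy import stats

def partial_corr(a, b, c):
    """Pearson correlation of a and b after linearly regressing c out of both.

    Hypothetical helper for illustration; returns the coefficient only.
    """
    design = np.column_stack([np.ones_like(c), c])  # intercept + control variable
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return stats.pearsonr(res_a, res_b)[0]

rng = np.random.default_rng(3)
n = 1000

# Assumed setup: a and b each depend on c, but not on each other
c = rng.normal(0, 1, n)
a = 2.0 * c + rng.normal(0, 1, n)
b = -1.5 * c + rng.normal(0, 1, n)

raw_r = stats.pearsonr(a, b)[0]
partial_r = partial_corr(a, b, c)

print(f"Raw correlation(a, b):         {raw_r:.3f}")
print(f"Partial correlation(a, b | c): {partial_r:.3f}")
```

The raw coefficient is large even though A and B have no direct link, and the partial coefficient falls to roughly zero. The helper removes only the linear effect of C, and a p-value for the result would need its degrees of freedom reduced by one for the control variable, which is why it returns the coefficient alone.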
import numpy as np
import pandas as pd
from scipy import stats
rng = np.random.default_rng(42)
n = 300
# Two variables with a monotonic but non-linear relationship
x = rng.uniform(1, 10, n)
y = x ** 2 + rng.normal(0, 5, n)
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)
print(f"Pearson: r={pearson_r:.3f}, p={pearson_p:.4f}")
print(f"Spearman: r={spearman_r:.3f}, p={spearman_p:.4f}")
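# The relationship is monotonic but curved. Spearman depends only on rank order,
# while Pearson is pulled down by the departure from a straight line.
# With curvature this mild the two values are close; the gap widens for more
# strongly curved relationships such as y = exp(x).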
Testing Independence
When the question shifts from “how strongly are these related?” to “are these two variables independent?”, the right tools are hypothesis tests.
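For two continuous variables, the Pearson or Spearman p-value directly tests the null hypothesis that the population correlation is zero. A low p-value means the null can be rejected: the data show a linear (Pearson) or monotonic (Spearman) association, so the variables are not independent. The reverse does not hold. A high p-value does not establish independence, because a non-monotonic dependence can leave both coefficients near zero.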
For a continuous variable against a categorical one, the choice depends on how many groups the categorical variable has and whether the continuous variable is normally distributed. The independent samples t-test compares the means of two groups and assumes normality and equal variance. When comparing more than two groups, one-way ANOVA is the extension. Both tests answer the same structural question: is the mean of the continuous variable different across the groups defined by the categorical variable?
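A short sketch of both tests with SciPy follows. The three groups are synthetic and their means are assumptions for illustration. `ttest_ind` defaults to the equal-variance (Student) form described above; passing `equal_var=False` switches to Welch's variant, which drops that assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical continuous measurements for three categories
g1 = rng.normal(50, 10, 80)
g2 = rng.normal(55, 10, 80)
g3 = rng.normal(60, 10, 80)

# Two groups: Student's t-test (equal variances assumed) and Welch's t-test
t_stat, t_p = stats.ttest_ind(g1, g2)
welch_t, welch_p = stats.ttest_ind(g1, g2, equal_var=False)

# Three groups: one-way ANOVA
f_stat, f_p = stats.f_oneway(g1, g2, g3)

# With exactly two groups, ANOVA and the Student t-test coincide: F equals t squared
f_two, f_two_p = stats.f_oneway(g1, g2)

print(f"Student t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Welch t-test:   t={welch_t:.3f}, p={welch_p:.4f}")
print(f"ANOVA (3 grps): F={f_stat:.3f}, p={f_p:.4g}")
print(f"ANOVA (2 grps): F={f_two:.3f}, t^2={t_stat ** 2:.3f}")
```

The last line shows why ANOVA is described as the extension of the t-test: on two groups they are the same test.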
The one-sample t-test is the simpler case: does the mean of a single group differ from a known reference value? The paired t-test applies when the same observations are measured twice, such as before and after an intervention, because it removes the between-subject variance and focuses on the within-subject change.
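The effect of pairing can be illustrated with a before/after sketch. The measurements here are synthetic, and the reference value of 100 and the shift of about 3 units are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical repeated measurements on the same 60 subjects
before = rng.normal(100, 15, 60)
after = before + rng.normal(3, 4, 60)  # small within-subject shift

# One-sample: does the baseline mean differ from a reference value of 100?
one_t, one_p = stats.ttest_1samp(before, popmean=100)

# Paired: tests the mean of the within-subject differences
paired_t, paired_p = stats.ttest_rel(after, before)

# Equivalent formulation: one-sample t-test on the differences
diff_t, diff_p = stats.ttest_1samp(after - before, popmean=0)

# Ignoring the pairing leaves the between-subject variance in the denominator
indep_t, indep_p = stats.ttest_ind(after, before)

print(f"One-sample vs 100: t={one_t:.3f}, p={one_p:.4f}")
print(f"Paired:            t={paired_t:.3f}, p={paired_p:.4g}")
print(f"Independent:       t={indep_t:.3f}, p={indep_p:.4f}")
```

The paired statistic is much larger in magnitude than the independent one on the same numbers, because the large spread between subjects cancels out of the differences.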
For two categorical variables, the chi-squared test of independence tests whether the distribution of one categorical variable differs across the levels of another. A common application is testing whether fraud rate is independent of product category. The test requires that expected cell counts are sufficient, typically at least 5 per cell. High-cardinality columns produce many cells with very small expected counts, which invalidates the test. Before applying chi-squared to categorical features, check the cardinality. If either column has many rare categories, group them first.
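The sketch below walks through that workflow on synthetic data: group rare categories, build the contingency table, and inspect the expected counts returned by `scipy.stats.chi2_contingency`. The column names, category labels, fraud rate, and the rarity threshold of 200 rows are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(6)
n = 5000

# Hypothetical transactions: fraud is generated independently of category
categories = ["electronics", "grocery", "apparel", "toys", "garden", "antiques"]
probs = [0.35, 0.30, 0.25, 0.07, 0.02, 0.01]
df = pd.DataFrame({
    "product_category": rng.choice(categories, size=n, p=probs),
    "is_fraud": rng.random(n) < 0.08,
})

# Expected counts before grouping: rare categories produce small cells
raw_table = pd.crosstab(df["product_category"], df["is_fraud"])
raw_min_expected = stats.chi2_contingency(raw_table)[3].min()

# Group categories with fewer than 200 rows (assumed threshold) into "other"
counts = df["product_category"].value_counts()
rare = counts[counts < 200].index
df["category_grouped"] = df["product_category"].where(
    ~df["product_category"].isin(rare), "other"
)

table = pd.crosstab(df["category_grouped"], df["is_fraud"])
chi2, p, dof, expected = stats.chi2_contingency(table)

print(f"Min expected count, raw:     {raw_min_expected:.1f}")
print(f"Min expected count, grouped: {expected.min():.1f}")
print(f"chi2={chi2:.3f}, dof={dof}, p={p:.4f}")
```

Checking `expected.min()` against the threshold of 5 before reading the p-value is a cheap guard. The right rarity threshold depends on the dataset size and the base rate of the other variable.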
Non-Parametric Alternatives
When normality cannot be assumed, the t-test and ANOVA have non-parametric equivalents that use ranks rather than raw values.
Mann-Whitney U is the non-parametric alternative to the independent t-test for two groups. It tests whether one group tends to have higher values than the other, without assuming any particular distribution. Kruskal-Wallis extends this to three or more groups, replacing ANOVA.
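import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
# Two groups drawn from skewed (log-normal) distributions, so normality is not assumed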
group_a = rng.lognormal(mean=3, sigma=0.8, size=100)
group_b = rng.lognormal(mean=3.3, sigma=0.8, size=100)
t_stat, t_p = stats.ttest_ind(group_a, group_b)
u_stat, mw_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Independent t-test: t={t_stat:.3f}, p={t_p:.4f}")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={mw_p:.4f}")
When the data is skewed, as it is here, the Mann-Whitney p-value is more trustworthy because the t-test's normality assumption is violated, while the rank-based test makes no such assumption.
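The three-group case follows the same pattern with `scipy.stats.kruskal`. In this sketch the groups are synthetic log-normal samples with assumed parameters. It also runs the test on log-transformed values to show that a rank-based statistic does not change under a monotonic transformation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Three hypothetical skewed groups with increasing location
low = rng.lognormal(mean=3.0, sigma=0.8, size=100)
mid = rng.lognormal(mean=3.4, sigma=0.8, size=100)
high = rng.lognormal(mean=3.8, sigma=0.8, size=100)

# Parametric and rank-based tests on the raw values
f_stat, anova_p = stats.f_oneway(low, mid, high)
h_stat, kw_p = stats.kruskal(low, mid, high)

# Kruskal-Wallis uses ranks only, so a log transform leaves H unchanged
h_log, kw_log_p = stats.kruskal(np.log(low), np.log(mid), np.log(high))

print(f"ANOVA:                F={f_stat:.3f}, p={anova_p:.4g}")
print(f"Kruskal-Wallis:       H={h_stat:.3f}, p={kw_p:.4g}")
print(f"Kruskal-Wallis (log): H={h_log:.3f}, p={kw_log_p:.4g}")
```

ANOVA, by contrast, would give a different F statistic on the raw and the log-transformed values, which is one way to see how much it depends on the shape of the distribution.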
How to Choose
The choice of test follows from the answers to four questions about the columns involved: what is their distribution shape, do they have notable skewness or kurtosis, what is their cardinality, and is the relationship expected to be linear or monotonic?
Normal continuous vs normal continuous, linear relationship: Pearson.
Non-normal, skewed, or ordinal vs any continuous: Spearman.
Continuous vs binary or low-cardinality categorical, normal: t-test or ANOVA.
Continuous vs categorical, non-normal: Mann-Whitney or Kruskal-Wallis.
Categorical vs categorical, low-to-moderate cardinality: chi-squared.
Categorical vs categorical, high cardinality: group rare categories before testing.
These rules come from the assumptions embedded in each test. Violating the assumptions does not produce an error message; it produces a p-value that appears meaningful but is not.
What You Can Do Now
Run this block on any two columns in a dataset to get a test recommendation based on their properties:
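from scipy import stats
import pandas as pd
import numpy as np

def recommend_test(a: pd.Series, b: pd.Series) -> str:
    """Suggest a test for two columns from their dtype, normality, and cardinality.

    Integer-coded categories count as numeric here; cast them to 'category' first.
    """
    a_num = pd.api.types.is_numeric_dtype(a)
    b_num = pd.api.types.is_numeric_dtype(b)

    if a_num and b_num:
        # normaltest p-value above 0.05 means normality is not rejected
        _, a_norm = stats.normaltest(a.dropna())
        _, b_norm = stats.normaltest(b.dropna())
        if a_norm > 0.05 and b_norm > 0.05:
            return "Both normal -> use Pearson correlation"
        else:
            return "Non-normal -> use Spearman correlation"
    elif a_num and not b_num:
        # Numeric vs categorical: the number of groups decides between the two-group
        # and multi-group test. Normality is checked on the pooled numeric column
        # as a rough screen, not within each group.
        n_groups = b.nunique()
        _, norm_p = stats.normaltest(a.dropna())
        if norm_p > 0.05:
            return f"{n_groups} groups, normal -> use {'t-test' if n_groups == 2 else 'ANOVA'}"
        else:
            return f"{n_groups} groups, non-normal -> use {'Mann-Whitney' if n_groups == 2 else 'Kruskal-Wallis'}"
    elif not a_num and not b_num:
        # Categorical vs categorical: many levels lead to sparse expected counts
        max_card = max(a.nunique(), b.nunique())
        if max_card > 20:
            return f"High cardinality ({max_card}) -> group rare categories before chi-squared"
        return "Both categorical -> use chi-squared test of independence"
    else:
        # a is categorical and b is numeric: swap so the numeric column comes first
        return recommend_test(b, a)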
Pass two columns from your current dataset and let the output guide you to the right test before computing any correlation or p-value.