We show how to fish for "statistically significant" correlations in completely random data. "For everyone who asks receives; the one who seeks finds..."
from scipy import stats
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('seaborn-darkgrid')
%config InlineBackend.figure_format = 'retina'
The following is just to demonstrate the use of the SciPy function pearsonr
. It gives the sample correlation, and also a $p$-value with respect to the null hypothesis that the true correlation is zero. In this case, we create two arrays x
and y
that are constructed to be correlated, and we see a strong correlation and a low $p$-value.
x = np.random.normal(size=50)
y = x + np.random.normal(scale=0.5, size=50)
plt.scatter(x, y)
stats.pearsonr(x, y)
We create a numerical dataset consisting of 100 columns and 1000 rows. The numbers are independent normals, so no true correlations here. [We put this into a Pandas DataFrame
with named columns, just to be able to use lmplot
below.]
n_cols = 100
n_rows = 1000
data = pd.DataFrame(np.random.normal(size = (n_rows, n_cols)),
columns=[str(i) for i in range(n_cols)])
We search for the pair of columns where the $p$-value for the correlation is the lowest. The number of comparisons is ${100}\choose{2}$ = $\frac{100\cdot99}{2}$ = 4950.
min_pval = np.inf
min_corr = None
min_pair = None
for i in range(n_cols):
for j in range(i+1, n_cols):
corr, pval = stats.pearsonr(data.iloc[:,i], data.iloc[:,j])
if pval < min_pval:
min_corr = corr
min_pval = pval
min_pair = i, j
min_pval, min_corr, min_pair
As you can see, we find a pair of columns where the $p$-value is very low, which seems to support a significant effect. We can also make a scatterplot of the two columns.
sns.lmplot(str(min_pair[0]), str(min_pair[1]), data);
One method to get around the problem of multiple comparison is the Bonferroni correction, where we just multiply the $p$-value by the number of comparisons. In this case, we don't find a "significant" correlation.
min_pval = np.inf
min_corr = None
min_pair = None
n_comparisons = n_cols*(n_cols-1) / 2
for i in range(n_cols):
for j in range(i+1, n_cols):
corr, pval = stats.pearsonr(data.iloc[:,i], data.iloc[:,j])
pval *= n_comparisons
if pval < min_pval:
min_corr = corr
min_pval = pval
min_pair = i, j
min_pval, min_corr, min_pair