$p$-hacking illustration

We show how to fish for "statistically significant" correlations in completely random data. "For everyone who asks receives; the one who seeks finds..."

In [1]:
from scipy import stats
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
plt.style.use('seaborn-darkgrid')
%config InlineBackend.figure_format = 'retina'

The following just demonstrates the SciPy function pearsonr. It returns the sample correlation coefficient along with a $p$-value with respect to the null hypothesis that the true correlation is zero. Here we construct two arrays x and y to be correlated, and we indeed see a strong correlation and a very low $p$-value.

In [2]:
x = np.random.normal(size=50)
y = x + np.random.normal(scale=0.5, size=50)
plt.scatter(x, y)
stats.pearsonr(x, y)
Out[2]:
(0.8201249243698255, 3.1825575822343794e-13)

We create a numerical dataset consisting of 100 columns and 1000 rows. The numbers are independent normals, so no true correlations here. [We put this into a Pandas DataFrame with named columns, just to be able to use lmplot below.]

In [3]:
n_cols = 100
n_rows = 1000

data = pd.DataFrame(np.random.normal(size = (n_rows, n_cols)), 
                    columns=[str(i) for i in range(n_cols)])

We search for the pair of columns whose correlation has the lowest $p$-value. The number of comparisons is $\binom{100}{2} = \frac{100\cdot 99}{2} = 4950$.
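The pair count of 4950 can be checked directly with the standard library (math.comb requires Python 3.8+):

```python
import math

# Number of distinct unordered pairs among 100 columns.
n_cols = 100
n_pairs = math.comb(n_cols, 2)
print(n_pairs)  # 4950
```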

In [4]:
min_pval = np.inf
min_corr = None
min_pair = None

for i in range(n_cols):
    for j in range(i+1, n_cols):
        corr, pval = stats.pearsonr(data.iloc[:,i], data.iloc[:,j])
                
        if pval < min_pval:
            min_corr = corr
            min_pval = pval
            min_pair = i, j

min_pval, min_corr, min_pair
Out[4]:
(0.00019881428016333365, -0.11739292410750021, (9, 59))

As you can see, we find a pair of columns with a very low $p$-value, which, taken in isolation, looks like strong evidence of a real association. We can also make a scatterplot of the two columns.

In [5]:
sns.lmplot(x=str(min_pair[0]), y=str(min_pair[1]), data=data);
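Finding such a low $p$-value is exactly what we should expect when searching over so many pairs. Under the null hypothesis each $p$-value is (approximately) uniform on $[0, 1]$, so if the 4950 tests were independent, the chance that at least one falls below a threshold $\alpha$ would be $1 - (1 - \alpha)^{4950}$. A back-of-the-envelope check (independence is only approximate here, since the pairs share columns):

```python
# Probability that at least one of N roughly independent tests yields
# p < alpha when every null hypothesis is in fact true.
n_comparisons = 100 * 99 // 2  # 4950 pairs

alpha = 0.05
print(1 - (1 - alpha)**n_comparisons)   # essentially 1: a "significant" pair is all but guaranteed

# Even at a threshold as strict as the p-value found above:
p_observed = 0.0002
print(1 - (1 - p_observed)**n_comparisons)  # still better than even odds
```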

One method to get around the problem of multiple comparisons is the Bonferroni correction, where we simply multiply each $p$-value by the number of comparisons. With this correction, we no longer find a "significant" correlation.

In [6]:
min_pval = np.inf
min_corr = None
min_pair = None

n_comparisons = n_cols*(n_cols-1) / 2

for i in range(n_cols):
    for j in range(i+1, n_cols):
        corr, pval = stats.pearsonr(data.iloc[:,i], data.iloc[:,j])

        pval *= n_comparisons
        
        if pval < min_pval:
            min_corr = corr
            min_pval = pval
            min_pair = i, j

min_pval, min_corr, min_pair
Out[6]:
(0.9841306868085016, -0.11739292410750021, (9, 59))
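Equivalently, the Bonferroni correction can be phrased as a stricter significance threshold: rather than multiplying each $p$-value by 4950, compare the raw $p$-values against $\alpha/4950$. A quick check using the minimum raw $p$-value from Out[4] above:

```python
# Bonferroni as a threshold: reject only if p < alpha / n_comparisons.
alpha = 0.05
n_comparisons = 100 * 99 // 2          # 4950 pairs
threshold = alpha / n_comparisons      # roughly 1.01e-05

min_raw_pval = 0.00019881428016333365  # minimum raw p-value from Out[4]
print(min_raw_pval < threshold)        # False: not significant after correction
```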