Investigating distributions using Q-Q plots

We will use quantile-quantile (Q-Q) plots to investigate whether some data seems to have been sampled from some given distribution.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
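# Matplotlib 3.6 renamed the seaborn styles; fall back to the old name on older versions.
new_style = 'seaborn-v0_8-darkgrid'
plt.style.use(new_style if new_style in plt.style.available else 'seaborn-darkgrid')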

As an introduction, we first consider the case of two different samples. To keep things simple, we assume that they have the same size. Do we think that they have been sampled from the same distribution? In the example below, we will first investigate two different samples from the same normal distribution.

To compare the two samples, we first sort them in ascending order. Intuitively, if the two samples come from the same distribution, the items at a given position in the sorted samples will tend to be "similar". (For instance, we'd expect the 10% quantiles, the 20% quantiles, the medians, and so on, to be similar.) We visualize this idea by plotting the sorted samples, using one sample for the x axis and the other one for the y axis. If the samples have been generated from the same distribution, this plot will show something similar to a straight line. (For large sample sizes, the line becomes clearer.)

In [2]:
n = 1000
sample1 = np.random.normal(3, 1, size=n)
sample2 = np.random.normal(3, 1, size=n)

plt.plot(sorted(sample1), sorted(sample2), '.');
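Plotting one sorted sample against the other only works when the two samples have the same size. As a minimal sketch of how the same idea might be extended to samples of different sizes, we can evaluate both samples at a common set of probabilities with np.quantile. The helper name qq_points, the number of points, and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def qq_points(a, b, num=100):
    """Return matching empirical quantiles of two samples that may differ in size.

    Hypothetical helper for illustration only.
    """
    # Probabilities strictly inside (0, 1), evenly spaced.
    probs = (np.arange(num) + 0.5) / num
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
small = rng.normal(3, 1, size=300)
large = rng.normal(3, 1, size=2000)

qa, qb = qq_points(small, large)
# The points could then be plotted with plt.plot(qa, qb, '.')
```

If both samples come from the same distribution, the points (qa, qb) should again lie close to a straight line.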

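In the example above, we used two samples from the same distribution and the plot was approximately a straight line (with some scatter in the tails, where observations are sparse). On the other hand, if we compare two samples from different distributions, the plot is clearly not a straight line. Below, we compare the normal sample to a sample from the uniform distribution on [0, 1).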

In [3]:
plt.figure()
sample3 = np.random.random(size=n)
plt.plot(sorted(sample1), sorted(sample3), '.');

Now, instead of comparing two samples, let's assume that we have one sample and we ask whether that sample was generated by a normal distribution. In that case, we replace the second sample by the theoretical quantiles of the distribution. (These are given by the ppf method of the SciPy distribution objects: the percent point function, which is the inverse of the cumulative distribution function.) For instance, we'd like the 10% quantile of the data to be "similar" to the 10% quantile of the distribution, and so on.

This type of plot is called a quantile-quantile (or Q-Q) plot. The code below shows how we can implement our own function to create a Q-Q plot. We then compare one of our samples to a normal distribution, and we get a nice straight line. This is expected, since we know that this sample was created by sampling normally distributed random numbers.

In [4]:
def my_own_qqplot(data, distr):
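    n = len(data)
    # Use plotting positions (i + 0.5) / n: they lie strictly inside (0, 1),
    # so ppf never returns -inf, as it would at probability 0.
    quantiles = distr.ppf((np.arange(n) + 0.5) / n)
    plt.plot(quantiles, sorted(data), '.')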
    
my_own_qqplot(sample1, stats.norm)

However, typically we wouldn't make our own function to create Q-Q plots. SciPy has a built-in function for that purpose. (If we don't specify a distribution, it will use a normal distribution by default.)

In [5]:
stats.probplot(sample1, plot=plt);
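Besides drawing the plot, probplot also returns the numbers behind it. The sketch below calls it without the plot argument and unpacks the result: the theoretical quantiles, the sorted data, and a least-squares line fitted through the points. The variable names and the seed are our own choices for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
data = rng.normal(3, 1, size=1000)

# Without plot=..., probplot only computes the values.
# osm: theoretical quantiles, osr: the sorted sample,
# (slope, intercept, r): least-squares line through the points and its correlation.
(osm, osr), (slope, intercept, r) = stats.probplot(data)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.4f}")
```

For a sample compared to the standard normal distribution, the slope and intercept roughly estimate the standard deviation and the mean of the sample, and a value of r close to 1 corresponds to a plot that is close to a straight line.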

If we apply the same function to the sample from the uniform distribution, the plot again shows that this sample does not resemble a normal distribution: the points bend away from the fitted line at both ends.

In [6]:
stats.probplot(sample3, plot=plt);

We can optionally provide the distribution that we'd like to compare the sample to. Here, we compare the sample from a uniform distribution to the uniform distribution, so we get a straight line.

In [7]:
stats.probplot(sample3, plot=plt, dist=stats.uniform);
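Some distributions need shape parameters before their quantiles can be computed. In probplot these are passed through the sparams argument. As a hedged sketch, the example below compares a sample from a gamma distribution (shape 2, an arbitrary choice for illustration) both to the matching gamma distribution and to the default normal distribution, using the correlation coefficient r of the fitted line as a rough numerical summary of straightness.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
gamma_sample = rng.gamma(shape=2.0, scale=1.0, size=1000)

# Compare to a gamma distribution with the same shape parameter.
_, (_, _, r_gamma) = stats.probplot(gamma_sample, dist=stats.gamma, sparams=(2.0,))

# Compare to the default (normal) distribution.
_, (_, _, r_norm) = stats.probplot(gamma_sample)

print(f"r against gamma(2): {r_gamma:.4f}")
print(f"r against normal:   {r_norm:.4f}")
```

We would expect r to be closer to 1 for the gamma comparison, since the sample is skewed and the normal distribution is not. Adding plot=plt to either call draws the corresponding Q-Q plot.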