Lecture 1: basic data analysis

We'll do some basic data processing: reading the data, computing descriptive statistics, and plotting histograms and scatterplots. We'll do this in a few different ways, to show how the various libraries are used. First, we import the Python libraries we'll use:

In [1]:
import pandas as pd

import matplotlib.pyplot as plt # basic plotting library
import seaborn as sns           # additional plotting functions

import scipy
import scipy.stats as stats
import numpy as np

sns.set_style('darkgrid') # nicer looking plots (newer Matplotlib releases no longer accept the style name 'seaborn-darkgrid')

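We use the Pandas function read_csv to read the data file. Despite the name (CSV stands for comma-separated values), read_csv handles other delimiters as well: here the fields are separated by spaces, so we pass sep=' ', and since the file has no header row we pass header=None and supply the column names ourselves. The function returns a Pandas DataFrame, which is a table with named columns.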

In [2]:
data = pd.read_csv('bodies.txt', sep=' ', header=None, 
                   names=['height', 'weight', 'gender'])
data.head()
Out[2]:
   height  weight gender
0     182      86      m
1     193     112      m
2     172      72      m
3     170      61      f
4     167      58      f
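
To experiment with the same read_csv options without the file bodies.txt, the following minimal sketch parses the five rows shown above from an in-memory string. The names sample_text and sample are illustrative and not part of the original notebook, and the layout of the text is assumed to match what the read_csv call above expects.

```python
import io
import pandas as pd

# The five rows from the head() output, in a space-separated,
# header-less layout (an assumption about how bodies.txt looks).
sample_text = """182 86 m
193 112 m
172 72 m
170 61 f
167 58 f
"""

# header=None says there is no header row in the input;
# names= supplies the column labels instead.
sample = pd.read_csv(io.StringIO(sample_text), sep=' ', header=None,
                     names=['height', 'weight', 'gender'])
print(sample.dtypes)
```

Printing dtypes is a quick way to check that the numeric columns were parsed as numbers rather than as strings.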

The method describe() computes basic descriptive statistics for the numeric columns of the Pandas DataFrame.

In [3]:
data.describe()
Out[3]:
           height      weight
count   22.000000   22.000000
mean   169.590909   69.636364
std     10.140250   15.941994
min    152.000000   47.000000
25%    165.000000   58.750000
50%    167.000000   66.000000
75%    177.000000   82.000000
max    193.000000  112.000000
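
One detail worth knowing when comparing these numbers with other libraries: the std row reported by describe() is the sample standard deviation (dividing by n - 1), whereas np.std divides by n unless ddof=1 is passed. A small sketch, using only the five heights from the head() output rather than the full dataset:

```python
import numpy as np
import pandas as pd

# Five heights taken from the head() output; not the full 22-row dataset.
heights = pd.Series([182, 193, 172, 170, 167])

sample_std = heights.std()                         # Pandas default: ddof=1
population_std = np.std(heights.to_numpy())        # NumPy default: ddof=0
matching_std = np.std(heights.to_numpy(), ddof=1)  # same convention as Pandas

print(sample_std, population_std, matching_std)
```

With only a handful of observations the two conventions differ noticeably, so it helps to state which one is being reported.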

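We can compute the same basic descriptive statistics using NumPy functions or the corresponding Pandas Series methods.

In [4]:
np.mean(data['height']), np.median(data['height'])  # scipy.mean and scipy.median were deprecated aliases of these NumPy functions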
Out[4]:
(169.5909090909091, 167.0)
In [5]:
data['height'].mean(), data['height'].median()
Out[5]:
(169.5909090909091, 167.0)
In [6]:
np.percentile(data['height'], 25)
Out[6]:
165.0
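
To make clear what np.percentile computes with its default settings, here is a minimal pure-Python sketch of percentile estimation by linear interpolation between the two nearest sorted values. The function name percentile_linear is made up for this illustration, and the data are the five heights from the head() output rather than the full dataset.

```python
def percentile_linear(values, q):
    """Percentile q (0 to 100) using linear interpolation between order statistics."""
    xs = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    pos = (len(xs) - 1) * q / 100.0
    lower = int(pos)                   # index just below (or at) the position
    upper = min(lower + 1, len(xs) - 1)
    frac = pos - lower                 # how far we are between the two neighbours
    return xs[lower] + frac * (xs[upper] - xs[lower])

heights = [182, 193, 172, 170, 167]
print(percentile_linear(heights, 25))
print(percentile_linear(heights, 50))
```

When the position falls exactly on a data point (as for the 25th percentile of five values), no interpolation is needed and that value is returned.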

Plot histograms of the heights using Pandas, matplotlib, and Seaborn, in that order.

In [7]:
data['height'].hist();
In [8]:
plt.hist(data.height, bins=10);
In [9]:
sns.histplot(data['height'], bins=10, kde=True, stat='density');  # replaces the deprecated sns.distplot

A slightly more complicated example: we first plot a histogram using Seaborn, and then add two red lines marking the 5th and 95th percentiles. (In this example, we're using a different dataset in order to make the histogram look a bit nicer.)

In [10]:
data_synthetic = pd.read_csv('synthetic.txt', sep=' ', 
                             header=None, names=['height', 'weight', 'gender'])
sns.histplot(data_synthetic['height'], bins=10, kde=True, stat='density');  # replaces the deprecated sns.distplot
p5 = data_synthetic['height'].quantile(0.05)
p95 = data_synthetic['height'].quantile(0.95)
plt.plot([p5, p5], [0, 0.05], 'r')
plt.plot([p95, p95], [0, 0.05], 'r');
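
The percentile markers above are drawn with a hard-coded height of 0.05, which only works if the density axis happens to have that scale. As a possible alternative, the sketch below uses matplotlib's axvline, which spans the full height of the axes. The data here are made-up random numbers standing in for synthetic.txt, which is not reproduced in these notes.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up heights, NOT the contents of synthetic.txt.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=500)

fig, ax = plt.subplots()
ax.hist(heights, bins=10, density=True)

# axvline draws a vertical line across the whole axes,
# so no y-range has to be guessed.
p5, p95 = np.percentile(heights, [5, 95])
ax.axvline(p5, color='r')
ax.axvline(p95, color='r')
```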

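Back to the first dataset. We now consider the relation between two columns, in this case the height and the weight. We first compute the Pearson correlation in two different ways: stats.pearsonr returns the correlation coefficient together with a p-value, while the Pandas method corr returns only the coefficient.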

In [11]:
stats.pearsonr(data['height'], data['weight'])
Out[11]:
(0.8718485977106544, 1.2573572624134316e-07)
In [12]:
data['height'].corr(data['weight'])
Out[12]:
0.8718485977106544
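
For reference, the Pearson correlation coefficient can also be computed directly from its definition: the sum of products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below does this in pure Python; the function name pearson_r is made up for this illustration, and the data are only the five rows from the head() output, so the result will differ from the value for the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations from the mean
    dy = [y - mean_y for y in ys]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Five rows from the head() output; not the full 22-row dataset.
heights = [182, 193, 172, 170, 167]
weights = [86, 112, 72, 61, 58]
print(pearson_r(heights, weights))
```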

The scatterplot is a basic tool for visualizing relations in data. Here is a short tour of the different plotting libraries, to showcase some ways you can draw scatterplots (and some density plots).

In [13]:
data.plot.scatter('height', 'weight');
In [14]:
sns.lmplot(x='height', y='weight', data=data);
In [15]:
sns.lmplot(x='height', y='weight', data=data, hue='gender');
In [16]:
sns.lmplot(x='height', y='weight', data=data, hue='gender', fit_reg=False);
In [17]:
sns.jointplot(x='height', y='weight', data=data, kind='scatter');
In [18]:
sns.kdeplot(x=data['height'], y=data['weight']);  # recent Seaborn versions require x= and y= as keywords
In [19]:
sns.kdeplot(x=data['height'], y=data['weight'], fill=True);  # fill= replaces the older shade=
In [20]:
plt.scatter(data['height'], data['weight']);
In [21]:
men = data[data['gender'] == 'm']
women = data[data['gender'] == 'f']
plt.scatter(men['height'], men['weight'], color='r')
plt.scatter(women['height'], women['weight'], color='b');
In [22]:
plt.plot(men['height'], men['weight'], 'g.', 
         women['height'], women['weight'], 'r.');
In [23]:
sns.kdeplot(x=men['height'], y=men['weight'], cmap='Reds')      # x= and y= passed as keywords
sns.kdeplot(x=women['height'], y=women['weight'], cmap='Blues');
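
The split into men and women above uses one boolean mask per gender. When there are more categories, or when only per-group statistics are needed, the Pandas groupby method does the same split in one step. A minimal sketch, using a hand-typed DataFrame with the five rows from the head() output (the names sample and groups are illustrative):

```python
import pandas as pd

# Five rows from the head() output; not the full 22-row dataset.
sample = pd.DataFrame({
    'height': [182, 193, 172, 170, 167],
    'weight': [86, 112, 72, 61, 58],
    'gender': ['m', 'm', 'm', 'f', 'f'],
})

# Boolean-mask selection, as in the cells above.
men = sample[sample['gender'] == 'm']

# groupby yields one (label, sub-DataFrame) pair per distinct gender value.
groups = {label: group for label, group in sample.groupby('gender')}
print(groups['m'].equals(men))

# Per-group means of the numeric columns.
print(sample.groupby('gender')[['height', 'weight']].mean())
```

Each sub-DataFrame in groups can be passed to plt.scatter or sns.kdeplot in a loop, in the same way men and women are used above.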