Lecture 1: basic data analysis

We'll do some basic data processing: reading the data, computing descriptive statistics, and plotting histograms and scatterplots. We'll do this in a few different ways, to show how the various libraries are used. First, we import the Python libraries we'll use:

import pandas as pd

import matplotlib.pyplot as plt # basic plotting library
import seaborn as sns           # additional plotting functions

import scipy
import scipy.stats as stats
import numpy as np

sns.set_style('darkgrid') # nicer looking plots; the 'seaborn-darkgrid' style name was renamed in newer matplotlib

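We use the Pandas utility function read_csv to read the data file. Despite the name (CSV stands for comma-separated values), read_csv handles other delimiters too: this file is space-separated and has no header row, so we pass sep=' ' and header=None and supply the column names ourselves. The function returns a Pandas DataFrame, which is essentially a table with named columns.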

data = pd.read_csv('bodies.txt', sep=' ', header=None, 
                   names=['height', 'weight', 'gender'])
data.head()
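
To make the read_csv arguments concrete, here is a minimal sketch that parses the same format from an in-memory string instead of a file. The five rows are copied from the data.head() output in this notebook; the names raw and sample are made up for the illustration, and the real bodies.txt is assumed to follow the same layout.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the first five lines of 'bodies.txt'
# (values taken from the data.head() output in this notebook).
raw = io.StringIO(
    "182 86 m\n"
    "193 112 m\n"
    "172 72 m\n"
    "170 61 f\n"
    "167 58 f\n"
)

# sep=' ' splits each line on single spaces; header=None says the first
# line is data, so the column names are supplied explicitly via names=.
sample = pd.read_csv(raw, sep=' ', header=None,
                     names=['height', 'weight', 'gender'])

print(sample.shape)   # number of (rows, columns)
print(sample.dtypes)  # height and weight are parsed as integers
```

Without header=None, Pandas would treat the first data row as the column names and silently drop one observation.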

The method describe() computes basic descriptive statistics for the numeric columns of the Pandas DataFrame: the count, mean, standard deviation, minimum, quartiles, and maximum.

data.describe()

Compute the same basic descriptive statistics using NumPy functions and Pandas methods.

# NumPy functions (the scipy.mean / scipy.median aliases are deprecated)
np.mean(data['height']), np.median(data['height'])

(169.5909090909091, 167.0)

data['height'].mean(), data['height'].median()

(169.5909090909091, 167.0)

np.percentile(data['height'], 25)

165.0
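
Two details are easy to trip over when mixing NumPy and Pandas for these statistics. The sketch below illustrates them on the five heights shown by data.head(), so the numbers differ from those of the full dataset. np.percentile expects a percentage (0 to 100), whereas Series.quantile expects a fraction (0 to 1). np.std defaults to the population standard deviation (ddof=0), whereas Series.std, and therefore describe(), uses the sample standard deviation (ddof=1).

```python
import numpy as np
import pandas as pd

# First five heights from the data.head() output (illustration only).
heights = pd.Series([182, 193, 172, 170, 167])
arr = heights.to_numpy()

# Same quartile, two conventions for specifying it.
q25_np = np.percentile(arr, 25)    # percentage: 0..100
q25_pd = heights.quantile(0.25)    # fraction:   0..1

# Different default normalisation for the standard deviation.
std_population = np.std(arr)       # divides by n      (ddof=0)
std_sample = heights.std()         # divides by n - 1  (ddof=1)

print(q25_np, q25_pd)
print(std_population, std_sample, np.std(arr, ddof=1))
```

Passing ddof=1 to np.std reproduces the Pandas value.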

Plot histograms using matplotlib, Pandas, and Seaborn.

data['height'].hist();

plt.hist(data.height, bins=10);

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data['height'], bins=10, kde=True, stat='density');

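
All three plotting calls above rest on the same computation: split the value range into equal-width bins and count the observations in each. As a rough illustration, np.histogram exposes that computation without drawing anything. The five heights are taken from the data.head() output, and bins=3 is an arbitrary choice for this small sample.

```python
import numpy as np

# First five heights from the data.head() output (illustration only).
heights = np.array([182, 193, 172, 170, 167])

# Raw counts per bin, plus the bin edges (one more edge than bins).
counts, edges = np.histogram(heights, bins=3)
print(counts)
print(edges)

# With density=True the bar heights are rescaled so that the total
# area (height * bin width, summed over bins) equals 1.
density, _ = np.histogram(heights, bins=3, density=True)
area = (density * np.diff(edges)).sum()
print(area)
```

The density normalisation is what allows a histogram to be overlaid with a kernel density estimate on the same vertical scale.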

A slightly more complicated example, where we first plot a histogram using Seaborn, and then add two red lines marking the 5th and 95th percentiles. (In this example, we're using a different dataset in order to make the histogram look a bit nicer.)

data_synthetic = pd.read_csv('synthetic.txt', sep=' ', 
                             header=None, names=['height', 'weight', 'gender'])
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data_synthetic['height'], bins=10, kde=True, stat='density');
p5 = data_synthetic['height'].quantile(0.05)
p95 = data_synthetic['height'].quantile(0.95)
plt.plot([p5, p5], [0, 0.05], 'r')
plt.plot([p95, p95], [0, 0.05], 'r');

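
The red lines above are drawn with plt.plot and a hard-coded upper limit of 0.05, which has to be adjusted by hand if the vertical scale changes. One possible alternative is matplotlib's axvline, which spans the full height of the axes. The sketch below assumes normally distributed stand-in data from a seeded random generator, because the contents of synthetic.txt are not shown here; the mean of 170 and standard deviation of 10 are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the 'height' column of synthetic.txt.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=1000)

# 5th and 95th percentiles in a single call.
p5, p95 = np.percentile(heights, [5, 95])

fig, ax = plt.subplots()
ax.hist(heights, bins=10, density=True)
ax.axvline(p5, color='r')    # vertical line across the whole axes
ax.axvline(p95, color='r')
ax.set_xlabel('height')

# Roughly 90% of the observations lie between the two lines.
inside = np.mean((heights >= p5) & (heights <= p95))
print(p5, p95, inside)
```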

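Back to the first dataset. We now consider the relation between two columns, in this case the height and the weight. We first compute the Pearson correlation in two different ways: stats.pearsonr returns the correlation coefficient together with a p-value, while the Pandas corr method returns only the coefficient.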

stats.pearsonr(data['height'], data['weight'])

(0.8718485977106544, 1.2573572624134316e-07)

data['height'].corr(data['weight'])

0.8718485977106544
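
For reference, the Pearson coefficient is the sum of products of the centred values divided by the square root of the product of their sums of squares. The sketch below computes it by hand and checks the result against np.corrcoef. It uses only the five rows from the data.head() output, so the value differs from the full-dataset result above.

```python
import numpy as np

# First five rows from the data.head() output (illustration only).
height = np.array([182, 193, 172, 170, 167], dtype=float)
weight = np.array([86, 112, 72, 61, 58], dtype=float)

# Centre each variable on its mean.
dh = height - height.mean()
dw = weight - weight.mean()

# Pearson r: covariance term normalised by both spreads.
r_manual = (dh * dw).sum() / np.sqrt((dh ** 2).sum() * (dw ** 2).sum())

# The off-diagonal entry of the 2x2 correlation matrix is the same r.
r_numpy = np.corrcoef(height, weight)[0, 1]

print(r_manual, r_numpy)
```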

The scatterplot is a basic tool for visualizing relations in data. Here is a short tour of the different plotting libraries, to showcase some ways you can draw scatterplots (and some density plots).

data.plot.scatter('height', 'weight');

sns.lmplot(x='height', y='weight', data=data);

sns.lmplot(x='height', y='weight', data=data, hue='gender');

sns.lmplot(x='height', y='weight', data=data, hue='gender', fit_reg=False);

sns.jointplot(x='height', y='weight', data=data, kind='scatter');


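# recent seaborn versions require x= and y= as keyword arguments
sns.kdeplot(x=data['height'], y=data['weight']);

# fill=True replaces the deprecated shade=True
sns.kdeplot(x=data['height'], y=data['weight'], fill=True);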

plt.scatter(data['height'], data['weight']);

men = data[data['gender'] == 'm']
women = data[data['gender'] == 'f']
plt.scatter(men['height'], men['weight'], color='r')
plt.scatter(women['height'], women['weight'], color='b');
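
The split above uses a boolean mask: data['gender'] == 'm' produces one True/False value per row, and indexing with it keeps only the True rows. When there are more than two categories, or when only a per-group summary is needed, groupby is a common alternative. A minimal sketch on the five rows from the data.head() output (the names sample, groups, and mean_height are made up for the illustration):

```python
import pandas as pd

# First five rows from the data.head() output (illustration only).
sample = pd.DataFrame({
    'height': [182, 193, 172, 170, 167],
    'weight': [86, 112, 72, 61, 58],
    'gender': ['m', 'm', 'm', 'f', 'f'],
})

# Boolean-mask selection, as in the cell above.
men = sample[sample['gender'] == 'm']

# groupby yields one (key, sub-frame) pair per distinct gender value.
groups = {key: frame for key, frame in sample.groupby('gender')}

# Per-group summary without building the sub-frames by hand.
mean_height = sample.groupby('gender')['height'].mean()
print(mean_height)
```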

plt.plot(men['height'], men['weight'], 'g.', 
         women['height'], women['weight'], 'r.');

# recent seaborn versions require x= and y= as keyword arguments
sns.kdeplot(x=men['height'], y=men['weight'], cmap='Reds')
sns.kdeplot(x=women['height'], y=women['weight'], cmap='Blues');

Output of data.head():

   height  weight gender
0     182      86      m
1     193     112      m
2     172      72      m
3     170      61      f
4     167      58      f

Output of data.describe():

           height      weight
count   22.000000   22.000000
mean   169.590909   69.636364
std     10.140250   15.941994
min    152.000000   47.000000
25%    165.000000   58.750000
50%    167.000000   66.000000
75%    177.000000   82.000000
max    193.000000  112.000000