We'll do some basic data processing: reading the data, computing descriptive statistics, and plotting histograms and scatterplots. We'll do this in a few different ways, to show how the various libraries are used. First, we import the Python libraries we'll use:
import pandas as pd
import matplotlib.pyplot as plt # basic plotting library
import seaborn as sns # additional plotting functions
import scipy
import scipy.stats as stats
import numpy as np
plt.style.use('seaborn-v0_8-darkgrid') # nicer looking plots (this style was named 'seaborn-darkgrid' before Matplotlib 3.6)
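We use the utility function read_csv from Pandas to read a CSV (comma-separated values) file. This function returns a Pandas DataFrame, which is essentially a table with named columns. Our file is separated by spaces rather than commas and has no header row, so we pass sep=' ' and header=None and supply the column names ourselves.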
data = pd.read_csv('bodies.txt', sep=' ', header=None,
names=['height', 'weight', 'gender'])
data.head()
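To make the effect of these read_csv arguments concrete, here is a minimal sketch that parses a few rows from an in-memory string instead of a file. The three rows are made-up values in the layout we assume bodies.txt uses (height, weight, gender separated by single spaces); they are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical sample rows, assumed layout: height weight gender
sample_text = "180 75 m\n165 60 f\n172 68 f\n"

# Same arguments as above: space-separated, no header row, explicit column names
sample = pd.read_csv(io.StringIO(sample_text), sep=' ', header=None,
                     names=['height', 'weight', 'gender'])

print(sample)
print(sample.dtypes)  # the numeric columns are parsed as integers, gender as text
```

Without header=None, Pandas would treat the first data row as column names, and that row would be missing from the table.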
The method describe() computes some basic descriptive statistics about the Pandas DataFrame.
data.describe()
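Compute the same basic descriptive statistics using NumPy functions and Pandas methods.
# scipy.mean and scipy.median were deprecated aliases of the NumPy functions, so we call NumPy directly
np.mean(data['height']), np.median(data['height'])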
data['height'].mean(), data['height'].median()
np.percentile(data['height'], 25)
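As a cross-check, the following sketch rebuilds each number that describe() reports for a single column, using only NumPy, and prints the two side by side. The five height values are made up for illustration. One detail worth noting: describe() reports the sample standard deviation, so the NumPy call needs ddof=1 to match.

```python
import numpy as np
import pandas as pd

# Hypothetical heights, used only to keep the example self-contained
heights = pd.Series([160.0, 165.0, 170.0, 175.0, 180.0])
values = heights.to_numpy()

summary = {
    'count': len(values),
    'mean': np.mean(values),
    'std': np.std(values, ddof=1),   # ddof=1 gives the sample standard deviation
    'min': np.min(values),
    '25%': np.percentile(values, 25),
    '50%': np.median(values),
    '75%': np.percentile(values, 75),
    'max': np.max(values),
}

described = heights.describe()
for name, value in summary.items():
    print(name, value, described[name])
```

With the default ddof=0, np.std divides by n instead of n - 1 and gives a slightly smaller value than describe().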
Plot histograms using matplotlib, Pandas, and Seaborn.
data['height'].hist();
plt.hist(data.height, bins=10);
sns.histplot(data['height'], bins=10, kde=True, stat='density'); # replaces distplot, which is deprecated since Seaborn 0.11
A slightly more complicated example, where we first plot a histogram using Seaborn and then add two red lines marking the 5th and 95th percentiles. (In this example, we're using a different dataset in order to make the histogram look a bit nicer.)
data_synthetic = pd.read_csv('synthetic.txt', sep=' ',
header=None, names=['height', 'weight', 'gender'])
sns.histplot(data_synthetic['height'], bins=10, kde=True, stat='density'); # density scale, so the red lines below fit the y-axis
p5 = data_synthetic['height'].quantile(0.05)
p95 = data_synthetic['height'].quantile(0.95)
plt.plot([p5, p5], [0, 0.05], 'r')
plt.plot([p95, p95], [0, 0.05], 'r');
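The red lines above are drawn up to a fixed height of 0.05, which only looks right if the y-axis has roughly that scale. A possible alternative, sketched below, is Matplotlib's axvline, which always spans the full height of the axes. The data here is a hypothetical stand-in (random draws from a normal distribution), not the contents of synthetic.txt.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the height column: 1000 draws from a normal distribution
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=1000)

# 5th and 95th percentiles in one call
p5, p95 = np.percentile(heights, [5, 95])

fig, ax = plt.subplots()
ax.hist(heights, bins=10, density=True)
# axvline spans the full height of the axes, whatever the y-scale is
ax.axvline(p5, color='r')
ax.axvline(p95, color='r')
```

By construction, about 90% of the values lie between the two lines.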
Back to the first dataset. We now consider the relation between two columns, in this case the height and the weight. We first compute the correlation (in two different ways).
stats.pearsonr(data['height'], data['weight'])
data['height'].corr(data['weight'])
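Note that stats.pearsonr returns the correlation coefficient together with a p-value, while the Pandas corr method returns only the coefficient. To show what that coefficient is, the sketch below computes Pearson's r directly from its definition (the sum of products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with np.corrcoef. The height and weight values are made up for illustration.

```python
import numpy as np

# Hypothetical paired measurements
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight = np.array([55.0, 62.0, 66.0, 74.0, 80.0])

# Deviations from the respective means
dx = height - height.mean()
dy = weight - weight.mean()

# Pearson's r from the definition
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r)
print(np.corrcoef(height, weight)[0, 1])  # should agree up to rounding error
```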
The scatterplot is a basic tool for visualizing relations in data. Here is a short tour of the different plotting libraries, to showcase some ways you can draw scatterplots (and some density plots).
data.plot.scatter('height', 'weight');
sns.lmplot(x='height', y='weight', data=data);
sns.lmplot(x='height', y='weight', data=data, hue='gender');
sns.lmplot(x='height', y='weight', data=data, hue='gender', fit_reg=False);
sns.jointplot(x='height', y='weight', data=data, kind='scatter');
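sns.kdeplot(x=data['height'], y=data['weight']); # recent Seaborn versions require x= and y= as keywords
sns.kdeplot(x=data['height'], y=data['weight'], fill=True); # fill replaces the deprecated shade parameter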
plt.scatter(data['height'], data['weight']);
men = data[data['gender'] == 'm']
women = data[data['gender'] == 'f']
plt.scatter(men['height'], men['weight'], color='r')
plt.scatter(women['height'], women['weight'], color='b');
plt.plot(men['height'], men['weight'], 'g.',
women['height'], women['weight'], 'r.');
sns.kdeplot(x=men['height'], y=men['weight'], cmap='Reds')
sns.kdeplot(x=women['height'], y=women['weight'], cmap='Blues');