Assignment 1: Basic data analysis and simulating probability distributions

In this assignment, you will first analyse some real estate data, and then simulate some random processes corresponding to common statistical distributions and models.

Work in groups of two or three and solve the tasks described below. Write a short report containing your answers, including the plots, and create a zip file containing the report and your Python code.

Alternatively, write a Jupyter notebook including your code, plots, and comments. In this case, when you are finished editing, re-run all the cells to make sure they work and then convert your notebook into a pdf (using the print function). Submit both the .ipynb file and the .pdf file.

Submit your solution through the Canvas website.

Deadline: November 13

Didactic purpose of this assignment:

practice some basic analysis of numerical data, using statistical libraries in Python;
get a gut feeling for the scenarios underlying some of the most common models used in statistics and data science;
get some experience in generating synthetic data by simulating (simplified) models.

References

In Lecture 1, we saw how to plot histograms and compute basic descriptive statistics [examples], and simulate some simple random processes [overview, example 1, example 2].
Matplotlib reference documentation.
Pandas reference documentation.
NumPy random documentation.

Part 1: Real estate prices

Your first task is to carry out a little bit of data exploration using Python tools.

Here is a CSV (comma-separated values) file listing real estate sales in England between 1995 and 2016. (Actually, to make things a bit faster it's only a subset.)

Load the CSV file into Python. Use the Pandas function read_csv or use one of the techniques you learned in the course Introduction to Data Science.
The second column in the CSV file represents the price of the property. Compute basic descriptive statistics about the prices in the whole dataset: mean, median, standard deviation, minimum, and maximum.
Plot a histogram that shows the distribution of the prices. Hint: why is it so ugly? What can you do to make it more informative?
Is real estate more expensive in London? Plot histograms for the two subsets of properties inside and outside London, respectively. For practical purposes, we can define "inside London" to mean that the string in the 13th column includes the string LONDON.
Optional task. Make a plot that shows the average price per year.
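As a rough illustration of selecting columns by position, here is a minimal sketch. It assumes the file has no header row, and it uses a few made-up rows in place of the real data, so the rows and the names csv_text, in_london, london_prices and other_prices are hypothetical. With header=None, the second column gets the label 1 and the 13th column the label 12.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: 13 columns, no header row.
# Only the 2nd column (price) and the 13th column (place name) matter here.
rows = [
    ["id1", 100000] + ["x"] * 10 + ["GREATER LONDON"],
    ["id2", 250000] + ["x"] * 10 + ["DEVON"],
    ["id3", 900000] + ["x"] * 10 + ["GREATER LONDON"],
    ["id4", 150000] + ["x"] * 10 + ["KENT"],
]
csv_text = "\n".join(",".join(str(v) for v in row) for row in rows)

# header=None makes Pandas label the columns 0, 1, 2, ... by position.
df = pd.read_csv(io.StringIO(csv_text), header=None)

prices = df[1]  # second column
stats = prices.agg(["mean", "median", "std", "min", "max"])
print(stats)

# Boolean mask: True where the 13th column contains the string LONDON.
in_london = df[12].str.contains("LONDON")
london_prices = prices[in_london]
other_prices = prices[~in_london]
print(london_prices.median(), other_prices.median())
```

With the real file, the two price series could then be passed to a histogram function such as Matplotlib's hist.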

Part 2: Generating random numbers (quick detour)

Consider the random number generation functions in NumPy, documented here.

Generate a set of random numbers using the function rand and plot its histogram. What is the shape of this histogram and why?
Investigate how the shape of the histogram is affected by the number of random numbers you have generated.
Instead of using rand (which corresponds to a uniform distribution), generate numbers using some other distribution and plot a histogram. What is the shape now? For instance, with the function normal, which draws from the normal (or Gaussian) distribution, you should get the familiar bell shape.
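To see the effect of the sample size without any plotting code, one option is to look at the bin counts directly. The sketch below assumes 10 equal-width bins and uses np.histogram in place of a plot; the sample sizes are arbitrary choices.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

for n in [100, 10_000, 100_000]:
    samples = np.random.rand(n)  # n uniform numbers in [0, 1)
    counts, edges = np.histogram(samples, bins=10, range=(0.0, 1.0))
    fractions = counts / n  # share of the samples falling in each bin
    spread = fractions.max() - fractions.min()
    print(n, np.round(fractions, 3), round(float(spread), 4))
```

The printed spread between the fullest and the emptiest bin tends to shrink as n grows, which is the same effect you should see when plotting the histograms.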

Part 3: Simulating probabilistic models

In the final part of the assignment, you will write some code to simulate a few different scenarios that correspond to well-known statistical models.

Please note. When you have implemented the code for these three scenarios, please reflect on how well you think the models correspond to the real world. What are the simplifying assumptions? Please discuss in your report.

(a) Modeling a student at an exam

Let's make a model of a student that answers questions in an exam. The exam consists of a fixed set of questions, and a student answers each question correctly or incorrectly with some fixed probability. We will now implement this model in a step-by-step fashion.

Answering a single question

Write a Python function that simulates the student answering a single question, either correctly or incorrectly. The function should return a Boolean value (that is, True or False) that says whether the question was answered correctly. You can assume that the probability of a correct answer is a given parameter p_success.

def success(p_success):
    ... YOUR CODE HERE ...

Run this function a few times and check that it seems to work correctly.
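One way to check the function more systematically is to call it many times and compare the fraction of True results with p_success. The sketch below is only an illustration: the body of success is a stand-in (one possible implementation, not the required one), and empirical_rate is a hypothetical helper.

```python
import random

def success(p_success):
    # Stand-in implementation (an assumption, not the required solution):
    # random.random() is uniform on [0, 1), so it falls below p_success
    # with probability p_success.
    return random.random() < p_success

def empirical_rate(p_success, n_trials=10_000):
    # Fraction of n_trials calls that returned True.
    hits = sum(success(p_success) for _ in range(n_trials))
    return hits / n_trials

random.seed(1)
for p in [0.0, 0.3, 0.8, 1.0]:
    print(p, empirical_rate(p))
```

The printed rates should be close to the requested probabilities, and exactly 0 and 1 at the two extremes.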

How many correctly answered questions?

Next, we make another function called exam_score that simulates a scenario where the student answers a fixed set of questions. We assume that all questions are equally difficult. As inputs, your function needs the number of questions, as well as the probability of a correct answer. The function should return the number of correctly answered questions. To implement this, it seems natural to use the function success that you developed previously.

def exam_score(p_correct, n_instances):
    ... YOUR CODE HERE ...

Again, run the function a few times and check that it seems to work as it should.

Investigating the distribution

Write some code to call exam_score several times, and collect the result of all the calls in a simple Python list, NumPy array, or Pandas Series.

Let the value of p_correct be 0.8 and n_instances be 20. Run exam_score 10,000 times and collect the results. Then plot a histogram of the results.
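The experiment could be sketched as follows. The function bodies are stand-ins written for this illustration (your own implementations may differ), and a text tally replaces the histogram so that the sketch needs only the standard library. Since each question is an independent trial with the same probability, the scores should follow a binomial distribution, whose mean is n_instances times p_correct, here 16.

```python
import random
from collections import Counter

def success(p_success):
    # Stand-in: True with probability p_success.
    return random.random() < p_success

def exam_score(p_correct, n_instances):
    # Count the True results over n_instances independent questions.
    return sum(success(p_correct) for _ in range(n_instances))

random.seed(2)
scores = [exam_score(0.8, 20) for _ in range(10_000)]

# Text "histogram": one '#' per 50 exams with that score.
tally = Counter(scores)
for score in range(21):
    print(f"{score:2d} {'#' * (tally[score] // 50)}")

mean_score = sum(scores) / len(scores)
print("mean score:", mean_score)
```

The same list scores can be handed to a plotting function to get the histogram asked for in the task.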

(b) The persistent student

We will now simulate a scenario where a student takes an exam repeatedly, until passing.

If a student does not pass an exam, the University of Gothenburg allows the student to go to an unlimited number of re-sit exams. Let's assume that students never give up, so that they will go to the exam again and again until they finally pass. Write a function that simulates a student going to exams until passing, and returns the number of attempts the student needed before passing. You can assume that the probability of passing a single exam is a constant p_pass. If you want, you can reuse your function success from the previous task: in this case, this would mean a passed exam, not just a correctly answered question.

def number_of_attempts(p_pass):
    ... YOUR CODE HERE ...

Investigating the distribution

Simulate this model multiple times, as in (a). For instance, let p_pass be 0.4. Plot the result using a histogram.
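A possible sketch of this simulation is given below. It assumes that the returned count includes the final, successful attempt, and that p_pass is greater than zero (otherwise the loop never ends). Under these assumptions the counts follow a geometric distribution with mean 1 / p_pass, which is 2.5 for p_pass = 0.4.

```python
import random
from collections import Counter

def number_of_attempts(p_pass):
    # Assumption: the passing attempt is counted, so the result is at least 1.
    # Requires p_pass > 0, otherwise the loop would never terminate.
    attempts = 1
    while random.random() >= p_pass:  # failed this exam, go to the re-sit
        attempts += 1
    return attempts

random.seed(3)
results = [number_of_attempts(0.4) for _ in range(10_000)]

# Text "histogram": one '#' per 100 students needing that many attempts.
tally = Counter(results)
for k in sorted(tally):
    print(f"{k:2d} {'#' * (tally[k] // 100)}")

mean_attempts = sum(results) / len(results)
print("mean attempts:", mean_attempts)
```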

(c) An unusual village

The inhabitants of Normlösa, a small village in the fertile plains of eastern Sweden, are infamous not only for their unscrupulous behavior but also because the males in the village are exceptionally short and stocky, while the female villagers tend to be tall and lean. Geneticists from nearby Linköping University have so far failed to come up with a credible explanation of this remarkable tendency.

Write a Python function to generate the height and weight of a random inhabitant of Normlösa. Use the following process:

first, randomly select the gender of the villager; the proportion of males in this village is about 40%.
then draw random numbers from a Gaussian distribution (normal distribution) for the height and weight of the person; for this, you might use the NumPy function np.random.normal(loc, scale), where loc is the mean and scale the standard deviation.
- for males, the mean height is 140 and the height standard deviation is 15; the mean weight is 90 and the weight standard deviation is 10;
- for females, the mean height is 195 and the height standard deviation is 10; the mean weight is 60 and the weight standard deviation is 5.

Generate a dataset consisting of height–weight pairs for 50 Normlösa inhabitants. Make a scatterplot of the height–weight data.
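The two-step process above could be sketched like this. The function name random_villager and the array name villagers are choices made for this illustration, and the parameter values are the ones listed in the task.

```python
import numpy as np

def random_villager():
    # Step 1: pick the gender; about 40% of the villagers are male.
    is_male = np.random.rand() < 0.4
    # Step 2: draw height and weight from gender-specific Gaussians.
    if is_male:
        height = np.random.normal(140, 15)
        weight = np.random.normal(90, 10)
    else:
        height = np.random.normal(195, 10)
        weight = np.random.normal(60, 5)
    return height, weight

np.random.seed(3)
# One row per villager: column 0 is the height, column 1 the weight.
villagers = np.array([random_villager() for _ in range(50)])
print(villagers.shape)  # (50, 2)
print(villagers[:5].round(1))
```

For the scatterplot, the two columns villagers[:, 0] and villagers[:, 1] can be passed as the x and y values to Matplotlib's scatter function.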

Let's pretend for a moment that you have been given the datapoints (the list of height–weight pairs) but you have no information about how they were generated. Could you think of a way to reconstruct the parameters you used in the code previously? For example, that the proportion of males is 40%, that the mean weight of a female is 60 kilograms, etc. (NB: this question should be easy if you have taken the course Introduction to Data Science. As an optional task, you may write code to reconstruct these parameters using the methods presented there. If you haven't taken that course, and have no idea how to answer the question, please discuss with the lab instructor.)
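As one possible line of attack, and not necessarily the method taught in Introduction to Data Science, the points can be split into two groups and the parameters estimated within each group. The sketch below uses a simple two-means clustering on height alone, which works here only because the two groups are well separated; fitting a mixture of Gaussians would be the more principled choice. All function names are hypothetical.

```python
import numpy as np

def make_villagers(n):
    # Same generative process as in the task text, vectorised over n villagers.
    is_male = np.random.rand(n) < 0.4
    height = np.where(is_male, np.random.normal(140, 15, n),
                      np.random.normal(195, 10, n))
    weight = np.where(is_male, np.random.normal(90, 10, n),
                      np.random.normal(60, 5, n))
    return height, weight

def split_by_height(height, n_iter=20):
    # Two-means clustering on height: start the two centres at the extremes,
    # then alternate between assigning points and updating the centres.
    centers = np.array([height.min(), height.max()])
    for _ in range(n_iter):
        labels = np.abs(height[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([height[labels == k].mean() for k in (0, 1)])
    return labels == 0  # True for the shorter group

def estimate_parameters(height, weight):
    # Estimate the group share and the per-group means and standard deviations.
    short = split_by_height(height)
    summary = {"share_short": short.mean()}
    for name, mask in [("short", short), ("tall", ~short)]:
        summary[name] = {
            "height_mean": height[mask].mean(),
            "height_std": height[mask].std(ddof=1),
            "weight_mean": weight[mask].mean(),
            "weight_std": weight[mask].std(ddof=1),
        }
    return summary

np.random.seed(4)
height, weight = make_villagers(50)
print(estimate_parameters(height, weight))
```

With only 50 points the estimates are noisy, and the few villagers who land on the wrong side of the split bias them slightly. Note also that the clustering finds a short and a tall group, but cannot by itself tell which group is male.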