In this assignment, you will first analyse some real estate data, and then simulate some random processes corresponding to common statistical distributions and models.
Work in groups of two or three and solve the tasks described below. Write a short report containing your answers, including the plots, and create a zip file containing the report and your Python code.
Alternatively, write a Jupyter notebook including your code, plots, and comments. In this case, when you are finished editing, re-run all the cells to make sure they work and then convert your notebook into a pdf (using the print function). Submit both the .ipynb file and the .pdf file.
Submit your solution through the Canvas website.
Deadline: November 13
Didactic purpose of this assignment:
Your first task is to carry out a little bit of data exploration using Python tools.
Here is a CSV (comma-separated values) file listing real estate sales in England between 1995 and 2016. (Actually, to make things a bit faster it's only a subset.)
read_csv or use one of the techniques you learned in the course Introduction to Data Science.
Consider the random number generation functions in NumPy, documented here.
rand and plot its histogram. What is the shape of this histogram and why?
rand (which corresponds to a uniform distribution), generate numbers using some other distribution and plot a histogram. What is the shape now?
For instance, with normal, the normal (or Gaussian) distribution, you should get the familiar bell shape.
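As a rough illustration of the two shapes, the sketch below draws samples with np.random.rand and np.random.normal and counts how many fall into each histogram bin. The sample size, the seed, the number of bins, and the bin ranges are arbitrary choices made for this example, not part of the assignment.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the example is repeatable (arbitrary choice)

n_samples = 100_000
uniform_samples = np.random.rand(n_samples)                 # uniform on [0, 1)
normal_samples = np.random.normal(0.0, 1.0, n_samples)      # mean 0, standard deviation 1

# Count how many samples fall into each of 10 equal-width bins.
uniform_counts, uniform_edges = np.histogram(uniform_samples, bins=10, range=(0.0, 1.0))
normal_counts, normal_edges = np.histogram(normal_samples, bins=10, range=(-4.0, 4.0))

print("uniform:", uniform_counts)  # roughly the same count in every bin
print("normal: ", normal_counts)   # counts are largest in the middle bins
```

To draw the histograms rather than print the counts, the same arrays can be passed to matplotlib, for example plt.hist(uniform_samples, bins=10) after import matplotlib.pyplot as plt.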
In the final part of the assignment, you will write some code to simulate a few different scenarios that correspond to well-known statistical models.
Please note: when you have implemented the code for these three scenarios, reflect on how well you think the models correspond to the real world. What are the simplifying assumptions? Please discuss this in your report.
Let's make a model of a student that answers questions in an exam. The exam consists of a fixed set of questions, and a student answers each question correctly or incorrectly with some fixed probability. We will now implement this model in a step-by-step fashion.
Answering a single question
Write a Python function that simulates that the student answers a single question either correctly or incorrectly.
The function should return a Boolean value (that is, True or False) that says whether the question was answered correctly.
You can assume that the probability of a correct answer is a given parameter p_success.
def success(p_success):
    ... YOUR CODE HERE ...
Run this function a few times and check that it seems to work correctly.
Formally, we say that this function simulates a random variable with a Bernoulli distribution. The metaphor typically used is that of a coin toss with an unfair coin.
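One possible way to simulate such a Bernoulli variable is to draw a uniform random number and compare it with p_success. The following is only a sketch of that idea, not the required solution; using np.random.rand for the draw is an assumption, and other random number functions would work equally well.

```python
import numpy as np

def success(p_success):
    """Return True with probability p_success, otherwise False."""
    # np.random.rand() draws one number uniformly from [0, 1),
    # so the comparison below is True with probability p_success.
    return bool(np.random.rand() < p_success)

# Ten simulated answers from a student who is right 80% of the time.
print([success(0.8) for _ in range(10)])
```

The call to bool converts NumPy's own Boolean type into a plain Python True or False.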
How many correctly answered questions?
Next, we make another function called exam_score that simulates a scenario where the student answers a fixed set of questions. We assume that all questions are equally difficult.
As inputs, your function needs the number of questions, as well as the probability of a correct answer.
The function should return the number of correctly answered questions.
To implement this, it seems natural to use the function success that you developed previously.
def exam_score(p_correct, n_instances):
    ... YOUR CODE HERE ...
Again, run the function a few times and check that it seems to work as it should.
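A minimal sketch of exam_score built on top of success could look as follows. The body of success shown here is an assumed implementation (a uniform draw compared with the probability), repeated only so that the example runs on its own.

```python
import numpy as np

def success(p_success):
    # Assumed implementation: True with probability p_success.
    return bool(np.random.rand() < p_success)

def exam_score(p_correct, n_instances):
    """Count the correct answers among n_instances independent questions."""
    # Each True counts as 1 and each False as 0 when summed.
    return sum(success(p_correct) for _ in range(n_instances))

# Five simulated exams with 20 questions each.
print([exam_score(0.8, 20) for _ in range(5)])
```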
Investigating the distribution
Write some code to call exam_score several times, and collect the result of all the calls in a simple Python list, NumPy array, or Pandas Series.
Let the value of p_correct be 0.8 and n_instances be 20. Run exam_score 10,000 times and collect the results. Then plot a histogram of the results.
This type of scenario corresponds to the binomial distribution, which we will discuss formally in the next lecture. The typical explanation is that we toss an unfair coin a given number of times and count the number of times the heads side came up.
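To see the connection with the binomial distribution, the simulated score frequencies can be compared with the binomial probabilities C(n, k) p^k (1 - p)^(n - k). The sketch below does this for p_correct = 0.8 and n_instances = 20. The exam_score body is an assumed, vectorised implementation used only to keep the example self-contained, and the seed is an arbitrary choice.

```python
import math
import numpy as np

np.random.seed(0)  # arbitrary seed for repeatability

def exam_score(p_correct, n_instances):
    # Assumed implementation: draw all answers at once and count the correct ones.
    return int(np.sum(np.random.rand(n_instances) < p_correct))

p_correct, n_instances, n_runs = 0.8, 20, 10_000
scores = np.array([exam_score(p_correct, n_instances) for _ in range(n_runs)])

# Fraction of runs that ended with each possible score 0..n_instances.
empirical = np.bincount(scores, minlength=n_instances + 1) / n_runs

# Binomial probability of each score.
theoretical = np.array([
    math.comb(n_instances, k) * p_correct**k * (1 - p_correct)**(n_instances - k)
    for k in range(n_instances + 1)
])

print("mean score:", scores.mean(), "expected:", n_instances * p_correct)
print("largest gap between simulation and formula:",
      np.abs(empirical - theoretical).max())
```

A histogram of scores, for example with matplotlib's plt.hist, should show the same peak around 16 correct answers.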
We will now simulate a scenario where a student takes an exam repeatedly, until passing.
If a student does not pass an exam, the University of Gothenburg allows the student to go to an unlimited number of re-sit exams.
Let's assume that students never give up, so that they will go to the exam again and again until they finally pass.
Write a function that simulates a student going to exams until passing, and returns the number of attempts the student needed before passing. You can assume that the probability of passing a single exam is a constant p_pass.
If you want, you can reuse your function success from the previous task: in this case, this would mean a passed exam, not just a correctly answered question.
def number_of_attempts(p_pass):
    ... YOUR CODE HERE ...
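One way to fill in this function is a loop that keeps drawing until the first pass. The sketch below assumes that the returned count includes the attempt on which the student finally passes, so the smallest possible result is 1; the task text can also be read differently. It also assumes p_pass is greater than 0, since otherwise the loop would never end.

```python
import numpy as np

def number_of_attempts(p_pass):
    """Count exam attempts up to and including the first pass (assumes p_pass > 0)."""
    attempts = 1
    # A draw that is >= p_pass is a failed exam, so the student tries again.
    while np.random.rand() >= p_pass:
        attempts += 1
    return attempts

# Attempts needed by ten simulated students.
print([number_of_attempts(0.4) for _ in range(10)])
```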
Investigating the distribution
Simulate this model multiple times, as in (a). For instance, let p_pass be 0.4. Plot the result using a histogram.
This type of scenario corresponds to the geometric distribution.
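As a sanity check, the simulated frequencies can be compared with the geometric probabilities (1 - p)^(k - 1) p for k = 1, 2, ..., whose mean is 1/p. The sketch below assumes the attempt count includes the passing attempt; the number_of_attempts body, the number of runs, and the seed are assumptions made for this example.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for repeatability

def number_of_attempts(p_pass):
    # Assumed implementation: count attempts up to and including the first pass.
    attempts = 1
    while np.random.rand() >= p_pass:
        attempts += 1
    return attempts

p_pass, n_runs = 0.4, 10_000
attempts = np.array([number_of_attempts(p_pass) for _ in range(n_runs)])

# Fraction of students who needed exactly k attempts, for k = 1..5.
ks = np.arange(1, 6)
empirical = np.array([(attempts == k).mean() for k in ks])
theoretical = (1 - p_pass) ** (ks - 1) * p_pass

print("mean attempts:", attempts.mean(), "expected:", 1 / p_pass)
for k, e, t in zip(ks, empirical, theoretical):
    print(f"k={k}: simulated {e:.3f}, formula {t:.3f}")
```

The histogram of attempts should be highest at 1 and fall off by a constant factor of 1 - p_pass from each bar to the next.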
The inhabitants of Normlösa, a small village in the fertile plains of eastern Sweden, are infamous not only for their unscrupulous behavior but also because the males in the village are exceptionally short and stocky, while the female villagers tend to be tall and lean. Geneticists from nearby Linköping University have so far failed to come up with a credible explanation of this remarkable tendency.
Write a Python function to generate the height and weight of a random inhabitant of Normlösa. Use the following process:
np.random.normal(loc, scale), where loc is the mean and scale the standard deviation.
Generate a dataset consisting of height–weight pairs for 50 Normlösa inhabitants. Make a scatterplot of the height–weight data.
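The sketch below shows one possible shape for such a generator: first decide the sex of the inhabitant, then draw height and weight from normal distributions that depend on it. Apart from the 40% share of males and the 60 kg mean female weight, which are mentioned as examples in the reconstruction question of this task, every number in it is a placeholder invented for this example; substitute the values given in the assignment.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed for repeatability

# Placeholder parameters (hypothetical): (mean, standard deviation) in cm and kg.
P_MALE = 0.4
MALE = {"height": (165.0, 5.0), "weight": (85.0, 7.0)}
FEMALE = {"height": (178.0, 5.0), "weight": (60.0, 5.0)}

def random_inhabitant():
    """Return a (height, weight) pair for one randomly generated inhabitant."""
    # Pick the parameter set for a male with probability P_MALE, else for a female.
    params = MALE if np.random.rand() < P_MALE else FEMALE
    height = np.random.normal(*params["height"])
    weight = np.random.normal(*params["weight"])
    return height, weight

# One row per inhabitant: column 0 is height, column 1 is weight.
data = np.array([random_inhabitant() for _ in range(50)])
print(data.shape)
```

A scatterplot can then be drawn with matplotlib, for example plt.scatter(data[:, 0], data[:, 1]).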
Let's pretend for a moment that you have been given the datapoints (the list of height–weight pairs) but you have no information about how they were generated. Could you think of a way to reconstruct the parameters you used in the code previously? For example, that the proportion of males is 40%, that the mean weight of a female is 60 kilograms, etc. (NB: this question should be easy if you have taken the course Introduction to Data Science. As an optional task, you may write code to reconstruct these parameters using the methods presented there. If you haven't taken that course, and have no idea how to answer the question, please discuss with the lab instructor.)
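One conceivable approach, sketched below, is to split the points into two clusters and then read off the proportion, mean, and standard deviation of each cluster. The sketch uses a small hand-written k-means loop; this is an assumption about what a suitable method might be, not necessarily the one taught in the other course. The data are synthetic stand-ins generated with placeholder parameters, and a larger sample than 50 is used so that the estimates are stable.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Synthetic stand-in data with placeholder (hypothetical) parameters.
n = 2000
is_male = rng.random(n) < 0.4
height = np.where(is_male, rng.normal(165, 5, n), rng.normal(178, 5, n))
weight = np.where(is_male, rng.normal(85, 7, n), rng.normal(60, 5, n))
points = np.column_stack([height, weight])

# Start from the lightest and the heaviest person, then alternate the two k-means steps.
centers = points[[weight.argmin(), weight.argmax()]].copy()
for _ in range(20):
    # Assignment step: index of the nearest center for every point.
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each center to the mean of the points assigned to it.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Cluster 0 started from the lightest person, cluster 1 from the heaviest.
for k in range(2):
    cluster = points[labels == k]
    print(f"cluster {k}: proportion={len(cluster) / n:.2f}, "
          f"mean (height, weight)={cluster.mean(axis=0).round(1)}, "
          f"std={cluster.std(axis=0).round(1)}")
```

Because every point is assigned entirely to one cluster, points in the overlap region are sometimes put in the wrong group, which biases the estimates slightly. Fitting a Gaussian mixture directly avoids this hard assignment.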
Formally, these data points are generated using a Gaussian mixture model. We will come back to this model and study it more extensively in later lectures.