Assignment 1: Basic data analysis and simulating probability distributions

In this assignment, you will first analyse some real estate data, and then simulate some random processes corresponding to common statistical distributions and models.

Work in groups of two or three and solve the tasks described below. Write a short report containing your answers, including the plots, and create a zip file containing the report and your Python code.

Alternatively, write a Jupyter notebook including your code, plots, and comments. In this case, when you are finished editing, re-run all the cells to make sure they work and then convert your notebook into a PDF (using the print function). Submit both the .ipynb file and the .pdf file.

Submit your solution through the Canvas website.

Deadline: November 13

Didactic purpose of this assignment:

References

Part 1: Real estate prices

Your first task is to carry out a little bit of data exploration using Python tools.

Here is a CSV (comma-separated values) file listing real estate sales in England between 1995 and 2016. (Actually, to make things a bit faster it's only a subset.)

Part 2: Generating random numbers (quick detour)

Consider the random number generation functions in NumPy, documented here.
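As a quick orientation, here is a minimal sketch of a few commonly used calls on NumPy's Generator object. The seed value and the distribution parameters below are arbitrary illustrations, not values taken from this assignment.

```python
import numpy as np

# A seeded generator makes runs reproducible; the seed 42 is an arbitrary choice.
rng = np.random.default_rng(42)

u = rng.random()                                      # one float, uniform on [0, 1)
die_rolls = rng.integers(1, 7, size=10)               # ten integers from 1 to 6 (upper bound excluded)
heights = rng.normal(loc=170.0, scale=10.0, size=5)   # five draws from a normal distribution
coin_flips = rng.random(size=1000) < 0.3              # Boolean array, True with probability 0.3

print(u)
print(die_rolls)
print(heights)
print(coin_flips.mean())  # fraction of True values, expected to be near 0.3
```

Comparing a uniform draw against a threshold, as in the last call, is one simple way to turn a uniform random number into a yes/no outcome with a chosen probability.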

Part 3: Simulating probabilistic models

In the final part of the assignment, you will write some code to simulate a few different scenarios that correspond to well-known statistical models.

Please note: when you have implemented the code for these three scenarios, reflect on how well you think the models correspond to the real world. What are the simplifying assumptions? Please discuss this in your report.

(a) Modeling a student at an exam

Let's make a model of a student that answers questions in an exam. The exam consists of a fixed set of questions, and a student answers each question correctly or incorrectly with some fixed probability. We will now implement this model in a step-by-step fashion.

Answering a single question

Write a Python function that simulates the student answering a single question, either correctly or incorrectly. The function should return a Boolean value (that is, True or False) that says whether the question was answered correctly. You can assume that the probability of a correct answer is a given parameter p_success.

def success(p_success):
    ... YOUR CODE HERE ...

Run this function a few times and check that it seems to work correctly.
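One way to check the function is to call it many times and compare the fraction of True results with p_success. The sketch below assumes one possible implementation of success based on the standard library's random module; the seed and the number of trials are arbitrary choices for illustration.

```python
import random

def success(p_success):
    # One possible implementation (an assumption, not the only valid approach):
    # random.random() is uniform on [0, 1), so it falls below p_success
    # with probability p_success.
    return random.random() < p_success

random.seed(0)  # arbitrary seed, used only to make the run reproducible
n_trials = 10_000
fraction_correct = sum(success(0.8) for _ in range(n_trials)) / n_trials
print(fraction_correct)  # expected to be close to 0.8
```

The edge cases are also worth a look: with this implementation, success(1.0) always returns True and success(0.0) always returns False.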

How many correctly answered questions?

Next, we make another function called exam_score that simulates a scenario where the student answers a fixed set of questions. We assume that all questions are equally difficult. As inputs, your function needs the probability of a correct answer, as well as the number of questions. The function should return the number of correctly answered questions. To implement this, it seems natural to use the function success that you developed previously.

def exam_score(p_correct, n_instances):
    ... YOUR CODE HERE ...

Again, run the function a few times and check that it seems to work as it should.

Investigating the distribution

Write some code to call exam_score several times, and collect the result of all the calls in a simple Python list, NumPy array, or Pandas Series.

Let the value of p_correct be 0.8 and n_instances be 20. Run exam_score 10,000 times and collect the results. Then plot a histogram of the results.
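If each question is answered independently with the same probability, the score should follow a binomial distribution, which gives a reference to compare the histogram against. The following sketch tabulates the simulated frequencies next to the binomial probabilities; the stand-in exam_score, the seed, and the variable names are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def exam_score(p_correct, n_instances):
    # Stand-in implementation (an assumption): count the successes among
    # n_instances independent questions, each correct with probability p_correct.
    return sum(random.random() < p_correct for _ in range(n_instances))

random.seed(1)  # arbitrary seed, used only to make the run reproducible
p_correct, n_instances, n_runs = 0.8, 20, 10_000
scores = [exam_score(p_correct, n_instances) for _ in range(n_runs)]
counts = Counter(scores)

# Compare the simulated relative frequency of each score with the binomial probability.
for k in range(n_instances + 1):
    simulated = counts[k] / n_runs
    theoretical = math.comb(n_instances, k) * p_correct**k * (1 - p_correct)**(n_instances - k)
    print(f"{k:2d}  simulated={simulated:.4f}  binomial={theoretical:.4f}")
```

When plotting integer-valued results such as these with Matplotlib, passing integer-aligned bin edges (for example plt.hist(scores, bins=range(n_instances + 2))) keeps each score in its own bar.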

(b) The persistent student

We will now simulate a scenario where a student takes an exam repeatedly, until passing.

If a student does not pass an exam, the University of Gothenburg allows the student to go to an unlimited number of re-sit exams. Let's assume that students never give up, so that they will go to the exam again and again until they finally pass. Write a function that simulates a student going to exams until passing, and returns the number of attempts the student needed before passing. You can assume that the probability of passing a single exam is a constant p_pass. If you want, you can reuse your function success from the previous task: in this case, this would mean a passed exam, not just a correctly answered question.

def number_of_attempts(p_pass):
    ... YOUR CODE HERE ...

Investigating the distribution

Simulate this model multiple times, as in (a). For instance, let p_pass be 0.4. Plot the result using a histogram.
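Under the assumption of a constant pass probability, the number of attempts should follow a geometric distribution, whose mean is 1/p_pass when the passed exam is included in the count. The sketch below checks a simulated average against that value; the stand-in number_of_attempts, the counting convention, and the seed are assumptions made for illustration.

```python
import random

def number_of_attempts(p_pass):
    # Stand-in implementation (an assumption): counts every exam taken,
    # including the one that is finally passed. Requires p_pass > 0,
    # otherwise the loop never ends.
    attempts = 1
    while random.random() >= p_pass:
        attempts += 1
    return attempts

random.seed(2)  # arbitrary seed, used only to make the run reproducible
p_pass, n_runs = 0.4, 10_000
attempts = [number_of_attempts(p_pass) for _ in range(n_runs)]
mean_attempts = sum(attempts) / n_runs
print(mean_attempts, 1 / p_pass)  # simulated mean next to the theoretical mean
```

If the function is instead defined to count only the failed attempts before the pass, every value shifts down by one and the theoretical mean becomes 1/p_pass - 1.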

(c) An unusual village

The inhabitants of Normlösa, a small village in the fertile plains of eastern Sweden, are infamous not only for their unscrupulous behavior but also because the males in the village are exceptionally short and stocky, while the female villagers tend to be tall and lean. Geneticists from nearby Linköping University have so far failed to come up with a credible explanation of this remarkable tendency.

Write a Python function to generate the height and weight of a random inhabitant of Normlösa. Use the following process:

Generate a dataset consisting of height–weight pairs for 50 Normlösa inhabitants. Make a scatterplot of the height–weight data.

Let's pretend for a moment that you have been given the datapoints (the list of height–weight pairs) but you have no information about how they were generated. Could you think of a way to reconstruct the parameters you used in the code previously? For example, that the proportion of males is 40%, that the mean weight of a female is 60 kilograms, etc. (NB: this question should be easy if you have taken the course Introduction to data science. As an optional task, you may write code to reconstruct these parameters using the methods presented there. If you haven't taken that course, and have no idea how to answer the question, please discuss with the lab instructor.)