{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\qquad$ $\\qquad$$\\qquad$ **TDA 231 Machine Learning: Homework 2**
\n",
"$\\qquad$ $\\qquad$$\\qquad$ **Goal: Classification**
\n",
"$\\qquad$ $\\qquad$$\\qquad$ **Grader: Divya**
\n",
"$\\qquad$ $\\qquad$$\\qquad$ **Due Date: 23/4**
\n",
"$\\qquad$ $\\qquad$$\\qquad$ **Submitted by: Name, Personal no., email**
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"General guidelines:\n",
"* All solutions to theoretical problems, can be submitted as a single file named *report.pdf*. They can also be submitted in this ipynb notebook, but equations wherever required, should be formatted using LaTeX math-mode.\n",
"* All discussion regarding practical problems, along with solutions and plots should be specified here itself. We will not generate the solutions/plots again by running your code.\n",
"* Your name, personal number and email address should be specified above and also in your file *report.pdf*.\n",
"* All datasets can be downloaded from the course website.\n",
"* All tables and other additional information should be included."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Theoretical problems\n",
"\n",
"## [Naive Bayes Classifier, 6 points]\n",
"\n",
"A psychologist does a small survey on ''happiness''. Each respondent provides a vector with entries 1 or 0 corresponding to if they answered “yes” or “no” to a question respectively. The question vector has attributes \n",
"$$\n",
"x = (\\mbox{rich, married, healthy}) \\tag{1}\n",
"$$\n",
"\n",
"Thus a response $(1, 0, 1)$ would indicate that the respondent was\n",
"''rich'', ''unmarried'' and ''healthy''. In addition, each respondent\n",
"gives a value $c = 1$ if they are content wih their life and $c = 0$\n",
"if they’re not. The following responses were obtained.\n",
"\n",
"$$\n",
"c = 1: (1, 1, 1),(0, 0, 1),(1, 1, 0),(1, 0, 1) \\\\\n",
"c = 0: (0, 0, 0),(1, 0, 0),(0, 0, 1),(0, 1, 0)\n",
"$$\n",
"\n",
"1. Using naive Bayes, what is the probability that a person is ''not rich'', ''married'' and ''healthy'' is ''content''?\n",
"\n",
"2. What is the probability that a person who is ''not rich'' and ''married'' is content ? (i.e. we do not know if they are ''healthy'')\n",
"\n",
"## [Extending Naive Bayes, 4 points]\n",
"\n",
"Consider now, the following vector of attributes:\n",
"\n",
"* $x_1 = 1$ if customer is younger than 20 and 0 otherwise.\n",
"* $x_2 = 1$ if customer is between 20 and 30 in age, and 0 otherwise.\n",
"* $x_3 = 1$ if customer is older than 30 and 0 otherwise\n",
"* $x_4 = 1$ if customer walks to work and 0 otherwise.\n",
"\n",
"Each vector of attributes has a label ''rich'' or ''poor''. Point out potential difficulties with your approach above to training using naive Bayes. Suggest and describe how to extend your naive Bayes method to this dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Practical problems\n",
"\n",
"## [Bayes classifier, 5 points]\n",
"\n",
"Dowload the dataset **\"dataset2.txt\"**. You can use the following code for example:\n",
"```python\n",
"from numpy import genfromtxt\n",
"data = genfromtxt('dataset2.txt', delimiter=',')\n",
"labels = data[:,-1]\n",
"```\n",
"The dataset contains $3$-dimensional data, $X$, generated from $2$ classes with labels, $y$ either $+1$ or $-1$. Each row of $X$ and $y$ contain one observation and one label respectively. There are $1000$ instances of each class. \n",
"\n",
"a. Assume that the class conditional density is spherical Gaussian, and both classes have equal prior. Write the expression for the Bayes ( not **naive Bayes**) classifier i.e. derive\n",
"$$\n",
"P(y_{new} = -1 | x_{new} , X, y ) \\\\\n",
"P(y_{new} = +1 | x_{new} , X, y ) ~.\n",
"$$\n",
"\n",
"It is useful to note that the dependence on training data $X, y$ for class $1$ can be expressed as: \n",
"\n",
"$$ \n",
"P( x_{new} | y_{new} = 1, X, y) = P(x_{new} |\n",
"\\hat{\\mu}_{1}, \\hat{\\sigma}^{2}_{1})\n",
"$$\n",
"\n",
"where $\\hat{\\mu}_{1} \\in \\mathbb{R}^3$ and $\\hat{\\sigma}^{2}_{1}\\in \\mathbb{R}$ are MLE estimates for mean (3-dimensional) and variance based on training data with label $+1$ (and similarly for class 2 with label $-1$). \n",
"\n",
"b. Implement a function **sph_bayes()** which computes the probability of a new test point *Xtest* coming from class $1$ ($P1$) and class $2$ ($P2$). Finally, assign a label *Ytest* to the test point based on the probabilities $P1$ and $P2$.\n",
"\n",
"```python\n",
"def sph_bayes(Xtest, ...): # other parameters needed.\n",
"\n",
" return [P1, P2, Ytest]\n",
"```\n",
"c. Write a function **new_classifier()**\n",
"\n",
"```python\n",
"def new_classifier(Xtest, mu1, mu2)\n",
" \n",
" return [Ytest]\n",
"```\n",
"which implements the following classifier,\n",
"$$\n",
"f(x) = \\mbox{sign}\\left(\\frac{(\\mu_1 - \\mu_2)^\\top (x - b) }{\\|\\mu_1 - \\mu_2\\|_2} \\right)\n",
"$$\n",
"with $b = \\frac{1}{2}(\\mu_1 + \\mu_2)$.\n",
"\n",
"d. Report 5-fold cross validation error for both classifiers.\n",
"\n",
"## [DIGITS dataset classifer, 5 points]\n",
"\n",
"Load the DIGITS dataset:\n",
"```python\n",
"from sklearn import datasets\n",
"digits = datasets.load_digits()\n",
"```\n",
"This dataset contains $1797$ samples of ten handwritten digit classes. You can further query and visualize the dataset using the various attributes of the returned dictionary:\n",
"```python\n",
"data = digits.data\n",
"print(data.shape)\n",
"target_names = digits.target_names\n",
"print (target_names)\n",
"import matplotlib.pyplot as plt\n",
"y = digits.target\n",
"plt.matshow(digits.images[0])\n",
"plt.show()\n",
"```\n",
"\n",
"a. Use **new_classifier()** designed previously to do binary classification between classes representing digits \"*5*\" and \"*8*\".\n",
"\n",
"b. Investigate an alternative feature function as described below:\n",
"\n",
"1. Scale each pixel value to range $[0, 1] $ from original gray-scale ($0-255$). \n",
"2. Compute variance of each row and column of the image. This will give you a new feature vector of size $16$ i.e. \n",
"\n",
"$$ \n",
"x' = \\left[ \\; Var(row_1) , Var(row_2), \\ldots , Var(row_{8}), Var(col_1), \\ldots, Var(col_{8}) \\;\\right]^T\n",
"$$\n",
"\n",
"c. Report $5$-fold cross validation results for parts $(a)$ and\n",
"$(b)$ in a single table. What can you say about the results?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}