\n", "$\\qquad$ $\\qquad$$\\qquad$ **Goal: Classification**

\n", "$\\qquad$ $\\qquad$$\\qquad$ **Grader: Divya**

\n", "$\\qquad$ $\\qquad$$\\qquad$ **Due Date: 23/4**

\n", "$\\qquad$ $\\qquad$$\\qquad$ **Submitted by: Name, Personal no., email**

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General guidelines:\n", "* All solutions to theoretical problems, can be submitted as a single file named *report.pdf*. They can also be submitted in this ipynb notebook, but equations wherever required, should be formatted using LaTeX math-mode.\n", "* All discussion regarding practical problems, along with solutions and plots should be specified here itself. We will not generate the solutions/plots again by running your code.\n", "* Your name, personal number and email address should be specified above and also in your file *report.pdf*.\n", "* All datasets can be downloaded from the course website.\n", "* All tables and other additional information should be included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Theoretical problems\n", "\n", "## [Naive Bayes Classifier, 6 points]\n", "\n", "A psychologist does a small survey on ''happiness''. Each respondent provides a vector with entries 1 or 0 corresponding to if they answered “yes” or “no” to a question respectively. The question vector has attributes \n", "$$\n", "x = (\\mbox{rich, married, healthy}) \\tag{1}\n", "$$\n", "\n", "Thus a response $(1, 0, 1)$ would indicate that the respondent was\n", "''rich'', ''unmarried'' and ''healthy''. In addition, each respondent\n", "gives a value $c = 1$ if they are content wih their life and $c = 0$\n", "if they’re not. The following responses were obtained.\n", "\n", "$$\n", "c = 1: (1, 1, 1),(0, 0, 1),(1, 1, 0),(1, 0, 1) \\\\\n", "c = 0: (0, 0, 0),(1, 0, 0),(0, 0, 1),(0, 1, 0)\n", "$$\n", "\n", "1. Using naive Bayes, what is the probability that a person is ''not rich'', ''married'' and ''healthy'' is ''content''?\n", "\n", "2. What is the probability that a person who is ''not rich'' and ''married'' is content ? (i.e. 
we do not know whether they are ''healthy''.)

## [Extending Naive Bayes, 4 points]

Consider now the following vector of attributes:

* $x_1 = 1$ if the customer is younger than 20, and 0 otherwise.
* $x_2 = 1$ if the customer is between 20 and 30 in age, and 0 otherwise.
* $x_3 = 1$ if the customer is older than 30, and 0 otherwise.
* $x_4 = 1$ if the customer walks to work, and 0 otherwise.

Each vector of attributes has a label ''rich'' or ''poor''. Point out potential difficulties with training naive Bayes on this dataset using your approach above. Suggest and describe how to extend your naive Bayes method to this dataset.

# Practical problems

## [Bayes classifier, 5 points]

Download the dataset **"dataset2.txt"**. You can use the following code, for example:
```python
from numpy import genfromtxt
data = genfromtxt('dataset2.txt', delimiter=',')
labels = data[:,-1]
```
The dataset contains $3$-dimensional data, $X$, generated from $2$ classes with labels $y$, either $+1$ or $-1$. Each row of $X$ and $y$ contains one observation and one label, respectively. There are $1000$ instances of each class.

a. Assume that the class-conditional density is a spherical Gaussian, and that both classes have equal prior. Write the expression for the Bayes (not **naive Bayes**) classifier, i.e.
derive\n", "$$\n", "P(y_{new} = -1 | x_{new} , X, y ) \\\\\n", "P(y_{new} = +1 | x_{new} , X, y ) ~.\n", "$$\n", "\n", "It is useful to note that the dependence on training data $X, y$ for class $1$ can be expressed as: \n", "\n", "$$ \n", "P( x_{new} | y_{new} = 1, X, y) = P(x_{new} |\n", "\\hat{\\mu}_{1}, \\hat{\\sigma}^{2}_{1})\n", "$$\n", "\n", "where $\\hat{\\mu}_{1} \\in \\mathbb{R}^3$ and $\\hat{\\sigma}^{2}_{1}\\in \\mathbb{R}$ are MLE estimates for mean (3-dimensional) and variance based on training data with label $+1$ (and similarly for class 2 with label $-1$). \n", "\n", "b. Implement a function **sph_bayes()** which computes the probability of a new test point *Xtest* coming from class $1$ ($P1$) and class $2$ ($P2$). Finally, assign a label *Ytest* to the test point based on the probabilities $P1$ and $P2$.\n", "\n", "```python\n", "def sph_bayes(Xtest, ...): # other parameters needed.\n", "\n", " return [P1, P2, Ytest]\n", "```\n", "c. Write a function **new_classifier()**\n", "\n", "```python\n", "def new_classifier(Xtest, mu1, mu2)\n", " \n", " return [Ytest]\n", "```\n", "which implements the following classifier,\n", "$$\n", "f(x) = \\mbox{sign}\\left(\\frac{(\\mu_1 - \\mu_2)^\\top (x - b) }{\\|\\mu_1 - \\mu_2\\|_2} \\right)\n", "$$\n", "with $b = \\frac{1}{2}(\\mu_1 + \\mu_2)$.\n", "\n", "d. Report 5-fold cross validation error for both classifiers.\n", "\n", "## [DIGITS dataset classifer, 5 points]\n", "\n", "Load the DIGITS dataset:\n", "```python\n", "from sklearn import datasets\n", "digits = datasets.load_digits()\n", "```\n", "This dataset contains $1797$ samples of ten handwritten digit classes. 
You can further query and visualize the dataset using the various attributes of the returned dictionary:
```python
data = digits.data
print(data.shape)
target_names = digits.target_names
print(target_names)
import matplotlib.pyplot as plt
y = digits.target
plt.matshow(digits.images[0])
plt.show()
```

a. Use **new_classifier()** designed previously to do binary classification between the classes representing digits "*5*" and "*8*".

b. Investigate an alternative feature function as described below:

1. Scale each pixel value to the range $[0, 1]$ from the original gray-scale range ($0$–$16$ for this dataset).
2. Compute the variance of each row and each column of the $8 \times 8$ image. This will give you a new feature vector of size $16$, i.e.

$$
x' = \left[ \; Var(row_1), Var(row_2), \ldots, Var(row_{8}), Var(col_1), \ldots, Var(col_{8}) \; \right]^T
$$

c. Report $5$-fold cross-validation results for parts $(a)$ and $(b)$ in a single table. What can you say about the results?
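For reference, the fixed-direction classifier of part (c) above is fully specified by its formula; a minimal NumPy sketch (your submitted version may differ, and the means and test point below are made up for illustration):

```python
import numpy as np

def new_classifier(Xtest, mu1, mu2):
    """Project Xtest - b onto the direction mu1 - mu2, where b is the
    midpoint (mu1 + mu2)/2, and return the sign: +1 for class 1, -1 for
    class 2."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    w = mu1 - mu2
    b = 0.5 * (mu1 + mu2)
    score = (np.asarray(Xtest, dtype=float) - b) @ w / np.linalg.norm(w)
    Ytest = np.sign(score)
    return [Ytest]

# Tiny usage example: a point near mu1 is assigned label +1.
mu1 = np.array([1.0, 1.0, 1.0])
mu2 = np.array([-1.0, -1.0, -1.0])
print(new_classifier(np.array([0.9, 1.1, 1.0]), mu1, mu2))  # label +1
```

Geometrically, the decision boundary is the hyperplane through the midpoint of the two class means, orthogonal to the line connecting them.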