{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "$\\qquad$ $\\qquad$$\\qquad$ **TDA 231 Machine Learning: Homework 2** \n", "$\\qquad$ $\\qquad$$\\qquad$ **Goal: Classification**
\n", "$\\qquad$ $\\qquad$$\\qquad$ **Grader: Divya** \n", "$\\qquad$ $\\qquad$$\\qquad$ **Due Date: 23/4**
\n", "$\\qquad$ $\\qquad$$\\qquad$ **Submitted by: Name, Personal no., email** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General guidelines:\n", "* All solutions to theoretical problems can be submitted as a single file named *report.pdf*. They can also be submitted in this ipynb notebook, but equations, wherever required, should be formatted using LaTeX math mode.\n", "* All discussion of practical problems, along with solutions and plots, should be included in this notebook. We will not regenerate the solutions or plots by running your code.\n", "* Your name, personal number and email address should be specified above and also in your file *report.pdf*.\n", "* All datasets can be downloaded from the course website.\n", "* All tables and other additional information should be included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Theoretical problems\n", "\n", "## [Naive Bayes Classifier, 6 points]\n", "\n", "A psychologist does a small survey on ''happiness''. Each respondent provides a vector with entries 1 or 0, indicating whether they answered “yes” or “no” to each question, respectively. The question vector has attributes \n", "$$\n", "x = (\\mbox{rich, married, healthy}) \\tag{1}\n", "$$\n", "\n", "Thus a response $(1, 0, 1)$ would indicate that the respondent was\n", "''rich'', ''unmarried'' and ''healthy''. In addition, each respondent\n", "gives a value $c = 1$ if they are content with their life and $c = 0$\n", "if they are not. The following responses were obtained.\n", "\n", "$$\n", "c = 1: (1, 1, 1),(0, 0, 1),(1, 1, 0),(1, 0, 1) \\\\\n", "c = 0: (0, 0, 0),(1, 0, 0),(0, 0, 1),(0, 1, 0)\n", "$$\n", "\n", "1. Using naive Bayes, what is the probability that a person who is ''not rich'', ''married'' and ''healthy'' is ''content''?\n", "\n", "2. What is the probability that a person who is ''not rich'' and ''married'' is ''content''? (i.e. 
we do not know if they are ''healthy'')\n", "\n", "## [Extending Naive Bayes, 4 points]\n", "\n", "Consider now the following vector of attributes:\n", "\n", "* $x_1 = 1$ if the customer is younger than 20, and 0 otherwise.\n", "* $x_2 = 1$ if the customer is between 20 and 30 in age, and 0 otherwise.\n", "* $x_3 = 1$ if the customer is older than 30, and 0 otherwise.\n", "* $x_4 = 1$ if the customer walks to work, and 0 otherwise.\n", "\n", "Each vector of attributes has a label ''rich'' or ''poor''. Point out potential difficulties in training with your naive Bayes approach above. Suggest and describe how to extend your naive Bayes method to this dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical problems\n", "\n", "## [Bayes classifier, 5 points]\n", "\n", "Download the dataset **\"dataset2.txt\"**. You can use the following code, for example:\n", "```python\n", "from numpy import genfromtxt\n", "data = genfromtxt('dataset2.txt', delimiter=',')\n", "labels = data[:, -1]\n", "```\n", "\n", "The dataset contains 3-dimensional data, $X$, generated from 2 classes with labels $y$, either $+1$ or $-1$. Each row of $X$ and $y$ contains one observation and one label, respectively. There are 1000 instances of each class. \n", "\n", "a. Assume that the class conditional density is a spherical Gaussian and that both classes have equal priors. Write the expression for the Bayes (not **naive Bayes**) classifier, i.e. derive\n", "$$\n", "P(y_{new} = -1 | x_{new} , X, y ) \\\\\n", "P(y_{new} = +1 | x_{new} , X, y ) ~.\n", "$$\n", "\n", "It is useful to note that the dependence on the training data $X, y$ for class 1 can be expressed as: \n", "\n", "$$ \n", "P( x_{new} | y_{new} = 1, X, y) = P(x_{new} |\n", "\\hat{\\mu}_{1}, \\hat{\\sigma}^{2}_{1})\n", "$$\n", "\n", "where $\\hat{\\mu}_{1} \\in \\mathbb{R}^3$ and $\\hat{\\sigma}^{2}_{1}\\in \\mathbb{R}$ are the maximum likelihood estimates of the mean (3-dimensional) and variance based on training data with label $+1$ (and similarly for class 2 with label $-1$). \n", "\n", "b. 
Implement a function **sph_bayes()** that computes the probability of a new test point *Xtest* coming from class 1 (*P1*) and class 2 (*P2*). Finally, assign a label *Ytest* to the test point based on the probabilities *P1* and *P2*.\n", "\n", "```python\n", "def sph_bayes(Xtest, ...): # other parameters needed.\n", "\n", "    return [P1, P2, Ytest]\n", "```\n", "\n", "c. Write a function **new_classifier()**\n", "\n", "```python\n", "def new_classifier(Xtest, mu1, mu2):\n", "\n", "    return [Ytest]\n", "```\n", "\n", "which implements the following classifier,\n", "$$\n", "f(x) = \\mbox{sign}\\left(\\frac{(\\mu_1 - \\mu_2)^\\top (x - b) }{\\|\\mu_1 - \\mu_2\\|_2} \\right)\n", "$$\n", "with $b = \\frac{1}{2}(\\mu_1 + \\mu_2)$.\n", "\n", "d. Report the 5-fold cross validation error for both classifiers.\n", "\n", "## [DIGITS dataset classifier, 5 points]\n", "\n", "Load the DIGITS dataset:\n", "```python\n", "from sklearn import datasets\n", "digits = datasets.load_digits()\n", "```\n", "\n", "This dataset contains 1797 samples of ten handwritten digit classes. You can further query and visualize the dataset using the various attributes of the returned dictionary:\n", "```python\n", "data = digits.data\n", "print(data.shape)\n", "target_names = digits.target_names\n", "print(target_names)\n", "import matplotlib.pyplot as plt\n", "y = digits.target\n", "plt.matshow(digits.images[0])\n", "plt.show()\n", "```\n", "\n", "a. Use **new_classifier()** designed previously to do binary classification between classes representing digits \"*5*\" and \"*8*\".\n", "\n", "b. Investigate an alternative feature function as described below:\n", "\n", "1. Scale each pixel value to the range [0, 1] from the original gray-scale (0-255). \n", "2. Compute the variance of each row and column of the image. This gives a new feature vector of size 16, i.e. \n", "\n", "$$ \n", "x' = \\left[ \\; Var(row_1) , Var(row_2), \\ldots , Var(row_{8}), Var(col_1), \\ldots, Var(col_{8}) \\;\\right]^T\n", "$$\n", "\n", "c. 
Report $5$-fold cross validation results for parts $(a)$ and\n", "$(b)$ in a single table. What can you say about the results?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }