Regression evaluation example

The following code shows a simple example of how to evaluate the predictions of a regression model.

We use the same dataset as in the first exercise. The task is to predict the surface area of a protein. You can download the dataset from the UCI machine learning repository.

In [1]:
import numpy as np

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

We read the dataset and split it into training and test parts. We use NumPy's utility function np.loadtxt to read the CSV file. The first column of the data is the output that we want to predict, and the rest of the columns are the features.

In [2]:
alldata = np.loadtxt('datasets/CASP.csv', skiprows=1, delimiter=',')

Yall = alldata[:,0]
Xall = alldata[:,1:]

Xtrain, Xtest, Ytrain, Ytest = train_test_split(Xall, Yall, test_size=0.2, random_state=0)
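
To make the loading and splitting steps easier to follow, here is a minimal sketch that runs on a tiny in-memory CSV instead of the real file. The CSV contents and the column names (target, f1, f2) are made up for illustration; only the layout (a header row, with the output in the first column) mirrors the description above.

```python
import io

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a CSV file: a header row, then the output
# in column 0 followed by two feature columns.
csv_text = """target,f1,f2
1.0,10.0,100.0
2.0,20.0,200.0
3.0,30.0,300.0
4.0,40.0,400.0
5.0,50.0,500.0
"""

# skiprows=1 skips the header line; delimiter=',' splits each row on commas.
data = np.loadtxt(io.StringIO(csv_text), skiprows=1, delimiter=',')

Y = data[:, 0]    # first column: the value we want to predict
X = data[:, 1:]   # remaining columns: the features

# test_size=0.2 holds out 20% of the rows for testing (1 of 5 here).
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

print(data.shape, X.shape, Y.shape)
print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the split reproducible, so repeated runs evaluate on the same held-out rows.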

We create a regression model. We don't need a vectorizer here, since the data is already in a numerical format.

You can play around and try a few different regressors. This is a complex nonlinear prediction problem, so a nonlinear model such as the RandomForestRegressor (this model will be discussed in the next lecture) works better than a linear regression model, e.g. Ridge.

In [3]:
from sklearn.tree import DecisionTreeRegressor

# Seed NumPy's global random generator so that the results are repeatable.
np.random.seed(0)

# Uncomment one of the alternatives below to try a different regressor.
reg_model = RandomForestRegressor(n_estimators=100)
#reg_model = Ridge()
#reg_model = DecisionTreeRegressor()
#reg_model = ExtraTreesRegressor()
#reg_model = GradientBoostingRegressor(n_estimators=500)
#reg_model = MLPRegressor()
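
The gap between linear and nonlinear models can be illustrated without the protein data. The sketch below uses a synthetic dataset (an assumption made for illustration, not part of the exercise) in which the output is a sum of squared features, so no straight-line relationship exists. It then compares Ridge with a random forest on held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, purely illustrative data: y depends on the squares of the
# two features, plus a little Gaussian noise.
rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.05, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = [
    ('Ridge', Ridge()),
    ('RandomForest', RandomForestRegressor(n_estimators=100, random_state=0)),
]

# Train each model on the same split and record its R2 on the test part.
scores = {}
for name, model in models:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
    print(name, 'R2 =', scores[name])
```

On data like this, the linear model has almost nothing to fit, while the forest can approximate the curved surface. How large the gap is on a real dataset depends on the data.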

We train the regression model, and then evaluate the quality of the predictions using two metrics: the mean squared error (MSE) and $R^2$, the coefficient of determination.

The $R^2$ score has the advantage that it doesn't depend on the measurement scale: the score of a perfect regression model is always 1.0, a model that always predicts the mean of the test outputs scores 0, and a model that does worse than that gets a negative score. With an $R^2$ of about 0.68, our regressor is doing a fairly decent job in this case.

In [4]:
reg_model.fit(Xtrain, Ytrain)
Yguess = reg_model.predict(Xtest)

print('MSE =', mean_squared_error(Ytest, Yguess))
print('R2 =', r2_score(Ytest, Yguess))
MSE = 11.937809558494063
R2 = 0.6808103071299272
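
To make the two metrics concrete, the following sketch computes them by hand with NumPy and checks the results against scikit-learn. The four values in y_true and y_pred are invented for illustration. $R^2$ is computed as 1 minus the ratio between the squared error of the predictions and the squared error of a baseline that always predicts the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up gold-standard values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (squared error of the model) / (squared error of the mean baseline).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MSE by hand =', mse, ' sklearn =', mean_squared_error(y_true, y_pred))
print('R2 by hand  =', r2, ' sklearn =', r2_score(y_true, y_pred))

# A baseline that always predicts the mean gets R2 = 0 ...
y_mean = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean)

# ... and predictions that are worse than the baseline get a negative R2.
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)

print('R2 of mean baseline =', r2_mean)
print('R2 of reversed predictions =', r2_bad)
```

Rescaling both y_true and y_pred by the same factor changes the MSE but leaves $R^2$ unchanged, which is what makes $R^2$ easier to compare across datasets.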

The following example shows how MSE and $R^2$ can be used in a cross-validation setup.

Note that we use the negative MSE (scoring='neg_mean_squared_error'), so the reported scores are below zero. The reason is that scikit-learn's cross-validation utilities assume that a higher score means a better model; since a lower MSE is better, its sign is flipped.

In [5]:
print(cross_validate(reg_model, Xtrain, Ytrain, scoring='neg_mean_squared_error', return_train_score=False, cv=5))

print(cross_validate(reg_model, Xtrain, Ytrain, scoring='r2', return_train_score=False, cv=5))
{'fit_time': array([23.58613586, 23.80734873, 23.59993649, 22.78001165, 23.34117341]), 'score_time': array([0.19842649, 0.21207285, 0.19334388, 0.22495651, 0.20536208]), 'test_score': array([-13.12104922, -13.2865412 , -12.84910522, -12.95305421,
       -12.9649182 ])}
{'fit_time': array([23.35762525, 23.50466824, 23.42563605, 23.18065715, 23.06454349]), 'score_time': array([0.20499516, 0.22065163, 0.21537828, 0.21883798, 0.19995952]), 'test_score': array([0.6456363 , 0.64753133, 0.65314891, 0.65442795, 0.65674954])}
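
The raw dictionaries above are a bit hard to read. As a sketch of one way to tidy this up, cross_validate also accepts a list of scorer names, so both metrics can be computed in a single run; the result keys are then named test_ followed by the scorer name. The example below uses a small synthetic dataset from make_regression and a Ridge model (both chosen here only to keep the example fast), flips the sign of the negative MSE, and averages over the folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Small synthetic regression problem, used only to keep the example quick.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Passing a list of scorer names evaluates both metrics on the same folds.
results = cross_validate(Ridge(), X, y,
                         scoring=['neg_mean_squared_error', 'r2'],
                         cv=5)

# Flip the sign to recover the ordinary (positive) MSE for each fold.
mse_per_fold = -results['test_neg_mean_squared_error']
r2_per_fold = results['test_r2']

print('MSE per fold:', mse_per_fold)
print('mean MSE =', mse_per_fold.mean())
print('mean R2  =', r2_per_fold.mean())
```

Evaluating both scorers in one call also avoids training the model twice per fold, which matters for slow models such as the random forest used above.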