Gradient descent for linear regression using PyTorch

We continue with gradient descent algorithms for least-squares regression. This time we use PyTorch instead of plain NumPy. The advantages are:

  • we don't have to do any algebra to derive how to compute the gradients,
  • we don't have to implement the stochastic gradient descent optimization algorithm ourselves, and we can potentially use approaches that are more sophisticated than SGD.
In [1]:
import numpy as np

# import the PyTorch library
import torch

import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'svg' 
plt.style.use('bmh')

Fundamentals of PyTorch

The basic building block in PyTorch is the Tensor. This is conceptually similar to a NumPy array. The main differences between PyTorch tensors and NumPy arrays are:

  • when we make computations, a Tensor keeps track of how it was computed, so that we will later be able to compute gradients.
  • a Tensor can be stored in normal (CPU) memory or on a GPU.

The following example makes a 3-dimensional vector that we represent as a PyTorch tensor. Note that we explicitly state that the numbers in the tensor are of type float (that is, 32-bit floating-point numbers). PyTorch is a bit pickier than NumPy about mixing number types, so we will use this type everywhere.

In [2]:
x = torch.tensor([1, 2, 3], dtype=torch.float) 

# You can explicitly say where the tensor should be stored. Here, "cpu" means 
# that we use the computer's normal memory.
#   x = torch.tensor([1, 2, 3], dtype=torch.float, device='cpu') 

# On the other hand, if you have a GPU on your machine and CUDA is available, 
# you can create a tensor stored in the GPU:
#   x = torch.tensor([1, 2, 3], dtype=torch.float, device='cuda')

# Alternatively, you can move an existing tensor over to the GPU.
#  x = x.to('cuda')
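# If you are unsure whether a GPU is available, a common pattern (not part of
# the original example) is to pick the device dynamically:
#   device = 'cuda' if torch.cuda.is_available() else 'cpu'
#   x = x.to(device)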

This tensor can be inspected easily, just like a NumPy array.

In [3]:
x
Out[3]:
tensor([1., 2., 3.])
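
As a small extra sketch (not one of the original cells), a tensor supports much of the familiar NumPy-style interface, and a tensor that does not require gradients can be converted to a NumPy array directly:

x.shape      # torch.Size([3])
x[0]         # tensor(1.)
x.sum()      # tensor(6.)
x.numpy()    # array([1., 2., 3.], dtype=float32)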

If we pretend that x above is some feature vector, then let's also make a tensor corresponding to a weight vector w. We'll initialize it to all zeros. Instead of listing all the numbers explicitly as above, we use the built-in function zeros to do this.

Note that we write requires_grad=True here. This means that we intend to compute gradients with respect to the weight vector, which we'll need when running optimization algorithms such as gradient descent.

In [4]:
w = torch.zeros(3, requires_grad=True)

# equivalent to
#   w = torch.tensor([0, 0, 0], dtype=torch.float, requires_grad=True)

Again, we can inspect this tensor if necessary.

In [5]:
w
Out[5]:
tensor([0., 0., 0.], requires_grad=True)

We'll pretend that w is a regression model and x is our input feature vector. Then we can compute the model's prediction for x, given the current state of w. This prediction will of course be 0, because all the weights are currently 0.

A small difference between linear algebra operations in NumPy and PyTorch is that NumPy's dot is used for vector-vector dot products, matrix-vector multiplications, and matrix-matrix multiplications. In PyTorch, these three operations are separate: dot, mv, and mm, respectively.

Note that the result is a tensor that remembers that it was computed using a dot product.

In [6]:
# note: in torch, "dot" is between two vectors
# mv for matrix-vector
# mm for matrix-matrix

guess = w.dot(x)


guess
Out[6]:
tensor(0., grad_fn=<DotBackward>)
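
For comparison, here is a small sketch (with a hypothetical 2x3 matrix A that is not used elsewhere in this notebook) of the other two operations mentioned above:

A = torch.tensor([[1, 0, 0], [0, 1, 0]], dtype=torch.float)

A.mv(x)       # matrix-vector product: tensor([1., 2.])
A.mm(A.t())   # matrix-matrix product: here a 2x2 identity matrix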

Assuming that the true desired value is y = 5, we can also compute the value of a squared error loss for the prediction we made before.

Again, the result is the value of the loss (25), plus a "memory" of how this result was computed.

In [7]:
y = 5

squared_error = (y - guess)**2


squared_error
Out[7]:
tensor(25., grad_fn=<PowBackward0>)

Let's compute the gradients of the loss function. This is done by calling backward in PyTorch.

When calling backward, gradients will be computed with respect to all tensors that were involved in the computation of the loss function, if those tensors had requires_grad=True. In our case, this is just the weight vector w.

In [8]:
squared_error.backward()

If we want to see the gradient of the loss with respect to w, it can be accessed (after calling backward) as w.grad.

In [9]:
w.grad
Out[9]:
tensor([-10., -20., -30.])
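
As a quick sanity check (not part of the original notebook), this matches the gradient we would compute by hand: the gradient of the squared error (y - w.dot(x))**2 with respect to w is -2 * (y - w.dot(x)) * x, which at w = 0 gives -2 * 5 * [1, 2, 3] = [-10, -20, -30].

# hand-computed gradient of (y - guess)**2 with respect to w, at w = 0
-2 * (y - 0) * x     # tensor([-10., -20., -30.])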

Implementing the least-squares linear regression training algorithm in PyTorch

To see how the automatic gradient computations and optimization algorithms in PyTorch are used, here again is the SGD-based linear regression training algorithm that we saw before.

The interesting addition here is that we use an optimizer: an algorithm that will make use of the gradients to optimize the objective. SGD is just one of the possible choices. See the description of optimizers in PyTorch's official documentation.

We'll solve it in two different ways: first with a "basic" approach using the notions we introduced above, and then again using a more typical PyTorch programming style. In the first case, the code is almost identical to the minibatch SGD we saw before, just replacing NumPy linear algebra operations with their PyTorch equivalents, and replacing the update part of SGD with the optimizer.
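
Every optimizer is used with the same three-step pattern inside a training loop: reset the gradients, call backward on the loss, and let the optimizer take a step. Here is a minimal, self-contained sketch of that pattern (the names w_demo, loss_demo, and optimizer_demo are just illustrative and are not used in the class below):

w_demo = torch.zeros(3, requires_grad=True)
optimizer_demo = torch.optim.SGD([w_demo], lr=0.1)

loss_demo = torch.sum((torch.tensor([1., 2., 3.]) - w_demo)**2)

optimizer_demo.zero_grad()   # clear any previously accumulated gradients
loss_demo.backward()         # compute gradients of the loss w.r.t. w_demo
optimizer_demo.step()        # update w_demo using its gradient

# w_demo is now tensor([0.2000, 0.4000, 0.6000], requires_grad=True)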

In [10]:
class LeastSquaresRegressorTorch1():

    def __init__(self, n_iter=10, eta=0.1, batch_size=10):
        self.n_iter = n_iter
        self.eta = eta
        self.batch_size = batch_size
        
    def fit(self, X, Y):

        n_instances, n_features = X.shape
        
        # we need to "wrap" the NumPy arrays X and Y as PyTorch tensors
        Xt = torch.tensor(X, dtype=torch.float)
        Yt = torch.tensor(Y, dtype=torch.float)

        # initialize the weight vector to all zeros
        self.w = torch.zeros(n_features, requires_grad=True, dtype=torch.float)

        self.history = []
        
        # we select an optimizer, in this case (minibatch) SGD.
        # it needs to be told what parameters to optimize, and what learning rate (lr) to use
        optimizer = torch.optim.SGD([self.w], lr=self.eta)
        
        # as an alternative to SGD, we could have used adaptive gradient-based optimization
        # algorithms such as Adam. I don't think they give an improvement in this case though,
        # since the objective function is so simple.
        #   optimizer = torch.optim.Adam([self.w], lr=self.eta)
        
        for i in range(self.n_iter):
            
            total_loss = 0
            
            for batch_start in range(0, n_instances, self.batch_size):
                batch_end = batch_start + self.batch_size

                # pick out the batch again, as in the other notebook
                Xbatch = Xt[batch_start:batch_end, :]
                Ybatch = Yt[batch_start:batch_end]
            
                # mv = matrix-vector multiplication in Torch
                G = Xbatch.mv(self.w)
                
                # note the similarities to the NumPy implementation
                Error = G - Ybatch
                loss_batch = torch.sum(Error**2) / n_instances
                
                # we sum up the loss values for all the batches.
                # the item() here is to convert the tensor into a single number
                total_loss += loss_batch.item()
                
                # reset all gradients
                optimizer.zero_grad()                  

                # compute the gradients for the loss for this batch
                loss_batch.backward()

                # for SGD, this is equivalent to w -= learning_rate * gradient as we saw before
                optimizer.step()
                          
            self.history.append(total_loss)
            
        print('SGD-minibatch final loss: {:.4f}'.format(total_loss))

We'll use the same synthetic toy dataset as in the other notebook.

In [11]:
np.random.seed(0)

Xtrain = np.random.normal(size = (1000, 1))
Ytrain = -0.9 * Xtrain[:,0] + 0.4 * np.random.normal(size=Xtrain.shape[0])

Xtest = np.sort(np.random.normal(size = (400, 1)))
Ytest = -0.9 * Xtest[:,0] + 0.4 * np.random.normal(size=Xtest.shape[0])

... and as you can see, we get a similar development of the loss as we iterate a number of times through the training set.

In [12]:
regr = LeastSquaresRegressorTorch1(n_iter=10, eta=1.5, batch_size=10)
regr.fit(Xtrain, Ytrain)

plt.plot(regr.history, '.-');
SGD-minibatch final loss: 0.1502
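
The class above doesn't define a predict method, but the learned weights are stored in regr.w, so we can sketch how to compute predictions and the mean squared error on the held-out test set (this evaluation step is an addition, not part of the original notebook):

Xtest_t = torch.tensor(Xtest, dtype=torch.float)
Ytest_t = torch.tensor(Ytest, dtype=torch.float)

with torch.no_grad():                # no gradients needed when just evaluating
    test_guesses = Xtest_t.mv(regr.w)
    test_mse = torch.mean((test_guesses - Ytest_t)**2).item()

print('test set MSE: {:.4f}'.format(test_mse))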

Second implementation

The second implementation does the same thing as the first one and is quite similar, but we use a slightly more typical PyTorch programming style here. For instance, we use a "linear layer", to use neural network jargon. See the official documentation of the Linear class.

We comment only on the things that are different from the implementation above.
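
To get a feel for what a linear layer does before reading the class, here is a small standalone sketch (x_demo is just an illustrative name, not used below): torch.nn.Linear(n_in, n_out) stores a weight matrix (and optionally a bias vector) as trainable parameters, and calling the layer on an input applies the corresponding linear map.

layer = torch.nn.Linear(3, 1, bias=False)   # 3 inputs, 1 output, no bias term
layer.weight.shape                          # torch.Size([1, 3])

x_demo = torch.tensor([[1., 2., 3.]])       # a batch containing a single instance
layer(x_demo)                               # shape (1, 1): x_demo times the transposed weight matrix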

In [13]:
# The class that represents our learning algorithm inherits from `torch.nn.Module`.
# The purpose of this is that the model can more easily keep track of its parameters in this way.
class LeastSquaresRegressorTorch2(torch.nn.Module):

    def __init__(self, n_iter=10, eta=0.1, batch_size=10):
        super(LeastSquaresRegressorTorch2, self).__init__()
        self.n_iter = n_iter
        self.eta = eta
        self.batch_size = batch_size
               
    def fit(self, X, Y):

        n_instances, n_features = X.shape
        
        Xt = torch.tensor(X, dtype=torch.float)
        Yt = torch.tensor(Y, dtype=torch.float)
                
        self.history = []
        
        # here we use the linear layer instead of just defining a tensor w.
        # this will automatically be registered in this object as a parameter.
        self.w = torch.nn.Linear(n_features, 1, bias=False)
        
        # initialize the weights in the linear layer to zeros
        torch.nn.init.zeros_(self.w.weight)

        # we use SGD here again.
        # note that we called self.parameters(). This gives a list of the parameters 
        # stored in this Module object; in our case, this is the linear layer w.
        optimizer = torch.optim.SGD(self.parameters(), lr=self.eta)
        
        for i in range(self.n_iter):
            
            total_loss = 0
            
            for batch_start in range(0, n_instances, self.batch_size):
                batch_end = batch_start + self.batch_size

                Xbatch = Xt[batch_start:batch_end, :]
                Ybatch = Yt[batch_start:batch_end]
            
                # compute the predictions for the instances in this minibatch.
                # note that we "call" the linear layer instead of explicitly computing the 
                # matrix-vector multiplication.
                G = self.w(Xbatch)
                
                # the linear layer returns predictions of shape (batch size, 1);
                # transposing gives a row that lines up with Ybatch when we subtract
                Error = G.t() - Ybatch
                loss_batch = torch.sum(Error**2) / n_instances
                
                total_loss += loss_batch.item()
                
                optimizer.zero_grad()                
                loss_batch.backward()
                optimizer.step()
                          
            self.history.append(total_loss)
            
        print('SGD-minibatch final loss: {:.4f}'.format(total_loss))

As expected, this gives identical results to what we had above.

In [14]:
regr = LeastSquaresRegressorTorch2(n_iter=10, eta=1.5, batch_size=10)
regr.fit(Xtrain, Ytrain)

plt.plot(regr.history, '.-');
SGD-minibatch final loss: 0.1502