Linear regression is one of the most popular machine learning algorithms. It’s used when we want to predict continuous values, like predicting stock prices. In this tutorial, we will build linear regression model from scratch and train it using gradient descent. We will use Nd4j – linear algebra and signal processing library for JVM . If you are new to basic linear algebra and Nd4j have a look at my tutorial where I discuss these topics.

##### Supervised Machine Learning

Linear regression is a supervised learning algorithm. In supervised learning, we feed machine learning algorithm with pairs of data (x, y), where x is called feature and y is the desired output value, called label. From it, the supervised learning algorithm seeks to build a model that can make predictions. If you are new to machine learning it might be a bit confusing, so let’s repeat it: **machine learning algorithms don’t make predictions but they produce function called model that makes predictions**.

##### Linear Regression

Linear regression is used when we want to predict continuous values, as opposed to classification tasks where the goal is to predict a class label which belongs to the predefined list of possibilities. In its simplest form, linear regression model fits the linear function to a set of data points. The form of the linear function is as follows:

$$ y= \Theta_1 x + \Theta_0$$

where:

\(\Theta_1\) is the slope

\(\Theta_0\) is the x-intercept also called bias

x is the feature

We can rewrite above equation using vector notation, as a dot product of the parameter vector \(\Theta\) and feature vector x.

$$y = \Theta^Tx$$

##### Cost Function

The cost function is one of the most important concepts in the whole machine learning, simply put it tells us how well the machine learning algorithm is performing. It’s also called objective function because the goal (an objective) of machine learning algorithm is to minimize this function. We often denote cost function by \(J(\Theta)\). **It is very important to note that the cost function is a function of \(\Theta\) because \(\Theta\) is unknown, x and y are given by training data. **There are a lot of different cost function used in machine learning, in this tutorial we will use MSE (Mean Squared Error) cost function.

$$J(\Theta)=\frac{1}{n} \sum_{i=0}^n(y_i – (\Theta_1x_i + \Theta_0))^2$$

We want to make the error (the difference between prediction and label y) as small as possible. In an ideal scenario, if our prediction is equal to the label, the difference will be zero, so the value of the cost function will be zero. The errors are squared so that they are positive, they can be added together rather than canceling each other out.

Cost function gives us a positive number, we want this number to be as small as possible, ideally equal to zero.

##### Gradient Descent

I mentioned already that the goal of machine learning algorithm is to minimize the cost function, in another word we want to find arguments (in our case parameters \(\Theta\)) to the cost function which yields minimum value, the whole process is called optimization. We will be using gradient descent which is an iterative optimization algorithm. In every iteration, we compute the gradient of the cost function with respect to each parameter and update the parameter of the function via the following.

$$\Theta := \Theta\,\, – \alpha\frac{d}{\partial\Theta}J(\Theta)$$\(\Theta\) – is the parameter vector, \(\alpha\) – learning rate, \(J(\Theta)\) – is a cost function

Now if we calculate the gradient of MSE cost function with respect to parameters we end up with the following update rule:

$$\Theta := \Theta\,\, – \alpha\frac{1}{n} \sum_{i=0}^n(y_i – (\Theta_1x_i + \Theta_0))x_i$$

The reason why we are subtracting gradient is that the gradient gives us the direction of steepest ascent but we want to minimize our function so we need to take the step in the opposite direction of the gradient. Learning rate determines how fast we want to update the parameters during optimization if learning rate is too small, gradient descent can be slow to find the minimum, and if it’s too large gradient descent may not converge (it can overshoot the minima). Learning rate is a hyperparameter because as opposed to model parameters \(\Theta\) it’s not learned during training but it needs to be set before training. Learning rate is considered to be the most important hyperparameter. Another example of a hyperparameter is the number of iterations.

##### Data

We will use the dataset which I downloaded from United Nation website, it contains data about the number of Internet users per 100 people from 2001 to 2015.

As you can see from the figure above, there is very strong positive correlation between year and number of Internet users, which is not surprising. Our feature vector contains a year and a bias feature which is 1 for all training examples, correspondingly parameter vector will be as well two dimensional. Using this dataset we will train linear regression model, which we use for prediction.

##### The Code

Enough theory, now it’s time to code it out. LinearRegression class expose two public methods, fit method which is used to train our model and the predict method to make predictions.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
import org.nd4j.linalg.api.ndarray.INDArray import org.nd4j.linalg.factory.Nd4j import org.nd4j.linalg.ops.transforms.Transforms._ import org.nd4s.Implicits._ import org.nd4s.Evidences.float class LinearRegression { private var computedThetas: INDArray = _ def fit(x: INDArray, y: INDArray, lr: Float, iters: Int): INDArray = { val bias = Nd4j.ones(x.rows(), 1) val xWithBias = Nd4j.concat(1, bias, x) val thetas = Nd4j.zeros(1, xWithBias.columns()).reshape(1, xWithBias.columns()) computedThetas = computeGradient(xWithBias, y, thetas, lr, iters) computedThetas } def predict(x: INDArray): Float = { val bias = Nd4j.ones(x.rows(), 1) val xWithBias = Nd4j.concat(1, bias, x) xWithBias.mmul(computedThetas.T).getFloat(0) } private def computeCost(x: INDArray, y: INDArray, thetas: INDArray): Float = { val cost = pow((x.mmul(thetas.T)) - y, 2) (cost.sum(0)/(2*cost.length)).getFloat(0) } private def computeGradient(x: INDArray, y: INDArray, thetas: INDArray, lr: Float, iterations: Int): INDArray ={ val nbOfTrainingExamples = x.rows val computedThetas = (0 to iterations).foldLeft(thetas)({ case (thetas, i) => val error = x.mmul(thetas.T) - y val grad = error.T.mmul(x)/nbOfTrainingExamples val updatedThetas = thetas - grad*lr println(s"Iteration ${i} cost: ${computeCost(x, y, updatedThetas)}") updatedThetas }) computedThetas } } |

The fit method takes as arguments x which is our feature vector, y is a label, learning rate and the number of iterations. It adds bias to our features vector, then creates \(\Theta\) vector and initialize it with zeros. Finally, it invokes the computeGradient method which is the core of the whole algorithm, it iteratively performs gradient descent and updates the parameters according to the update rule. At each iteration calculates and prints the cost function.

##### Training

First, we load data from the CSV file, we are using DataVec – ETL library for machine learning.

1 2 3 4 5 6 7 8 9 |
val numLinesToSkip = 1 val delimiter = ',' val recordReader = new CSVRecordReader(numLinesToSkip, delimiter) recordReader.initialize(new FileSplit(new ClassPathResource("internet-users.csv").getFile)) val iter:DataSetIterator = new RecordReaderDataSetIterator(recordReader, 1000000,0,0, true) val dataSet: DataSet = iter.next() val x = dataSet.getFeatures() val y = dataSet.getLabels() |

We set hyperparameters learning rate and the number of iterations. Create LinearRegression object and perform training by invoking the fit function.

1 2 3 4 5 6 |
val lr = 0.02f val iterations = 500 val model = new LinearRegression() val params = model.fit(x, y, lr, iterations) println(s"Parameters theta_0 = ${params.getFloat(0)} theta_1 = ${params.getFloat(1)}") |

The output generated during training shows how the error is decreasing with every iteration, which is a good sign. Decaying cost means that our model is learning.

Iteration 0 cost: 161.89946

Iteration 1 cost: 73.558685

Iteration 2 cost: 34.030277

Iteration 3 cost: 16.339197

Iteration 4 cost: 8.417639

Iteration 5 cost: 4.8667493

……

Iteration 498 cost: 0.6028159

Iteration 499 cost: 0.6026632

Iteration 500 cost: 0.60251325

Parameters theta_0 = 3.5366528 theta_1 = 2.5968494

Once our model is trained, we can make the prediction, ask the model how many Internet users will be in 2027.

1 2 3 |
val year = 27f val pred = model.predict(Nd4j.create(Array(year))) println(s"The predicted number of Internet users (per 100 people) in 20${year} is ${pred}") |

The predicted number of Internet users (per 100 people) in 2027.0 is **73.65159**

##### Conclusion

We created linear regression model from scratch, along the way, we also introduced several basic machine learning concepts like cost function, gradient descent, supervised learning. The full code is available on Github. If you do not feel like coding, check out linear regression simulator and play with linear regression in your browser.

*Please leave your comments, suggestion, feedback.*

I guess at some point(year) there will be more then 100/100 users on the Internet 😀

According to the model it will happen in 2037.