# Machine Learning with Scala – Linear Regression

Linear regression is one of the most popular machine learning algorithms. It’s used when we want to predict continuous values, like predicting stock prices. In this tutorial, we will build linear regression model from scratch and train it using gradient descent. We will use Nd4j – linear algebra and signal processing library for JVM . If you are new to basic linear algebra and Nd4j have a look at my tutorial where I discuss these topics.

##### Supervised Machine Learning

Linear regression is a supervised learning algorithm. In supervised learning, we feed machine learning algorithm with pairs of data (x, y), where x is called feature and y is the desired output value, called label. From it, the supervised learning algorithm seeks to build a model that can make predictions.  If you are new to machine learning it might be a bit confusing, so let’s repeat it: machine learning algorithms don’t make predictions but they produce function called model that makes predictions.

##### Linear Regression

Linear regression is used when we want to predict continuous values, as opposed to classification tasks where the goal is to predict a class label which belongs to the predefined list of possibilities. In its simplest form, linear regression model fits the linear function to a set of data points. The form of the linear function is as follows:
$$y= \Theta_1 x + \Theta_0$$

where:

$$\Theta_1$$ is the slope

$$\Theta_0$$ is the x-intercept also called bias

x is the feature

We can rewrite above equation using vector notation, as a dot product of the parameter vector $$\Theta$$ and feature vector x.

$$y = \Theta^Tx$$

##### Cost Function

The cost function is one of the most important concepts in the whole machine learning, simply put it tells us how well the machine learning algorithm is performing. It’s also called objective function because the goal (an objective) of machine learning algorithm is to minimize this function. We often denote cost function by  $$J(\Theta)$$. It is very important to note that the cost function is a function of $$\Theta$$ because $$\Theta$$ is unknown, x and y are given by training data. There are a lot of different cost function used in machine learning, in this tutorial we will use MSE (Mean Squared Error) cost function.

$$J(\Theta)=\frac{1}{n} \sum_{i=0}^n(y_i – (\Theta_1x_i + \Theta_0))^2$$

We want to make the error (the difference between prediction and label y) as small as possible. In an ideal scenario, if our prediction is equal to the label, the difference will be zero, so the value of the cost function will be zero. The errors are squared so that they are positive, they can be added together rather than canceling each other out.

Cost function gives us a positive number, we want this number to be as small as possible, ideally equal to zero.

I mentioned already that the goal of machine learning algorithm is to minimize the cost function, in another word we want to find arguments (in our case parameters $$\Theta$$) to the cost function which yields minimum value, the whole process is called optimization. We will be using gradient descent which is an iterative optimization algorithm. In every iteration, we compute the gradient of the cost function with respect to each parameter and update the parameter of the function via the following.

$$\Theta := \Theta\,\, – \alpha\frac{d}{\partial\Theta}J(\Theta)$$$$\Theta$$ – is the parameter vector, $$\alpha$$ – learning rate, $$J(\Theta)$$ – is a cost function

Now if we calculate the gradient of MSE cost function with respect to parameters we end up with the following update rule:

$$\Theta := \Theta\,\, – \alpha\frac{1}{n} \sum_{i=0}^n(y_i – (\Theta_1x_i + \Theta_0))x_i$$

The reason why we are subtracting gradient is that the gradient gives us the direction of steepest ascent but we want to minimize our function so we need to take the step in the opposite direction of the gradient. Learning rate determines how fast we want to update the parameters during optimization if learning rate is too small, gradient descent can be slow to find the minimum, and if it’s too large gradient descent may not converge (it can overshoot the minima).  Learning rate is a hyperparameter because as opposed to model parameters $$\Theta$$ it’s not learned during training but it needs to be set before training. Learning rate is considered to be the most important hyperparameter. Another example of a hyperparameter is the number of iterations.

##### Data

We will use the dataset which I downloaded from United Nation website, it contains data about the number of Internet users per 100 people from 2001 to 2015.

As you can see from the figure above,  there is very strong positive correlation between year and number of Internet users, which is not surprising.  Our feature vector contains a year and a bias feature which is 1 for all training examples,  correspondingly parameter vector will be as well two dimensional.  Using this dataset we will train linear regression model, which we use for prediction.

##### The Code

Enough theory, now it’s time to code it out. LinearRegression class expose two public methods, fit method which is used to train our model and the predict method to make predictions.

The fit method takes as arguments x which is our feature vector, y is a label,  learning rate and the number of iterations.  It adds bias to our features vector,  then creates $$\Theta$$ vector and initialize it with zeros. Finally, it invokes the computeGradient method which is the core of the whole algorithm, it iteratively performs gradient descent and updates the parameters according to the update rule. At each iteration calculates and prints the cost function.

##### Training

First, we load data from the CSV file, we are using DataVec –  ETL library for machine learning.

We set  hyperparameters learning rate and the number of iterations. Create LinearRegression object and perform training by invoking the fit function.

The output generated during training shows how the error is decreasing with every iteration, which is a good sign. Decaying cost means that our model is learning.

Iteration 0 cost: 161.89946
Iteration 1 cost: 73.558685
Iteration 2 cost: 34.030277
Iteration 3 cost: 16.339197
Iteration 4 cost: 8.417639
Iteration 5 cost: 4.8667493

……

Iteration 498 cost: 0.6028159
Iteration 499 cost: 0.6026632
Iteration 500 cost: 0.60251325
Parameters theta_0 = 3.5366528 theta_1 = 2.5968494

Once our model is trained, we can make the prediction, ask the model how many Internet users will be in 2027.

The predicted number of Internet users (per 100 people) in 2027.0 is 73.65159

##### Conclusion

We created linear regression model from scratch,  along the way, we also introduced several basic machine learning concepts like cost function,  gradient descent, supervised learning.  The full code is available on Github.  If you do not feel like coding, check out linear regression simulator and play with linear regression in your browser.