Neural Network Simulator is a real feedforward neural network running in your browser. The simulator will help you understand how an artificial neural network trained with the backpropagation algorithm works. In this tutorial, we explain several important concepts and techniques used in the simulator, and then go through all the training steps, explaining the math behind each in more detail, so that you can make the most of the simulator.

##### XOR function

The goal of the training is to learn the XOR function, which is simple but not trivial: it is not linearly separable, so the network needs at least one hidden layer to solve it. To train a network with a hidden layer we can use the backpropagation algorithm, which is used extensively to train neural networks.

$$
\begin{array}{lcr}
x_{1} & x_{2} & y \\
\hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{array}
$$

##### SGD

During training we use a single training example for each forward/backward pass; this technique is called stochastic gradient descent (SGD). It is used in the simulator because the calculations are easier to follow when we deal with only one training example at a time. With SGD the cost function fluctuates from iteration to iteration, as in the figure below.

Figure: SGD fluctuations (source: Wikipedia)

You can observe these fluctuations in the simulator as well. Even after many thousands of iterations you might see significantly different values of the cost (in step 4) for different training examples.
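The idea of using one example per pass can be sketched as follows. This is a minimal illustration, assuming examples are drawn uniformly at random; the simulator may instead cycle through them in order. The names `XOR_EXAMPLES` and `sample_example` are hypothetical.

```python
import random

# XOR training set as ((x1, x2), y) pairs.
XOR_EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sample_example(rng):
    """Pick the single training example used for one forward/backward pass."""
    return rng.choice(XOR_EXAMPLES)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
picked = [sample_example(rng) for _ in range(5)]
```

Because each pass sees a different example, the cost reported in step 4 naturally jumps around from one iteration to the next.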

##### Weights

The knowledge a neural network acquires during training is stored in its weights. The current values of all weights are displayed on the main graph representing the neural network, so you can observe how they change during training. Weights are initialized randomly between -1 and 1.
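A possible initialization, sketched for the network used in this tutorial (2 inputs, 3 hidden neurons, 1 output). The uniform distribution and the helper name `init_weights` are assumptions; the text only says the values lie between -1 and 1.

```python
import random

def init_weights(n_in, n_out, rng):
    """Return an n_in x n_out matrix; w[i][j] connects source node i to target node j."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]

rng = random.Random(42)       # fixed seed only for repeatability
w2 = init_weights(2, 3, rng)  # input -> hidden: 6 weights
w3 = init_weights(3, 1, rng)  # hidden -> output: 3 weights
```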

##### Let's introduce notation

\(x_{j}\): input to neuron \(j\)

\(w_{ij}^{(l)}\): weight from neuron \(i\) in layer \(l-1\) to neuron \(j\) in layer \(l\)

\(a_{j}^{(l)}\): activation of neuron \(j\) in layer \(l\)

\(\delta_{j}^{(l)}\): error in neuron \(j\) in layer \(l\)

\(\sigma(x) = \frac{1}{1 + e^{-x}}\): sigmoid activation function

\(y\): target value

\(E=\frac{1}{2}(y - a_{1}^{(3)})^{2}\): cost function
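The two functions in this notation can be written directly in code. A minimal sketch, using the standard logistic sigmoid \(\frac{1}{1+e^{-x}}\); the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(y, a):
    """Quadratic cost for one training example: E = 1/2 * (y - a)^2."""
    return 0.5 * (y - a) ** 2
```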

##### Forward pass

##### Step 1) Input Layer

The input layer feeds data into the network. The number of nodes in the input layer is determined by the number of features in the training data; in our case there are two nodes, \(x_{1}\) and \(x_{2}\).

##### Step 2) Hidden layer

In this step, we calculate activations of neurons in the hidden layer.

\(a_1^{(2)} = \sigma(w_{11}^{(2)}x_1 + w_{21}^{(2)}x_2)\)

\(a_2^{(2)} = \sigma(w_{12}^{(2)}x_1 + w_{22}^{(2)}x_2)\)

\(a_3^{(2)} = \sigma(w_{13}^{(2)}x_1 + w_{23}^{(2)}x_2)\)
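The three formulas above share one pattern, which can be sketched as a loop over hidden neurons. The weight values are hypothetical and chosen only for illustration; `w2[i][j]` is assumed to hold \(w_{(i+1)(j+1)}^{(2)}\) with 0-based indices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(x, w2):
    """x: [x1, x2]; w2[i][j]: weight from input i to hidden neuron j (0-based)."""
    return [sigmoid(sum(w2[i][j] * x[i] for i in range(len(x))))
            for j in range(len(w2[0]))]

# Hypothetical weights, not taken from the simulator.
w2 = [[0.5, -0.3, 0.8],
      [0.2, 0.7, -0.6]]
a2 = hidden_activations([1, 0], w2)  # only the x1 weights contribute here
```

Note that for the input (0, 0) every hidden activation is \(\sigma(0) = 0.5\), whatever the weights are, because the network has no bias terms.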

##### Step 3) Output layer

Calculate the activation of the neuron in the output layer.

\(a_1^{(3)} = \sigma(w_{11}^{(3)}a_1^{(2)} + w_{21}^{(3)}a_2^{(2)} + w_{31}^{(3)}a_3^{(2)})\)
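In code, the output activation is a single weighted sum passed through the sigmoid. A minimal sketch with hypothetical weights; `w3[j]` is assumed to hold \(w_{(j+1)1}^{(3)}\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_activation(a2, w3):
    """a2: hidden activations; w3[j]: weight from hidden neuron j to the output."""
    return sigmoid(sum(w * a for w, a in zip(w3, a2)))

# Hypothetical values, not taken from the simulator.
w3 = [0.4, -0.9, 0.1]
a3 = output_activation([0.5, 0.5, 0.5], w3)  # weighted sum is about -0.2
```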

##### Step 4) Cost function

The cost function tells us how well the neural network is performing. The simulator uses the quadratic cost function, which is defined for a single training example as follows.

\(E=\frac{1}{2}(y - a_{1}^{(3)})^{2}\)

The cost function gives us a non-negative number, and we want this number to be as small as possible. To put it differently, we want the difference between the output of the neural network and the target value to be as small as possible.
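As a quick worked example, with hypothetical values chosen only for illustration: for a target \(y = 1\), an output of \(a_{1}^{(3)} = 0.8\) gives

\(E=\frac{1}{2}(1 - 0.8)^{2} = 0.02\)

while a worse output of \(a_{1}^{(3)} = 0.6\) gives

\(E=\frac{1}{2}(1 - 0.6)^{2} = 0.08\)

so the cost grows with the square of the distance between the output and the target.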

##### Backpropagation

The goal of backpropagation is to compute \(\frac{\partial E}{\partial w}\), the partial derivative of the cost with respect to each weight, which tells us how changing that weight changes the cost function. To compute \(\frac{\partial E}{\partial w}\) we first need to compute the error \(\delta_{j}^{(l)}\) for every neuron in the network.

##### Step 5) Error in output layer

In the output layer we define \(\delta_{j}^{(l)}\) as follows:

\(\delta_{j}^{(l)} = \frac{\partial E}{\partial z_{j}^{(l)}}\)

\(z_{j}^{(l)}\) is the weighted sum of the inputs to neuron \(j\) in layer \(l\), that is, the argument of its activation function:

\(z_{j}^{(l)} =\sum_{i} w_{ij}^{(l)}a_i^{(l-1)}\)

If we apply the chain rule we get:

\(\delta_{j}^{(l)} = \frac{\partial E}{\partial a_{j}^{(l)}}\frac{\partial a_{j}^{(l)}}{\partial z_{j}^{(l)}}=\frac{\partial E}{\partial a_{j}^{(l)}}\sigma^{\prime}(z_{j}^{(l)})\)

and

\(\sigma^{\prime}(z_{j}^{(l)}) = a_{j}^{(l)}(1-a_{j}^{(l)})\)

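\(\frac{\partial E}{\partial a_{j}^{(l)}} = -(y-a_{j}^{(l)})\)

The simulator drops this minus sign and works with \((y-a_{j}^{(l)})\) instead, so every \(\delta\) and every \(\frac{\partial E}{\partial w}\) that follows is the negative of the true derivative. Steps 9 and 10 compensate by adding these values to the weights rather than subtracting them.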

Putting these equations together for our case, where the output layer is layer 3 and has one neuron, we end up with the formula shown in step 5 of the simulator:

\(\delta_{1}^{(3)}=(y-a_{1}^{(3)})a_{1}^{(3)}(1-a_{1}^{(3)})\)
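The step 5 formula translates into one line of code. A minimal sketch that follows the formula exactly as written above, including its sign convention; the function name is illustrative.

```python
def output_delta(y, a3):
    """delta_1^(3) = (y - a3) * a3 * (1 - a3), as written in step 5."""
    return (y - a3) * a3 * (1.0 - a3)

# Hypothetical values: target 1, output 0.5.
d3 = output_delta(1, 0.5)  # (1 - 0.5) * 0.5 * 0.5
```

The factor \(a(1-a)\) is largest at \(a = 0.5\) and vanishes as the output saturates toward 0 or 1, so a saturated output neuron produces a very small error signal even when it is wrong.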

##### Step 6) Error in the hidden layer

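The error of a hidden neuron is defined differently from the output error in step 5. It is the error of the next layer, propagated back through the connecting weights and multiplied by the derivative of the activation function:

\(\delta_{j}^{(l)}=\left(\sum_{k}\delta_{k}^{(l+1)} w_{jk}^{(l+1)}\right)a_{j}^{(l)}(1-a_{j}^{(l)})\)

Our network has a single output neuron, so the sum reduces to one term.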

As we have three hidden neurons, we compute \(\delta\) for each of them:

\(\delta_{1}^{(2)}=(\delta_{1}^{(3)} w_{11}^{(3)})a_{1}^{(2)}(1-a_{1}^{(2)})\)

\(\delta_{2}^{(2)}=(\delta_{1}^{(3)} w_{21}^{(3)})a_{2}^{(2)}(1-a_{2}^{(2)})\)

\(\delta_{3}^{(2)}=(\delta_{1}^{(3)} w_{31}^{(3)})a_{3}^{(2)}(1-a_{3}^{(2)})\)
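The three hidden errors can be computed in one pass. A minimal sketch with hypothetical numbers; `w3[j]` is assumed to hold the weight from hidden neuron \(j+1\) to the output neuron.

```python
def hidden_deltas(d3, w3, a2):
    """delta_j^(2) = (delta_1^(3) * w_j1^(3)) * a_j^(2) * (1 - a_j^(2))."""
    return [d3 * w3[j] * a2[j] * (1.0 - a2[j]) for j in range(len(a2))]

# Hypothetical values, not taken from the simulator.
d2 = hidden_deltas(0.125, [0.4, -0.9, 0.1], [0.5, 0.5, 0.5])
```

Each hidden error is the output error scaled by the weight that connects the two neurons, so a hidden neuron with a large outgoing weight receives a larger share of the error.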

##### Step 7) Calculate error with respect to weights between hidden and output layer.

Having calculated the error \(\delta\) for the output and hidden neurons, we can finally calculate \(\frac{\partial E}{\partial w}\), defined as:

\(\frac{\partial E}{\partial w_{ij}^{(l)}}=a_{i}^{(l-1)}\delta_{j}^{(l)}\)

For all three weights between the hidden and the output layer:

\(\frac{\partial E}{\partial w_{11}^{(3)}}=a_{1}^{(2)}\delta_{1}^{(3)}\)

\(\frac{\partial E}{\partial w_{21}^{(3)}}=a_{2}^{(2)}\delta_{1}^{(3)}\)

\(\frac{\partial E}{\partial w_{31}^{(3)}}=a_{3}^{(2)}\delta_{1}^{(3)}\)
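In code, step 7 multiplies each hidden activation by the output error. A minimal sketch with hypothetical numbers; the function name is illustrative.

```python
def output_weight_grads(a2, d3):
    """One value per hidden-to-output weight: a_i^(2) * delta_1^(3)."""
    return [a * d3 for a in a2]

# Hypothetical hidden activations and output error.
g3 = output_weight_grads([0.5, 0.25, 1.0], 0.125)
```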

##### Step 8) Calculate error with respect to weights between input and hidden layer.

For all six weights between the input and hidden layer:

\(\frac{\partial E}{\partial w_{11}^{(2)}}=x_{1}\delta_{1}^{(2)}\)

\(\frac{\partial E}{\partial w_{12}^{(2)}}=x_{1}\delta_{2}^{(2)}\)

\(\frac{\partial E}{\partial w_{13}^{(2)}}=x_{1}\delta_{3}^{(2)}\)

\(\frac{\partial E}{\partial w_{21}^{(2)}}=x_{2}\delta_{1}^{(2)}\)

\(\frac{\partial E}{\partial w_{22}^{(2)}}=x_{2}\delta_{2}^{(2)}\)

\(\frac{\partial E}{\partial w_{23}^{(2)}}=x_{2}\delta_{3}^{(2)}\)
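Step 8 can be sketched as a small table of products, pairing input \(x_{i}\) with the error of the hidden neuron \(j\) that the weight \(w_{ij}^{(2)}\) leads to, in line with the general formula in step 7. The numbers are hypothetical.

```python
def hidden_weight_grads(x, d2):
    """g[i][j] = x_i * delta_j^(2), one entry per input-to-hidden weight."""
    return [[xi * dj for dj in d2] for xi in x]

# Hypothetical input and hidden errors.
g2 = hidden_weight_grads([1, 0], [0.0125, -0.028125, 0.003125])
```

With the input (1, 0), only the weights leaving \(x_{1}\) receive a non-zero value; the weights leaving \(x_{2}\) are left unchanged by this training example.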

##### Step 9) Update weights between hidden and output layer.

\(w_{11}^{(3)}:=w_{11}^{(3)} + \frac{\partial E}{\partial w_{11}^{(3)}}\)

\(w_{21}^{(3)}:=w_{21}^{(3)} + \frac{\partial E}{\partial w_{21}^{(3)}}\)

\(w_{31}^{(3)}:=w_{31}^{(3)} + \frac{\partial E}{\partial w_{31}^{(3)}}\)

##### Step 10) Update weights between input and hidden layer.

\(w_{11}^{(2)}:=w_{11}^{(2)} + \frac{\partial E}{\partial w_{11}^{(2)}}\)

\(w_{12}^{(2)}:=w_{12}^{(2)} + \frac{\partial E}{\partial w_{12}^{(2)}}\)

\(w_{13}^{(2)}:=w_{13}^{(2)} + \frac{\partial E}{\partial w_{13}^{(2)}}\)

\(w_{21}^{(2)}:=w_{21}^{(2)} + \frac{\partial E}{\partial w_{21}^{(2)}}\)

\(w_{22}^{(2)}:=w_{22}^{(2)} + \frac{\partial E}{\partial w_{22}^{(2)}}\)

\(w_{23}^{(2)}:=w_{23}^{(2)} + \frac{\partial E}{\partial w_{23}^{(2)}}\)
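Steps 9 and 10 apply the same elementwise rule to a list of three weights and to a 2 x 3 table of weights. A minimal sketch that follows the formulas as written: the computed value is added to the weight, and no learning rate appears, which amounts to a step size of 1. Function names and numbers are hypothetical.

```python
def update_output_weights(w3, g3):
    """Step 9: w := w + g for each hidden-to-output weight."""
    return [w + g for w, g in zip(w3, g3)]

def update_hidden_weights(w2, g2):
    """Step 10: w := w + g for each input-to-hidden weight."""
    return [[w + g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(w2, g2)]

# Hypothetical weights and update values.
new_w3 = update_output_weights([0.5, -0.25], [0.25, 0.25])
new_w2 = update_hidden_weights([[1.0, 2.0], [3.0, 4.0]],
                               [[0.5, 0.5], [-1.0, 0.0]])
```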

##### Model capacity

Observing how the weights change during training is a valuable learning experience. At the beginning, weights are initialized randomly between -1 and 1. With weights this small, the inputs to the sigmoid stay close to zero, where the sigmoid behaves almost like a linear function, so the whole network has little more capacity than a linear network.

After several thousand iterations you might notice that the absolute values of the weights grow. The hidden units start using the nonlinear ranges of the sigmoid, and the capacity grows so that the network can learn the XOR function.
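The near-linear behaviour of the sigmoid around zero can be checked numerically. The sketch below compares the sigmoid with its tangent line at zero, \(0.5 + z/4\); the interval \([-1, 1]\) and the comparison point \(z = 5\) are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tangent_line(z):
    """Tangent to the sigmoid at z = 0: value 0.5, slope 0.25."""
    return 0.5 + z / 4.0

# Largest gap between the sigmoid and the line on [-1, 1], sampled every 0.01.
grid = [i / 100.0 for i in range(-100, 101)]
max_gap_small = max(abs(sigmoid(z) - tangent_line(z)) for z in grid)

# Gap far from zero, where the sigmoid saturates.
gap_large = abs(sigmoid(5.0) - tangent_line(5.0))
```

On the small interval the gap stays below 0.02, while at \(z = 5\) it exceeds 0.7, which is the nonlinear range the hidden units start to use as the weights grow.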

##### Training

It might take around 100 000 iterations for the neural network to converge. Look at the cost: it should be small for all training examples.
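Steps 1 to 10 can be combined into one training loop. This is a sketch under stated assumptions, not the simulator's actual code: examples are drawn uniformly at random, weights start uniformly in [-1, 1], the update uses the formulas exactly as written in this tutorial (values are added, step size 1), and all function names are hypothetical.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(z):
    # Numerically stable form of 1 / (1 + e^(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, w2, w3):
    """Steps 1-3: return hidden activations and the output activation."""
    a2 = [sigmoid(sum(w2[i][j] * x[i] for i in range(2))) for j in range(3)]
    a3 = sigmoid(sum(w3[j] * a2[j] for j in range(3)))
    return a2, a3

def train_step(x, y, w2, w3):
    """One forward/backward pass on one example; updates weights in place."""
    a2, a3 = forward(x, w2, w3)
    cost = 0.5 * (y - a3) ** 2                                 # step 4
    d3 = (y - a3) * a3 * (1 - a3)                              # step 5
    d2 = [d3 * w3[j] * a2[j] * (1 - a2[j]) for j in range(3)]  # step 6
    for j in range(3):
        w3[j] += a2[j] * d3                                    # steps 7 and 9
        for i in range(2):
            w2[i][j] += x[i] * d2[j]                           # steps 8 and 10
    return cost

def train(iterations, seed=0):
    rng = random.Random(seed)
    w2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w3 = [rng.uniform(-1, 1) for _ in range(3)]
    for _ in range(iterations):
        x, y = rng.choice(XOR)
        train_step(x, y, w2, w3)
    return w2, w3
```

Calling `train(100_000)` and then evaluating `forward` on each row of the XOR table gives the four costs to inspect. Convergence is not guaranteed for every random start, so it is worth trying more than one seed.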

##### Conclusion

Understanding how a neural network is trained using the backpropagation algorithm is a challenging task. The secret can be revealed by looking at the numbers: how activations are computed during the forward pass, how the error is computed and then backpropagated, and how the weights change during training. In a large network it is hard to follow all these computations. The simulator gives you a unique opportunity to see the neural network in “action”. The math used to derive these equations might be complex, but we end up with simple formulas that use only operations like addition and multiplication.

Thx a lot for this great tutorial/simulator.

I have a question: why is there no bias in your simulation?

Thx

Thanks! The only reason there are no bias neurons is to keep it simple; the network has enough capacity to learn the XOR function.