Deep Learning Interview Questions and Answers

Basic questions and answers which will help you brush up your knowledge on deep learning.

Question 1
What is deep learning?

Deep learning is an area of machine learning focused on using deep artificial neural networks (networks containing more than one hidden layer), which are loosely inspired by the brain. The idea dates back to the mid-1960s, when Alexey Grigorevich Ivakhnenko published the first general, working deep learning network. Deep learning is applicable to a range of fields such as computer vision, speech recognition, and natural language processing.

Question 2
Why are deep networks better than shallow ones?
Both shallow and deep networks are capable of approximating any function. For the same level of accuracy, however, deeper networks can be much more efficient in terms of computation and number of parameters. Deeper networks are able to build deep representations: at every layer, the network learns a new, more abstract representation of the input.
Question 3
What is a cost function?

The cost function tells us how well the neural network is performing. Our goal during training is to find parameters that minimize it. As an example of a cost function, consider the Mean Squared Error (MSE):

$$MSE=\frac{1}{n} \sum_{i=1}^n(\hat Y_i - Y_i)^2$$

The mean of squared differences between our prediction \(\hat Y\) and the desired value \(Y\) is what we want to minimize.
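As a small numpy sketch of this formula (the prediction and target arrays are illustrative):

```python
import numpy as np

y_hat = np.array([2.5, 0.0, 2.1])  # predictions
y = np.array([3.0, -0.5, 2.0])     # desired values

# Mean of squared differences, as in the formula above
mse = np.mean((y_hat - y) ** 2)    # approximately 0.17
```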

Question 4
What is gradient descent?

Gradient descent is an optimization algorithm used in machine learning to learn the values of parameters that minimize the cost function. It's an iterative algorithm: in every iteration, we compute the gradient of the cost function with respect to each parameter and update the parameters as follows.

$$\Theta := \Theta - \alpha\frac{\partial}{\partial\Theta}J(\Theta)$$

\(\Theta\) is the parameter vector, \(\alpha\) is the learning rate, and \(J(\Theta)\) is the cost function.

Take a look at the linear regression simulator to see how gradient descent is used to perform linear regression.
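The update rule can be sketched on a one-dimensional problem; the function \(J(\theta) = (\theta - 3)^2\) and the learning rate are illustrative choices, not anything prescribed above:

```python
# Minimize J(theta) = (theta - 3)^2; its gradient is dJ/dtheta = 2 * (theta - 3)
theta = 0.0   # initial parameter value
alpha = 0.1   # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)
    theta = theta - alpha * grad  # the update rule from the formula above

# theta converges towards the minimizer, 3
```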

Question 5
What is backpropagation?

Backpropagation is a training algorithm for multilayer neural networks. It moves error information from the end of the network back to all the weights inside the network and thus allows for efficient computation of the gradient.

The backpropagation algorithm can be divided into several steps:

1. Forward propagation of training data through the network in order to generate output.
2. Use target value and output value to compute error derivative with respect to output activations.
3. Backpropagate to compute the derivative of the error with respect to output activations in the previous layer and continue for all hidden layers.
4. Use the previously calculated derivatives for output and all hidden layers to calculate the error derivative with respect to weights.
5. Update the weights.

The neural network simulator will guide you step-by-step through backpropagation.
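The five steps above can be sketched for a tiny two-layer network in numpy. The layer sizes, the sigmoid hidden activation, and the MSE loss are illustrative assumptions, not the only possible choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 8 samples, 3 features, 1 target
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1 = rng.normal(size=(3, 5)) * 0.5   # input -> hidden weights
W2 = rng.normal(size=(5, 1)) * 0.5   # hidden -> output weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W1, W2):
    return np.mean((sigmoid(X @ W1) @ W2 - y) ** 2)

loss_before = loss(W1, W2)
for _ in range(500):
    # 1. Forward propagation of training data to generate output
    h = sigmoid(X @ W1)
    out = h @ W2
    # 2. Error derivative with respect to the output (MSE loss)
    d_out = 2 * (out - y) / len(X)
    # 3. Backpropagate to the hidden-layer activations
    d_h = (d_out @ W2.T) * h * (1 - h)
    # 4. Error derivatives with respect to the weights
    dW2 = h.T @ d_out
    dW1 = X.T @ d_h
    # 5. Update the weights
    W2 -= 0.1 * dW2
    W1 -= 0.1 * dW1

loss_after = loss(W1, W2)  # smaller than loss_before
```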

Question 6
Explain the following three variants of gradient descent: batch, stochastic, and mini-batch.
Stochastic Gradient Descent

Uses a single training example to calculate the gradient and update the parameters.

Batch Gradient Descent

Calculates the gradient over the whole dataset and performs just one update per iteration.

Mini-batch Gradient Descent

Mini-batch gradient descent is a variation of stochastic gradient descent where, instead of a single training example, a mini-batch of samples is used. It's one of the most popular optimization algorithms.
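The three variants differ only in how much data feeds each update, which can be sketched with a single batch-size knob (the linear model and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit y = w * x with mini-batch gradient descent; the true slope is 2
X = np.arange(10, dtype=float)
y = 2 * X

def grad(w, xb, yb):
    # Gradient of the MSE of a linear model over one mini-batch
    return 2 * np.mean((w * xb - yb) * xb)

w = 0.0
batch_size = 2   # batch_size = 1 is SGD; batch_size = len(X) is batch GD
for epoch in range(50):
    idx = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= 0.01 * grad(w, X[batch], y[batch])

# w converges towards the true slope, 2
```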

Question 7
What are the benefits of mini-batch gradient descent?
  • Computationally more efficient than stochastic gradient descent.
  • Improves generalization by finding flat minima.
  • Improves convergence: by using mini-batches we approximate the gradient of the entire training set, which can help avoid local minima.
Question 8
Provide an example of matrix element-wise multiplication.

Element-wise matrix multiplication takes two matrices of the same dimensions and produces another matrix whose elements are the products of the corresponding elements of matrices A and B.

$$\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{pmatrix} \circ \begin{pmatrix}
b_{11} & b_{12} & b_{13} \\
b_{21} & b_{22} & b_{23} \\
b_{31} & b_{32} & b_{33}
\end{pmatrix} = \begin{pmatrix}
a_{11} b_{11} & a_{12} b_{12} & a_{13} b_{13} \\
a_{21} b_{21} & a_{22} b_{22} & a_{23} b_{23} \\
a_{31} b_{31} & a_{32} b_{32} & a_{33} b_{33}
\end{pmatrix}$$
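In numpy this is simply the `*` operator; a small illustrative sketch:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# In numpy, * is the element-wise (Hadamard) product, not the matrix product
c = a * b   # [[5, 12], [21, 32]]
```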

Question 9
Provide an example of a matrix product.

An example of taking the product of two matrices, A \((2 \times 3)\) and B \((3 \times 2)\):

$$\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{pmatrix} \begin{pmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22} \\
b_{31} & b_{32}
\end{pmatrix} = \begin{pmatrix}
a_{11} b_{11} + a_{12} b_{21} + a_{13} b_{31} & a_{11} b_{12} + a_{12} b_{22} + a_{13} b_{32} \\
a_{21} b_{11} + a_{22} b_{21} + a_{23} b_{31} & a_{21} b_{12} + a_{22} b_{22} + a_{23} b_{32}
\end{pmatrix}$$

The matrix product is defined only when the number of columns in A is equal to the number of rows in B.
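As a quick numeric check of the rule above, a small numpy sketch (the values are illustrative):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])     # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])      # shape (3, 2)

# @ is the matrix product; each entry is a row-by-column dot product
C = A @ B                     # shape (2, 2): [[58, 64], [139, 154]]
```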

Question 10
Provide an example of a matrix transpose.

The transpose of a matrix is a new matrix which is formed by interchanging rows and columns.

$$\begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{pmatrix}^T = \begin{pmatrix}
a_{11} & a_{21} \\
a_{12} & a_{22} \\
a_{13} & a_{23}
\end{pmatrix}$$
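In numpy the transpose is the `.T` attribute; a small illustrative sketch:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

# .T interchanges rows and columns
At = A.T                    # shape (3, 2)
```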

Question 11
What is one hot encoding?

One hot encoding is used to encode categorical features. We create a separate binary feature for each unique value, so that all values are equally different from each other. For example, let's assume we have a feature called color, which can take the values red, blue, and green.

$$\begin{array}{ccc}
\text{Red} & \text{Blue} & \text{Green} \\
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}$$
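A minimal numpy sketch of this encoding (the sample color column and the category order are illustrative):

```python
import numpy as np

categories = ["red", "blue", "green"]       # fixed category order
colors = ["red", "blue", "green", "blue"]   # a sample feature column

# Map each value to its category index, then pick rows of the identity matrix
indices = np.array([categories.index(c) for c in colors])
one_hot = np.eye(len(categories))[indices]  # shape (4, 3), one 1 per row
```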

Question 12
What is data normalization and why do we need it?

Data normalization is a very important preprocessing step, used to rescale values into a specific range to assure better convergence during training. In general, it boils down to subtracting each feature's mean from every data point and dividing by that feature's standard deviation.

Question 13
How should weights be initialized in neural networks?

Weight initialization is a very important step. Bad weight initialization can prevent a network from learning, while good initialization can lead to quicker convergence and better overall error. Biases can generally be initialized to zero. The general rule for setting the weights is to keep them close to zero without being too small.

Question 14
Why is zero initialization not a recommended weight initialization technique?

As a result of setting all weights in the network to zero, all the neurons in each layer produce the same output and receive the same gradients during backpropagation. The network can't learn at all because there is no source of asymmetry between neurons. That is why we need to add randomness to the weight initialization process.
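A common remedy is small random initialization; in the sketch below the layer sizes and the 1/sqrt(fan_in) scaling are illustrative assumptions, not something the text above prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128   # illustrative layer sizes

# Zero initialization (bad): every neuron starts identical, so all
# gradients match and the symmetry is never broken.
W_zero = np.zeros((fan_in, fan_out))

# Small random initialization breaks the symmetry; dividing by
# sqrt(fan_in) keeps the activation variance roughly constant.
W = rng.normal(0.0, 1.0, size=(fan_in, fan_out)) / np.sqrt(fan_in)
b = np.zeros(fan_out)        # biases can safely start at zero
```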

Question 15
What is the role of the activation function?

The goal of an activation function is to introduce non-linearity into the neural network so that it can learn more complex functions. Without it, the neural network would only be able to learn functions that are linear combinations of its input data.

Question 16
Provide some examples of activation functions.

The sigmoid function, also known as the logistic function, is used for binary classification; it's continuous and has an easily calculated derivative. It squashes real numbers into the range (0, 1).

$$ f(x)= \frac{1}{(1 + e^{-x})}$$


Softmax is a generalization of the sigmoid function to the case where we want to handle multiple classes. All output values are in the range (0, 1) and sum up to 1.0 and therefore can be interpreted as probabilities that our input belongs to one of a set of output classes.

Rectified linear units – ReLU

ReLU outputs 0 if the input is less than or equal to 0, and the raw input otherwise; we can think of ReLUs as switches. Biologically inspired, they enable training much deeper networks by backpropagation. ReLU does not suffer from the vanishing gradient problem for positive inputs, and it's very fast to compute. It has been used in convolutional networks more effectively than the widely used logistic function.

$$ f(x)= max(0, x)$$
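The three activations above translate directly into numpy (a minimal, illustrative implementation):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # the outputs are positive and sum to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

def relu(x):
    # 0 for non-positive inputs, the raw input otherwise
    return np.maximum(0, x)
```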

Question 17
What are hyperparameters, provide some examples?

Hyperparameters, as opposed to model parameters, can't be learned from the data; they are set before the training phase.

Learning rate

It determines how fast we update the weights during optimization. If the learning rate is too small, gradient descent can be slow to find the minimum; if it's too large, gradient descent may not converge (it can overshoot the minimum). It's considered to be the most important hyperparameter.

Number of epochs

An epoch is defined as one forward pass and one backward pass over all the training data.

Batch size

The number of training examples in one forward/backward pass.

Question 18
What is model capacity?

The ability to approximate any given function. The higher the model's capacity, the larger the amount of information that can be stored in the network.

Question 19
What is a convolutional neural network?

Convolutional neural networks, also known as CNNs, are a type of feedforward neural network that uses convolution in at least one of its layers. A convolutional layer consists of a set of filters (kernels). Each filter slides across the entire input image, computing the dot product between the weights of the filter and the corresponding patch of the input image. As a result of training, the network learns filters that can detect specific features.

Question 20
What is an autoencoder?

An autoencoder is an artificial neural network able to learn a representation for a set of data (encoding) without any supervision. The network learns by copying its input to the output; typically the internal representation has smaller dimensions than the input vector, so the network has to learn an efficient way of representing the data. An autoencoder consists of two parts: an encoder, which maps the inputs to an internal representation, and a decoder, which converts the internal state back to the outputs.

Question 21
What is dropout?

Dropout is a regularization technique for reducing overfitting in neural networks. At each training step we randomly drop out (set to zero) a set of nodes; thus we create a different model for each training case, and all of these models share weights. It's a form of model averaging.
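A minimal numpy sketch of dropout. The rescaling by 1/(1 - p) (so-called inverted dropout, which keeps the expected activation unchanged at test time) is a common implementation choice, not something the text above prescribes:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    # Inverted dropout: zero each activation with probability p and
    # rescale the survivors so the expected value is unchanged.
    if not training:
        return h          # at test time, activations pass through untouched
    mask = rng.random(h.shape) >= p
    return h * mask / (1 - p)
```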

Question 22
How do we define the cross-entropy cost function?

The cross-entropy cost function is used for classification; it's a natural choice if there is a sigmoid or softmax nonlinearity in the output layer.

$$C=- \frac{1}{n} \sum_{i=1}^n(y_i\ln a_i + (1-y_i)\ln (1-a_i))$$

Here \(a\) represents the output of the neural network, \(y\) the target value, and \(n\) the total number of training examples.
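The formula translates directly into numpy (a minimal sketch for the binary case; outputs are assumed to lie strictly between 0 and 1 so the logarithms are defined):

```python
import numpy as np

def cross_entropy(a, y):
    # a: network outputs in (0, 1); y: binary targets; averaged over n examples
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))
```

Confident, correct predictions yield a small cost; uncertain ones a larger cost.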

Question 23
What are the differences between a feedforward neural network and a recurrent neural network?

A feedforward network allows signals to travel one way only, from input to output. A recurrent neural network, unlike a feedforward network, has recurrent connections. The RNN can be described using this recurrent formula:

$$s_t = f(s_{t-1}, x_t)$$

The state \(s_{t}\) at time \(t\) is a function of the previous state \(s_{t-1}\) and the input \(x_{t}\) at the current time step. A recurrent neural network maintains its internal state \(s_{t}\) by using its own output as part of the input for the next time step; this state vector summarizes the history of the sequence it has seen so far. Recurrent neural networks are Turing complete and can simulate arbitrary programs. Whereas a feedforward network can only map one fixed-size input to one fixed-size output, an RNN can handle sequential data of arbitrary length.
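The recurrence \(s_t = f(s_{t-1}, x_t)\) can be sketched in numpy; taking \(f\) to be a tanh of a linear combination of state and input is a common but here purely illustrative choice, as are the sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3   # illustrative sizes

W_s = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # state -> state
W_x = rng.normal(size=(hidden_size, input_size)) * 0.1   # input -> state

def step(s_prev, x_t):
    # s_t = f(s_{t-1}, x_t): the new state mixes the old state and the input
    return np.tanh(W_s @ s_prev + W_x @ x_t)

s = np.zeros(hidden_size)                     # initial state
for x_t in rng.normal(size=(5, input_size)):  # a sequence of length 5
    s = step(s, x_t)                          # the state feeds back each step
```

Note that the same weights are reused at every time step, which is what lets the network handle sequences of arbitrary length.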

Question 24
What are some limitations of deep learning?
  • Deep learning usually requires large amounts of training data.
  • Deep neural networks are easily fooled.
  • The successes of deep learning are purely empirical; deep learning algorithms have been criticized as uninterpretable “black boxes”.
  • Deep learning thus far has not been well integrated with prior knowledge.

Please leave your comments, suggestions, and feedback.
