This is a short tutorial on backpropagation and its implementation in Python, C++, and CUDA. The full code for this tutorial can be found here.

Feed Forward

Let us consider a densely connected deep neural network with $\tanh$ activations, defined by the forward pass

$$H_1 = X W_0 + \mathbb{1} b_0, \qquad A_i = \tanh(H_i), \qquad H_{i+1} = A_i W_i + \mathbb{1} b_i, \quad i = 1, \dots, \ell,$$

taking as input $X \in \mathbb{R}^{n \times p_0}$ and outputting $F = H_{\ell+1} \in \mathbb{R}^{n \times p_{\ell+1}}$. Here, $n$ denotes the number of data points, while the weights $W_i \in \mathbb{R}^{p_i \times p_{i+1}}$ and biases $b_i \in \mathbb{R}^{1 \times p_{i+1}}$ represent the parameters of the neural network. Moreover, let us focus on the sum of squared errors loss function

$$\mathcal{L} = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{p_{\ell+1}} \left( F_{ij} - Y_{ij} \right)^2 = \frac{1}{2} \left\| F - Y \right\|_F^2,$$

where $Y \in \mathbb{R}^{n \times p_{\ell+1}}$ corresponds to the output data.
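As an illustration, the feed-forward computation can be sketched in NumPy as follows. This is only a sketch, not the repository's implementation: the function name, the layer sizes, and the random inputs below are made up for the example. The network applies $\tanh$ between layers and leaves the output layer linear.

```python
import numpy as np

def feed_forward(X, Ws, bs):
    """Forward pass of a dense tanh network with a linear output layer.

    Ws[i] has shape (p_i, p_{i+1}) and bs[i] has shape (1, p_{i+1}).
    Returns the output F and the hidden activations A_1, ..., A_l,
    which are reused during backpropagation.
    """
    As = []
    H = X @ Ws[0] + bs[0]          # H_1 = X W_0 + 1 b_0
    for W, b in zip(Ws[1:], bs[1:]):
        A = np.tanh(H)             # A_i = tanh(H_i)
        As.append(A)
        H = A @ W + b              # H_{i+1} = A_i W_i + 1 b_i
    return H, As                   # F = H_{l+1}

# Tiny example: n = 4 samples, layer sizes 3 -> 5 -> 2 (made up).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Ws = [rng.standard_normal((3, 5)), rng.standard_normal((5, 2))]
bs = [np.zeros((1, 5)), np.zeros((1, 2))]
F, As = feed_forward(X, Ws, bs)
```

Keeping the intermediate activations `As` around is deliberate: the backward pass below reuses them instead of recomputing the forward pass.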

Back Propagation

The gradient of $\mathcal{L}$ with respect to $F$ is then given by

$$G_\ell := \frac{\partial \mathcal{L}}{\partial F} = F - Y.$$

Using the chain rule, the gradient of $\mathcal{L}$ with respect to $W_\ell$ is given by

$$\frac{\partial \mathcal{L}}{\partial W_\ell} = A_\ell^T G_\ell,$$

and the gradient of $\mathcal{L}$ with respect to $b_\ell$ is given by

$$\frac{\partial \mathcal{L}}{\partial b_\ell} = \mathbb{1}^T G_\ell,$$

where $\mathbb{1} \in \mathbb{R}^{n \times 1}$ is the column vector of ones. The gradient of $\mathcal{L}$ with respect to $A_\ell$ is given by

$$\frac{\partial \mathcal{L}}{\partial A_\ell} = G_\ell W_\ell^T.$$

Consequently, the gradient of $\mathcal{L}$ with respect to $H_\ell$, denoted by $G_{\ell-1}$, is given by

$$G_{\ell-1} := \frac{\partial \mathcal{L}}{\partial H_\ell} = \left( G_\ell W_\ell^T \right) \odot \left( 1 - A_\ell \odot A_\ell \right).$$

Here, we are using the fact that the derivative of $\tanh(x)$ with respect to $x$ is given by $1 - \tanh^2(x)$, so that $\tanh'(H_\ell) = 1 - A_\ell \odot A_\ell$. Moreover, $\odot$ denotes the point-wise product between two matrices. The above procedure can be repeated to give us the backpropagation algorithm: for $i = \ell, \ell-1, \dots, 1$,

$$\frac{\partial \mathcal{L}}{\partial W_i} = A_i^T G_i, \qquad \frac{\partial \mathcal{L}}{\partial b_i} = \mathbb{1}^T G_i, \qquad G_{i-1} = \left( G_i W_i^T \right) \odot \left( 1 - A_i \odot A_i \right),$$

followed by $\frac{\partial \mathcal{L}}{\partial W_0} = X^T G_0$ and $\frac{\partial \mathcal{L}}{\partial b_0} = \mathbb{1}^T G_0$.
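Put together, the backward pass can be sketched in NumPy as follows. Again this is an illustrative helper, not the repository's code; the loss is taken as $\frac{1}{2}\lVert F - Y \rVert_F^2$, activations are $\tanh$, and the output layer is linear.

```python
import numpy as np

def back_prop(X, Y, Ws, bs):
    """Return the output F and the gradients of
    L = 0.5 * ||F - Y||_F^2 with respect to every W_i and b_i."""
    # Forward pass, storing the activations A_i for reuse.
    As = [X]                                    # treat A_0 = X for uniform indexing
    H = X @ Ws[0] + bs[0]
    for W, b in zip(Ws[1:], bs[1:]):
        A = np.tanh(H)
        As.append(A)
        H = A @ W + b
    F = H

    # Backward pass: G runs from G_l = F - Y down to G_0.
    G = F - Y
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for i in range(len(Ws) - 1, 0, -1):
        dWs[i] = As[i].T @ G                    # dL/dW_i = A_i^T G_i
        dbs[i] = G.sum(axis=0, keepdims=True)   # dL/db_i = 1^T G_i (column sums)
        G = (G @ Ws[i].T) * (1.0 - As[i] * As[i])   # G_{i-1}, tanh' = 1 - tanh^2
    dWs[0] = X.T @ G                            # dL/dW_0 = X^T G_0
    dbs[0] = G.sum(axis=0, keepdims=True)       # dL/db_0 = 1^T G_0
    return F, dWs, dbs
```

In practice these gradients would then feed a gradient-descent update such as `Ws[i] -= lr * dWs[i]`.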

Moreover, the gradient of $\mathcal{L}$ with respect to $X$ is given by

$$\frac{\partial \mathcal{L}}{\partial X} = G_0 W_0^T.$$
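This last gradient can be verified numerically with a central finite difference. The sketch below uses a tiny two-layer network with made-up sizes, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Y = rng.standard_normal((4, 2))
W0, b0 = rng.standard_normal((3, 5)), rng.standard_normal((1, 5))
W1, b1 = rng.standard_normal((5, 2)), rng.standard_normal((1, 2))

def loss(X_):
    A1 = np.tanh(X_ @ W0 + b0)          # A_1 = tanh(H_1)
    F = A1 @ W1 + b1                    # F = H_2
    return 0.5 * np.sum((F - Y) ** 2)

# Analytic gradient: dL/dX = G_0 W_0^T.
A1 = np.tanh(X @ W0 + b0)
G1 = (A1 @ W1 + b1) - Y                 # G_1 = dL/dF = F - Y
G0 = (G1 @ W1.T) * (1.0 - A1 * A1)      # G_0 = dL/dH_1
dX = G0 @ W0.T

# Central finite difference of a single entry of X.
eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[0, 0] += eps
Xm[0, 0] -= eps
numeric = (loss(Xp) - loss(Xm)) / (2 * eps)
```

The analytic entry `dX[0, 0]` and the finite-difference estimate `numeric` should agree to several decimal places.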

All data and code are publicly available on GitHub.