This is a short tutorial on the following topics in Deep Learning: Neural Networks, Recurrent Neural Networks, Long Short Term Memory Networks, Variational Auto-encoders, and Conditional Variational Auto-encoders. The full code for this tutorial can be found here.

Neural Networks

Consider the following deep neural network with two hidden layers.

Here, $\mathbf{x}_p \in \mathbb{R}^{N \times 1}$ denotes dimension $p=1,\ldots,P$ of the input data $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_P] \in \mathbb{R}^{N \times P}$. The first hidden layer is given by

$$\mathbf{h}_q^{(1)} = h\left(\sum_{p=1}^{P} \mathbf{x}_p a_{pq}^{(1)} + b_q^{(1)}\right),$$

where $\mathbf{h}_q^{(1)} \in \mathbb{R}^{N\times 1}$ denotes dimension $q=1,\ldots,Q^{(1)}$ of the matrix

$$\mathbf{H}^{(1)} = [\mathbf{h}_1^{(1)}, \ldots, \mathbf{h}_{Q^{(1)}}^{(1)}] \in \mathbb{R}^{N \times Q^{(1)}}.$$

In matrix-vector notation we obtain $\mathbf{H}^{(1)} = h(\mathbf{X} A^{(1)} + b^{(1)})$ with $A^{(1)} = [a_{pq}^{(1)}]_{p=1,\ldots,P,q=1,\ldots,Q^{(1)}}$ being a $P \times Q^{(1)}$ matrix of multipliers and $b^{(1)} = [b_1^{(1)},\ldots,b^{(1)}_{Q^{(1)}}] \in \mathbb{R}^{1 \times Q^{(1)}}$ denoting the bias vector. Here, $h(x)$ is the activation function given explicitly by $h(x) = \tanh(x)$. Similarly, the second hidden layer $\mathbf{H}^{(2)} \in \mathbb{R}^{N \times Q^{(2)}}$ is given by $\mathbf{H}^{(2)} = h(\mathbf{H}^{(1)} A^{(2)} + b^{(2)})$ where $A^{(2)}$ is a $Q^{(1)} \times Q^{(2)}$ matrix of multipliers and $b^{(2)} \in \mathbb{R}^{1 \times Q^{(2)}}$ denotes the bias vector. The output of the neural network $\mathbf{F} \in \mathbb{R}^{N \times R}$ is given by $\mathbf{F} = \mathbf{H}^{(2)} A + b$ with $A \in \mathbb{R}^{Q^{(2)} \times R}$ and $b \in \mathbb{R}^{1 \times R}$. Moreover, we assume a Gaussian noise model

$$\mathbf{y}_r = \mathbf{f}_r + \mathbf{\epsilon}_r, \qquad r = 1, \ldots, R,$$

with mutually independent $\mathbf{\epsilon}_r \sim \mathcal{N}(\mathbf{0}, \sigma_r^2 \mathbf{I})$, where $\mathbf{f}_r$ denotes column $r$ of $\mathbf{F}$. Letting $\mathbf{y}:= [\mathbf{y}_1;\ldots;\mathbf{y}_R] \in \mathbb{R}^{N R \times 1}$, we obtain the following likelihood

$$p(\mathbf{y}) = \mathcal{N}\left(\mathbf{f}, \Sigma \otimes \mathbf{I}_N\right),$$

where $\Sigma = \text{diag}(\sigma_1^2,\ldots,\sigma_R^2)$ and $\mathbf{f}:= [\mathbf{f}_1;\ldots;\mathbf{f}_R] \in \mathbb{R}^{N R \times 1}$. One can train the parameters of the neural network by minimizing the resulting negative log likelihood.
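As a sketch, the forward pass described above and the resulting Gaussian negative log likelihood can be written in a few lines of NumPy (all names, shapes, and the random parameter values below are illustrative, not a reference implementation):

```python
import numpy as np

def forward(X, params):
    """Two-hidden-layer network: F = tanh(tanh(X A1 + b1) A2 + b2) A + b."""
    H1 = np.tanh(X @ params["A1"] + params["b1"])   # N x Q1
    H2 = np.tanh(H1 @ params["A2"] + params["b2"])  # N x Q2
    return H2 @ params["A"] + params["b"]           # N x R

def gaussian_nll(Y, F, sigma2):
    """Negative log likelihood under y_r = f_r + eps_r, eps_r ~ N(0, sigma_r^2 I)."""
    N = Y.shape[0]
    return (0.5 * np.sum((Y - F) ** 2 / sigma2)
            + 0.5 * N * np.sum(np.log(2.0 * np.pi * sigma2)))

rng = np.random.default_rng(0)
P, Q1, Q2, R, N = 1, 8, 8, 1, 32
params = {
    "A1": rng.normal(size=(P, Q1)),  "b1": np.zeros((1, Q1)),
    "A2": rng.normal(size=(Q1, Q2)), "b2": np.zeros((1, Q2)),
    "A":  rng.normal(size=(Q2, R)),  "b":  np.zeros((1, R)),
}
X = rng.normal(size=(N, P))
F = forward(X, params)
Y = F + 0.1 * rng.normal(size=F.shape)   # synthetic noisy observations
nll = gaussian_nll(Y, F, np.array([1.0]))
```

In practice one would minimize `nll` with respect to `params` (and the noise variances) using automatic differentiation.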

Illustrative Example

The following figure depicts a neural network fit to a synthetic dataset generated by random perturbations of a simple one dimensional function.

Neural network fitting a simple one dimensional function.

Recurrent Neural Networks

Let us consider a time series dataset of the form $\{\mathbf{y}_t: t=1,\ldots,T\}$. We can employ the following recurrent neural network

to model the next value $\hat{\mathbf{y}}_t$ of the variable of interest as a function of its own lagged values $\mathbf{y}_{t-1}$ and $\mathbf{y}_{t-2}$; i.e., $\hat{\mathbf{y}}_t = f(\mathbf{y}_{t-1}, \mathbf{y}_{t-2})$. Here, $\hat{\mathbf{y}}_t = \mathbf{h}_t V + c$, $\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$, $\mathbf{h}_{t-1} = \tanh\left(\mathbf{h}_{t-2} W + \mathbf{y}_{t-2} U + b\right)$, and $\mathbf{h}_{t-2} = \mathbf{0}$. The parameters $U,V,W,b,$ and $c$ of the recurrent neural network can be trained by minimizing the mean squared error

$$\text{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left\|\mathbf{y}_t - \hat{\mathbf{y}}_t\right\|^2.$$
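A minimal NumPy sketch of this two-lag unrolling, using untrained, randomly initialized parameters (all names and sizes are illustrative):

```python
import numpy as np

def rnn_predict(y_lags, U, W, V, b, c):
    """Unroll h_{t-2} = 0 -> h_{t-1} -> h_t, then read out y_hat_t = h_t V + c."""
    h = np.zeros((1, W.shape[0]))
    for y in y_lags:                      # oldest lag first: [y_{t-2}, y_{t-1}]
        h = np.tanh(h @ W + y @ U + b)
    return h @ V + c

rng = np.random.default_rng(0)
D, H = 1, 16                              # data dimension, hidden size
U = 0.1 * rng.normal(size=(D, H))
W = 0.1 * rng.normal(size=(H, H))
V = 0.1 * rng.normal(size=(H, D))
b, c = np.zeros((1, H)), np.zeros((1, D))

series = np.sin(np.linspace(0, 4 * np.pi, 100)).reshape(-1, 1)
preds = [rnn_predict([series[t-2:t-1], series[t-1:t]], U, W, V, b, c)
         for t in range(2, len(series))]
mse = np.mean([(series[t] - p) ** 2 for t, p in zip(range(2, len(series)), preds)])
```

Since the parameters are untrained here, `mse` simply evaluates the objective; training would minimize it with gradient descent (backpropagation through time).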
Illustrative Example

The following figure depicts a recurrent neural network (with $5$ lags) learning and predicting the dynamics of a simple sine wave.

Recurrent neural network predicting the dynamics of a simple sine wave.

Long Short Term Memory (LSTM) Networks

A long short term memory (LSTM) network replaces the units $\mathbf{h}_t = \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$ of a recurrent neural network with

$$\mathbf{h}_t = \mathbf{q}_t \odot \tanh(\mathbf{s}_t),$$

where

$$\mathbf{q}_t = \sigma\left(\mathbf{h}_{t-1} W_q + \mathbf{y}_{t-1} U_q + b_q\right)$$

is the output gate and

$$\mathbf{s}_t = \mathbf{f}_t \odot \mathbf{s}_{t-1} + \mathbf{g}_t \odot \tanh\left(\mathbf{h}_{t-1} W + \mathbf{y}_{t-1} U + b\right)$$

is the cell state. Here, $\sigma(x) = 1/(1 + e^{-x})$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication. Moreover,

$$\mathbf{g}_t = \sigma\left(\mathbf{h}_{t-1} W_g + \mathbf{y}_{t-1} U_g + b_g\right)$$

is the external input gate while

$$\mathbf{f}_t = \sigma\left(\mathbf{h}_{t-1} W_f + \mathbf{y}_{t-1} U_f + b_f\right)$$

is the forget gate.
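The gate equations above can be sketched as a single NumPy step function (the parameter names and sizes below are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, s_prev, y_prev, p):
    """One LSTM step: output gate q, external input gate g, forget gate f."""
    q = sigmoid(h_prev @ p["Wq"] + y_prev @ p["Uq"] + p["bq"])  # output gate
    g = sigmoid(h_prev @ p["Wg"] + y_prev @ p["Ug"] + p["bg"])  # external input gate
    f = sigmoid(h_prev @ p["Wf"] + y_prev @ p["Uf"] + p["bf"])  # forget gate
    s = f * s_prev + g * np.tanh(h_prev @ p["W"] + y_prev @ p["U"] + p["b"])  # cell state
    h = q * np.tanh(s)                                           # hidden state
    return h, s

rng = np.random.default_rng(0)
D, H = 1, 8
p = {}
for gate in ["q", "g", "f", ""]:
    p["W" + gate] = 0.1 * rng.normal(size=(H, H))
    p["U" + gate] = 0.1 * rng.normal(size=(D, H))
    p["b" + gate] = np.zeros((1, H))

h, s = np.zeros((1, H)), np.zeros((1, H))
for y in np.sin(np.arange(5)).reshape(-1, 1, 1):   # feed a few lagged values
    h, s = lstm_step(h, s, y, p)
```

The additive update of the cell state `s` is what lets gradients flow over many time steps, which is the point of the gating machinery.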

Illustrative Example

The following figure depicts a long short term memory network (with $10$ lags) learning and predicting the dynamics of a simple sine wave.

Long short term memory network predicting the dynamics of a simple sine wave.

Variational Auto-encoders

Let us start with the prior assumption that

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where $\mathbf{z}$ is a latent variable. Moreover, let us assume

$$p(\mathbf{y}|\mathbf{z}) = \mathcal{N}\left(\mu_2(\mathbf{z}), \Sigma_2(\mathbf{z})\right),$$

where $\mu_2(\mathbf{z})$ and $\Sigma_2(\mathbf{z})$ are modeled as deep neural networks. Here, $\Sigma_2(\mathbf{z})$ is constrained to be a diagonal matrix. We are interested in minimizing the negative log likelihood $-\log p(\mathbf{y})$, where

$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}.$$
However, $-\log p(\mathbf{y})$ is not analytically tractable. To deal with this issue, one could employ a variational distribution $q(\mathbf{z}|\mathbf{y})$ and compute the following Kullback-Leibler divergence; i.e.,

$$\text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z}|\mathbf{y})\right] = \mathbb{E}_{q(\mathbf{z}|\mathbf{y})}\left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{z}|\mathbf{y})\right].$$

Using the Bayes rule

$$p(\mathbf{z}|\mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{z})\, p(\mathbf{z})}{p(\mathbf{y})},$$

one obtains

$$\text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z}|\mathbf{y})\right] = \mathbb{E}_{q(\mathbf{z}|\mathbf{y})}\left[\log q(\mathbf{z}|\mathbf{y}) - \log p(\mathbf{y}|\mathbf{z}) - \log p(\mathbf{z})\right] + \log p(\mathbf{y}).$$

Therefore,

$$\text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z}|\mathbf{y})\right] = \text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z}|\mathbf{y})}\left[\log p(\mathbf{y}|\mathbf{z})\right] + \log p(\mathbf{y}).$$

Rearranging the terms yields

$$-\log p(\mathbf{y}) + \text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z}|\mathbf{y})\right] = -\mathbb{E}_{q(\mathbf{z}|\mathbf{y})}\left[\log p(\mathbf{y}|\mathbf{z})\right] + \text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z})\right].$$

Since the Kullback-Leibler divergence is non-negative, the right hand side is an upper bound on $-\log p(\mathbf{y})$.
A variational auto-encoder proceeds by minimizing the terms on the right hand side of the above equation. Moreover, let us assume that

where $\mu_1(\mathbf{y})$ and $\Sigma_1(\mathbf{y})$ are modeled as deep neural networks. Here, $\Sigma_1(\mathbf{y})$ is constrained to be a diagonal matrix. One can use

to generate samples from

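The reparameterized sampling, together with the closed-form $\text{KL}\left[q(\mathbf{z}|\mathbf{y}) \,\|\, p(\mathbf{z})\right]$ term for a diagonal Gaussian against $\mathcal{N}(\mathbf{0}, \mathbf{I})$, can be sketched in NumPy (the log-variance parameterization is an assumption; in practice the mean and log-variance come from the encoder network):

```python
import numpy as np

def sample_z(mu1, log_var1, rng):
    """Reparameterization: z = mu1(y) + Sigma1(y)^{1/2} * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu1.shape)
    return mu1 + np.exp(0.5 * log_var1) * eps

def kl_to_standard_normal(mu1, log_var1):
    """Closed-form KL[ N(mu1, diag(exp(log_var1))) || N(0, I) ], one value per row."""
    return 0.5 * np.sum(np.exp(log_var1) + mu1 ** 2 - 1.0 - log_var1, axis=1)

rng = np.random.default_rng(0)
mu1 = np.zeros((4, 2))        # 4 data points, 2 latent dimensions (illustrative)
log_var1 = np.zeros((4, 2))   # log-variances of the diagonal Sigma1(y)
z = sample_z(mu1, log_var1, rng)
kl = kl_to_standard_normal(mu1, log_var1)
```

When $\mu_1 = \mathbf{0}$ and $\Sigma_1 = \mathbf{I}$, the KL term vanishes; during training it acts as a regularizer pulling $q(\mathbf{z}|\mathbf{y})$ toward the prior.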
Illustrative Example

The following figure depicts the training data and the samples generated by a variational auto-encoder.

Training data and samples generated by a variational auto-encoder.

Conditional Variational Auto-encoders

Conditional variational auto-encoders, rather than making the assumption that

$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

start by assuming that

$$p(\mathbf{z}|\mathbf{x}) = \mathcal{N}\left(\mu_0(\mathbf{x}), \Sigma_0(\mathbf{x})\right),$$

where $\mathbf{x}$ denotes the conditioning input, and $\mu_0(\mathbf{x})$ and $\Sigma_0(\mathbf{x})$ are modeled as deep neural networks. Here, $\Sigma_0(\mathbf{x})$ is constrained to be a diagonal matrix.
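A minimal sketch of drawing latent samples from this conditional prior, using stand-in functions for the networks $\mu_0(\mathbf{x})$ and $\Sigma_0(\mathbf{x})$ (all weights, shapes, and function names below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dz = 2, 3   # conditioning and latent dimensions (illustrative)

# Stand-ins for the deep networks; real mu_0 and Sigma_0 would be trained.
A0, b0 = rng.normal(size=(Dx, Dz)), np.zeros((1, Dz))
C0, d0 = 0.1 * rng.normal(size=(Dx, Dz)), np.zeros((1, Dz))

def mu0(x):
    """Stand-in for the mean network mu_0(x)."""
    return np.tanh(x @ A0 + b0)

def log_var0(x):
    """Stand-in for the log of the diagonal of Sigma_0(x)."""
    return x @ C0 + d0

x = rng.normal(size=(5, Dx))
eps = rng.normal(size=(5, Dz))
z = mu0(x) + np.exp(0.5 * log_var0(x)) * eps   # z ~ N(mu0(x), Sigma0(x))
```

Generation then decodes `z` together with `x`, so different conditioning inputs yield different distributions over samples.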

Illustrative Example

The following figure depicts the training data and the samples generated by a conditional variational auto-encoder.

Training data and samples generated by a conditional variational auto-encoder.

All data and codes are publicly available on GitHub.