Modern datasets are rapidly growing in size and complexity, and there is a pressing need to develop new statistical methods and machine learning techniques to harness this wealth of data. This work presents a novel regression framework for encoding massive amount of data into a small number of “hypothetical” data points. While being effective, the resulting model is conceptually very simple and is built upon the seemingly self-contradictory idea of making Gaussian processes “parametric”. This simplicity is important specially when it comes to deploying machine learning algorithms on big data flow engines such as MapReduce and Apache Spark. Moreover, it is of great importance to devise models that are aware of their imperfections and are capable of properly quantifying the uncertainty in their predictions associated with such limitations.

Methodology

To address the most fundamental shortcoming of Gaussian processes, namely the lack of scalability to “big data”, we propose to use two Gaussian processes rather than one;

1) A Gaussian process $u(x)$ in its classical sense whose hyper-parameters are trained using a “hypothetical dataset” and the corresponding negative log marginal likelihood. This Gaussian process is also used for prediction by conditioning on the hypothetical data.

2) A “parametric Gaussian process” $f(x)$ that is used to generate the hypothetical dataset consumed by $u(x)$.

One could think of $f(x)$ as the “producer” of the hypothetical data and $u(x)$ as the “consumer” of such data. Therefore, $u(x)$ never sees the real data, rendering the size of the real dataset irrelevant. In fact, $f(x)$ is the one that sees the real data and transforms it into the hypothetical data consumed by $u(x)$.

Hypothetical Data

At iteration $n$ of the algorithm, let us postulate the existence of some hypothetical dataset $\{\mathbf{z},\mathbf{u}_n\}$ with

Here, $\mathbf{m}_n$ is the mean of the hypothetical data and $\mathbf{S}_n$ is the covariance matrix. Moreover, $\mathbf{z} = \{z^i\}_{i=1}^M$, $\mathbf{u}_n = \{u^i_n\}_{i=1}^M$, and $M$ is the size of the hypothetical data. The size $M$ of the hypothetical data is assumed to be much smaller than the size $N$ of the real dataset $\{\mathbf{x},\mathbf{y}\}$. The locations $\mathbf{z}$ of the hypothetical dataset are obtained by employing the k-means clustering algorithm on $\mathbf{x}$ (or a smaller subset of it) and are fixed throughout the algorithm.

Consumer of Hypothetical Data

Let us start by making the prior assumption that

is a zero mean Gaussian process with covariance function $k(x,x';\theta_{n-1})$ which depends on the hyper-parameters $\theta_{n-1}$. The hyper-parameters can be updated from $\theta_{n-1}$ to $\theta_{n}$ by taking a step proportional to the gradient of the negative log marginal likelihood

where $\mathbf{K}_{n-1} = k(\mathbf{z}, \mathbf{z}; \theta_{n-1})$. It is worth highlighting that we are using the mean $\mathbf{m}_n$ of the hypothetical data $\mathbf{u}_n$ in the formula for the negative log marginal likelihood rather than the actual hypothetical data $\mathbf{u}_n$. Moreover, predictions can be made by conditioning on the hypothetical data and obtaining

where $\mathbf{q}^T_n = k(x,\mathbf{z};\theta_n)$ and $\mathbf{K}_n = k(\mathbf{z},\mathbf{z};\theta_n)$. In fact, for prediction purposes, we need to maginalize $\mathbf{u}_n$ out and use

where

and

Producer of Hypothetical Data

Let us define a parametric Gaussian process by the resulting conditional distribution

The mean $\mathbf{m}_n$ and the covariance matrix $\mathbf{S}_n$ of the hypothetical dataset can be updated by employing the posterior distribution

resulting from conditioning on the observed mini-batch of data $\{\mathbf{x}_{n+1},\mathbf{y}_{n+1}\}$ of size $N_{n+1}$; i.e.,

It is worth mentioning that $\mu_n(\mathbf{z}) = \mathbf{m}_n$ and $\Sigma_n(\mathbf{z},\mathbf{z}) = \mathbf{S}_n$. The information corresponding to the mini-batch $\{\mathbf{x}_{n+1}, \mathbf{y}_{n+1}\}$ is now distilled in the parameters $\mathbf{m}_{n+1}$ and $\mathbf{S}_{n+1}$. The algorithm is initialized by setting $\mathbf{m}_0 = \mathbf{0}$ and $\mathbf{S}_0 = k(\mathbf{z},\mathbf{z};\theta_0)$ where $\theta_0$ is some initial set of hyper-parameters. Therefore, initially $f_0(x) = u_0(x)$.

Illustrative Example

Illustrative example: (A) Plotted are 6000 training data generated by random perturbations of a one dimensional function. (B) Depicted is the resulting prediction of the model. The blue solid line represents the true data generating function, while the dashed red line depicts the predicted mean. The shaded orange region illustrates the two standard deviations band around the mean. The red circles depict the resulting mean values for the 8 hypothetical data points after a single pass through the entire dataset while mini-batches of size one are employed per each iteration of the training algorithm. It is remarkable how the training procedure places the mean of the hypothetical dataset on top of the underlying function.

Conclusions

This work introduced the concept of parametric Gaussian processes (PGPs), which is built upon the seemingly self-contradictory idea of making Gaussian processes “parametric”. Parametric Gaussian processes, by construction, are designed to operate in “big data” regimes where one is interested in quantifying the uncertainty associated with noisy data. The effectiveness of the proposed approach was demonstrated using an illustrative example with simulated data and a benchmark dataset in the airline industry with approximately $6$ million records (see here).

Acknowledgements

This work received support by the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013. All data and codes are publicly available on GitHub.