This post is a follow-up to the [[Gaussian Processes from A to Z]] article. You are welcome to read it before continuing with the Neural Process Family.
### The ideal solution
The Gaussian Process uses a set of observations to condition a prior distribution over functions. An ideal solution would not require any prior knowledge of the underlying stochastic process, or any assumptions on it, but would instead infer it from the data.
Moreover, as Gaussian Processes are heavily limited in scalability and inefficient for high-dimensional or very large datasets, an ideal method would scale well and be applicable to complex problems, as neural networks are. Simply put, such a solution would combine the fitting capability and scalability of neural networks with the conditioning on observations and the uncertainty modelling of Gaussian Processes.
> The framework of stochastic processes is appealing, because Bayes' rule allows one to reason consistently about the predictive distribution over $f$ imposed by observing $O$ under a set of probabilistic assumptions. This allows the model to be data efficient, an uncommon characteristic in most deep learning models. [^1]
[^1]: [Conditional Neural Processes (Garnelo et al.)](https://arxiv.org/abs/1807.01613)
A recent family of algorithms, introduced as the *Neural Process Family*, some of which can be seen as stochastic processes, takes the best of both worlds from Neural Networks and Gaussian Processes. In their general form, Neural Processes efficiently learn a distribution over functions from data, which can be thought of as prior knowledge that is then narrowed down at test time by exploiting observations. Like GPs, they provide a measure of uncertainty and allow sampling from a distribution, but unlike them, they do not need prior beliefs about the stochastic process. Like Neural Networks, they can effectively model complex functions and are fast at test time, but unlike them, they can quickly adapt to new functions after training.
### How Neural Processes tackle these challenges
The Neural Process Family can be seen through the lens of Meta-Learning, a paradigm in which a model integrates new information at test time while exploiting the shared structure of related tasks, allowing it to learn a new task rapidly and from few examples. In a regression context, a task is defined as a discretized function, where each sample is an $\{x_i, f(x_i)\}$ tuple.
![[Task definition in Meta-Learning]]
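To make the task definition concrete, here is a minimal sketch of sampling one regression task from a hypothetical family of sine functions (the sine family, the parameter ranges, and the context/target sizes are illustrative assumptions, not part of any specific benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n_context=5, n_target=10):
    """Sample one regression task from an assumed family of sine
    functions f(x) = a * sin(x + b); amplitude a and phase b vary per task."""
    a = rng.uniform(0.5, 2.0)    # task-specific amplitude
    b = rng.uniform(0.0, np.pi)  # task-specific phase
    x = rng.uniform(-3.0, 3.0, size=n_context + n_target)
    y = a * np.sin(x + b)
    # Each sample is an {x_i, f(x_i)} tuple; split them into observed
    # context points and target points to be predicted.
    context = list(zip(x[:n_context], y[:n_context]))
    target = list(zip(x[n_context:], y[n_context:]))
    return context, target

context, target = sample_task()
```

In this view, every draw of `(a, b)` yields a new task, and the meta-learner is expected to exploit the structure shared across tasks (here, sinusoidal shape) to fit a new one from only a few context tuples.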
In their general form, Neural Processes condition a posterior distribution over target samples on a global latent variable, which is itself sampled from a distribution conditioned on the context observations. Those observations are embedded into feature space by a shared Neural Network, then aggregated into a permutation-invariant representation. The latter parameterizes a normal distribution, from which the latent variable is sampled. Finally, the entire target set is conditioned on the observations through the latent variable, so that for each new latent variable an entire function can be sampled from a multivariate normal distribution.
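The encode–aggregate–sample–decode pipeline above can be sketched as follows. This is a toy forward pass with random, untrained weights purely to show the data flow; the layer sizes, single-layer "networks", and diagonal-normal parameterization are all simplifying assumptions, not the architecture of any published model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative layer sizes; a real Neural Process would use trained MLPs.
D_IN, D_REPR, D_LATENT = 2, 8, 4
W_enc = rng.normal(size=(D_IN, D_REPR))        # shared encoder weights
W_mu = rng.normal(size=(D_REPR, D_LATENT))     # representation -> latent mean
W_sigma = rng.normal(size=(D_REPR, D_LATENT))  # representation -> latent log-scale
W_dec = rng.normal(size=(1 + D_LATENT, 2))     # (x, z) -> (mean, log-scale)

def encode(x_ctx, y_ctx):
    """Embed each (x_i, y_i) context pair with the shared encoder, then
    mean-aggregate into a single permutation-invariant representation."""
    pairs = np.stack([x_ctx, y_ctx], axis=1)  # shape (n_ctx, 2)
    r_i = np.tanh(pairs @ W_enc)              # per-pair embeddings
    return r_i.mean(axis=0)                   # order of pairs does not matter

def sample_latent(r):
    """Use the aggregated representation to parameterize a diagonal
    normal and draw the global latent variable z."""
    mu, sigma = r @ W_mu, np.exp(r @ W_sigma)
    return mu + sigma * rng.normal(size=D_LATENT)

def decode(x_tgt, z):
    """Condition every target input on the same z, yielding one coherent
    function sample with a predictive mean and scale at each point."""
    inp = np.concatenate([x_tgt[:, None],
                          np.tile(z, (len(x_tgt), 1))], axis=1)
    out = inp @ W_dec
    return out[:, 0], np.exp(out[:, 1])  # per-target mean and std

x_ctx, y_ctx = np.array([-1.0, 0.0, 1.0]), np.array([0.2, 0.5, 0.1])
x_tgt = np.linspace(-2, 2, 5)
r = encode(x_ctx, y_ctx)
mean1, std1 = decode(x_tgt, sample_latent(r))
mean2, std2 = decode(x_tgt, sample_latent(r))  # a different function sample
```

Because every target point shares the same `z`, each latent draw produces a globally coherent function sample rather than independent per-point predictions; drawing `z` again yields a different plausible function, which is how the model expresses uncertainty over functions.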

Even though Neural Processes are a very reasonable middle ground between kernel-based methods such as Gaussian Processes and Neural Networks for regression and classification, the introduction of neural networks costs them the mathematical guarantees that GPs provide. Furthermore, standard NPs cannot accurately model a complex function even when given enough observations to recover the ground truth, unless they incorporate an attention mechanism as in the *Attentive Neural Processes*[^2].
[^2]: [Attentive Neural Processes (Kim et al.)](https://arxiv.org/abs/1901.05761)