![[maml.png]]
The intuition behind the proposed approach, of learning the parameters of any standard model via meta-learning, is that some internal representations are more transferable than others.
The overall goal of the algorithm is to find model (optimizee) parameters that are *sensitive* to changes in the novel task loss (small changes induce large improvements), on any task drawn from $p(\mathcal{T})$. This should have the effect of **capturing internal features that are broadly applicable to all tasks in $p(\mathcal{T})$**.
![[maml_algo.png]]
As seen in Algorithm 1, the step size $\alpha$ may be fixed or meta-learned. The meta-objective is as follows:
$$
\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i^{\prime}}) =
\sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\big)
$$
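To make the objective concrete, here is a minimal sketch in JAX. The linear model and all names (`task_loss`, `inner_update`, `meta_loss`) are my own stand-ins, not from the paper: each task contributes the loss of the model *after* one inner gradient step on that task's support set.

```python
import jax
import jax.numpy as jnp

def task_loss(theta, batch):
    # hypothetical task loss L_Ti(f_theta): MSE of a stand-in linear model
    x, y = batch
    return jnp.mean((x @ theta - y) ** 2)

def inner_update(theta, support, alpha):
    # theta_i' = theta - alpha * grad_theta L_Ti(f_theta)  (one inner step)
    return theta - alpha * jax.grad(task_loss)(theta, support)

def meta_loss(theta, tasks, alpha):
    # sum over tasks of L_Ti(f_{theta_i'}), each evaluated on the query set
    return sum(task_loss(inner_update(theta, support, alpha), query)
               for support, query in tasks)
```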
**The meta-gradient is a gradient through a gradient: the outer objective is differentiated through each task's inner gradient step.**
The clever trick here is that the meta-optimization is performed over the model parameters $\theta$, whereas the objective is computed using the updated model parameters $\theta^\prime$. This optimizes the model parameters such that one, or a small number of, gradient steps on a new task will produce maximally effective behavior on that task.
Line 8 of Algorithm 1 shows the stochastic gradient descent update for the model parameters $\theta$ across tasks. ==This is the meta-gradient!==
Basically, on line 8 the gradient of the sum of the per-task losses is computed with respect to $\theta$. The losses are computed with the updated model $f_{\theta^\prime}$, which means that $\nabla_\theta$ on line 8 measures how much the loss changes when the model is adapted! The meta-update therefore pushes $\theta$ toward a point from which a single adaptation step improves each task's loss as much as possible, and that is why this paper is so clever. It is like looking into the future, coming back to the present, and choosing the starting point from which one step lands the model near the minimum. Summing over tasks also ensures that the meta-gradient will not greatly improve the loss on one task to the detriment of another.
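Continuing the same toy sketch, the meta-gradient and the line-8 meta-update look as follows. Because `jax.grad` differentiates straight through `inner_update`, the gradient-of-a-gradient comes out of automatic differentiation with no extra code; `make_task`, `beta`, and the toy data are assumptions for illustration.

```python
# toy data: two regression tasks, each with a (support, query) split
key = jax.random.PRNGKey(0)

def make_task(k):
    kw, ks, kq = jax.random.split(k, 3)
    w = jax.random.normal(kw, (3,))          # task-specific ground-truth weights
    xs = jax.random.normal(ks, (8, 3))
    xq = jax.random.normal(kq, (8, 3))
    return (xs, xs @ w), (xq, xq @ w)

tasks = [make_task(k) for k in jax.random.split(key, 2)]

theta = jnp.zeros(3)
alpha, beta = 0.1, 0.01                      # inner and meta step sizes

# jax.grad traces through inner_update, so the second-order
# (Hessian-vector-product) terms of the meta-gradient come for free
meta_grads = jax.grad(meta_loss)(theta, tasks, alpha)
theta = theta - beta * meta_grads            # line 8: meta-update on theta
```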
To compute the meta-gradient, an additional backward pass through $f$ is necessary to compute Hessian-vector products: one backward pass computes the inner gradient $\nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$, and a second one differentiates the outer objective through that gradient. In their experiments, they compare this to a first-order approximation that drops these second-order terms. *Note that a first-order approximation means keeping only the first-order term of the Taylor expansion, not just taking a first derivative!*
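In the same sketch, one common way to express the first-order variant is to stop gradients from flowing through the inner gradient (the helper name is hypothetical, and the paper does not prescribe this exact implementation):

```python
def inner_update_first_order(theta, support, alpha):
    # first-order approximation: the inner gradient is treated as a constant,
    # so the backward pass needs no Hessian-vector product
    g = jax.lax.stop_gradient(jax.grad(task_loss)(theta, support))
    return theta - alpha * g
```

Swapping this into `meta_loss` gives the cheaper approximation: the meta-gradient then reduces to the gradient of each task's loss evaluated at the adapted parameters $\theta_i^\prime$.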