*This post was originally published by Saurav Jha at Towards Data Science*

As the deep learning community aims to bridge the gap between human and machine intelligence, the need for agents that can adapt to continuously evolving environments is greater than ever. This was evident at ICML 2020, which hosted two different workshop tracks on continual and lifelong learning. As an attendee, I came away with two key takeaways that I believe will shape upcoming developments in the field: (a) *experience replay* (whether real or augmented) is integral to optimal performance, and (b) a temporally evolving agent must be aware of *task semantics*. Throughout this blog post, I will try to shed light on the effectiveness of these traits for continual learning (CL) agents.

*[Although a background on the above terms would help, the post is intended for readers with no prior knowledge of continual learning literature.]*

**A quick refresher:** In a task-incremental setup, a continually learning agent at time step *t* is trained to recognize the tasks *1, …, t-1, t* while the data for the tasks *1, …, t-1* may or may not be available. Such a learning dynamic has two main hurdles to overcome. The first is *forward transfer (FT)*, which measures how learning incrementally up to task *t* influences the agent’s knowledge about it. In terms of performance, a **positive** FT means that the agent delivers better accuracy on task *t* if allowed to learn it incrementally through tasks *1, …, t-1*.

The other desirable feature is *backward transfer (BT)*, which measures the influence that learning a task *t* has on the performance of previous tasks. A **positive** BT means that learning a new task *t* increases the performance of the model on the previously learned tasks *1, …, t-1*. This compromise between learning a new task and preserving the knowledge of previously learned tasks is referred to as the **plasticity-stability** trade-off.
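To make the two quantities concrete, here is a minimal sketch that computes them from a matrix of test accuracies. It assumes one common formulation (the accuracy-matrix metrics used in the Gradient Episodic Memory paper of Lopez-Paz and Ranzato), which is not taken from the workshop papers discussed here. The function names, the matrix `R`, and all the numbers are hypothetical and purely illustrative.

```python
import numpy as np

def backward_transfer(R):
    """Average change in accuracy on earlier tasks after training on all T tasks.

    R[i, j] is the test accuracy on task j after training up to task i.
    """
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

def forward_transfer(R, baseline):
    """Average accuracy on task j just before it is trained, relative to
    baseline[j], the accuracy of a randomly initialised model on task j."""
    T = R.shape[0]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)]))

# Hypothetical accuracies for three tasks (illustrative numbers only).
R = np.array([
    [0.90, 0.55, 0.50],
    [0.80, 0.92, 0.60],
    [0.70, 0.85, 0.91],
])
baseline = np.array([0.50, 0.50, 0.50])

bt = backward_transfer(R)           # negative here: earlier tasks were partly forgotten
ft = forward_transfer(R, baseline)  # positive here: earlier training helped later tasks
```

With these made-up numbers, BT is negative (the agent forgets) while FT is positive, which is the typical situation that the plasticity-stability trade-off describes.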

Three trade-offs for a continual learning agent: Scalability comes into play when a computationally efficient agent is equally desirable.

Based on the steps taken while training on an incremental task, the continual learning literature mainly comprises two categories of agents that handle the aforementioned trade-off: (a) *experience replay-based* agents usually store a finite number of examples (either real or generated) from previous tasks and mix these with the training data of the new task, and (b) *regularisation-based* methods use additional loss terms to consolidate previous knowledge. Keeping these in mind, let us now dive into the real questions!
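The two families can be sketched in a few lines. This is an illustrative sketch under my own assumptions, not the method of any paper cited here: the memory is filled by reservoir sampling (one of several possible selection strategies), and the regulariser is a plain quadratic penalty standing in for more refined importance-weighted terms. The names `ReplayBuffer`, `mixed_batch`, and `consolidation_penalty` are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past examples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.memory = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(example)
        else:
            # Keep each example seen so far with equal probability.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.memory[slot] = example

    def sample(self, k):
        return self.rng.sample(self.memory, min(k, len(self.memory)))


def mixed_batch(new_examples, buffer, replay_size):
    """(a) Replay: mix new-task data with stored examples of old tasks."""
    return list(new_examples) + buffer.sample(replay_size)


def consolidation_penalty(params, old_params, strength):
    """(b) Regularisation: extra loss term that discourages drifting away
    from the parameters learned on previous tasks."""
    return strength * sum((p - q) ** 2 for p, q in zip(params, old_params))


buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("task_1", i))

batch = mixed_batch([("task_2", i) for i in range(3)], buffer, replay_size=2)
penalty = consolidation_penalty([1.0, 2.0], [0.0, 2.0], strength=0.5)
```

The replay agent changes what the model sees in each batch, whereas the regularisation agent changes the objective it minimises; the two are often combined in practice.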

## 1. Why do memory rehearsal-based methods work better?

A spotlight in the field of CL at ICML 2020 was the work of Knoblauch *et al.*, who use set theory to show that an optimal continual learning algorithm needs to solve an NP-hard set intersection decision problem, *i.e.,* given two tasks A and B, it needs to discern the parameters that are common to learning both A and B (A ∩ B). However, determining this is at least as hard as determining whether A ∩ B is empty or not (and can possibly be thought of as a generalization of the Hitting Set Problem?), and the solution requires a perfect memory of previous task examples.

Image source: Knoblauch *et al.* (2020). *Left: an optimal CL algorithm searches for parameters satisfying the task distributions of all observed tasks. Right: replay-based CL algorithms try to find the parameters that satisfy the reconstructed approximation, SAT(Q1:3), of the actual task distributions (SAT1:3).*

Such a perfect memory facilitates the reconstruction of an approximation for the joint distribution over all observed tasks so that the algorithm now effectively learns to solve a single temporally distributed task, *i.e.,* for a time step *t*, this amounts to finding common representations across task distributions spanning over *1:t*. Our work from the CL workshop further advocates for the empirical effectiveness of replay-based methods in the context of human activity recognition [2].

Alongside set theory, the benefit of replay can also be understood through the dynamics of parameter training by treating continual learning as a *credit assignment problem*. As we know, gradient descent works by iteratively updating the parameters of a neural network with the objective of minimizing the overall loss on the training set. The training process can thus be viewed as a tug-of-war game where the objective function pushes the value of each parameter to either increase or decrease, with a larger positive value indicating that the parameter should be assigned more credit and is more important.

Image source: Soccer Coach

At a given incremental time step, we can thus view each task as a team pulling the rope with a tension equivalent to the momentum that the training algorithm requires to minimize the loss on that task. A repercussion of this is that at each incremental step, the model needs to be evaluated on all previous and current tasks so as to balance the tension. If a given task is absent at a particular step, the parameter space of the model will be updated so that it is occupied by the remaining tasks. The simultaneous presence of data from all previous tasks in experience replay-based methods thus helps balance the tension among all sides of the tug-of-war, so that no single task objective fully dominates the training criterion.
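The tug-of-war picture can be reproduced on a toy problem. The sketch below assumes two hypothetical tasks whose losses are simple quadratic bowls with different optima (`opt_a` and `opt_b`); this is my own illustration and not an experiment from the cited papers. Training on task B alone drags the parameters to B's optimum, while training on both tasks together settles at a point where the two pulls cancel.

```python
import numpy as np

def grad(theta, optimum):
    # Gradient of the task loss 0.5 * ||theta - optimum||^2
    return theta - optimum

def loss(theta, optimum):
    return 0.5 * float(np.sum((theta - optimum) ** 2))

def train(theta, optima, lr=0.1, steps=200):
    """Gradient descent on the average loss of the tasks that are present."""
    theta = theta.copy()
    for _ in range(steps):
        pull = sum(grad(theta, o) for o in optima) / len(optima)
        theta -= lr * pull
    return theta

opt_a = np.array([1.0, 0.0])  # hypothetical optimum of task A
opt_b = np.array([0.0, 1.0])  # hypothetical optimum of task B

theta_a = train(np.zeros(2), [opt_a])           # learn task A first
theta_no_replay = train(theta_a, [opt_b])       # then task B, with A absent
theta_replay = train(theta_a, [opt_a, opt_b])   # then task B, with A replayed

loss_a_no_replay = loss(theta_no_replay, opt_a)
loss_a_replay = loss(theta_replay, opt_a)
```

Without replay, task A has no team on the rope and its loss grows as the parameters move to B's optimum; with replay, the parameters end up midway between the two optima and the loss on A stays much lower.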

## 2. How do task semantics affect the performance of a CL agent?

Another highlight from the CL workshop was the work of Ramasesh *et al.* (2020), which investigates how the similarity between tasks influences the degree of forgetting. They conclude that a network forgets the most when the similarity of representations between a previous task and a subsequent task is intermediate.

To understand this, we need to think of continual learning of subsequent tasks in terms of the components of the weight vectors learned by the model. For tasks that are unrelated, the learned weight vectors remain orthogonal to each other, while for those with high similarity, the weight vectors have minimal angular separation. The only component of the weight vector *θ* that is affected by gradient descent training is the one that lies in the training data subspace, while the component orthogonal to the training data subspace is the least affected (see the figure below, adapted from their talk).

The components of the weight vector for a toy linear regression model
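The claim about the two components can be checked numerically for a toy linear regression. This is a minimal sketch under my own assumptions (random data, a squared-error loss, and plain gradient descent), not code from the talk. Because the gradient `X.T @ (X @ theta - y)` is always a combination of the training inputs, only the part of `theta` inside the span of those inputs can move.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))   # 3 training points in 5 dimensions
y = rng.normal(size=3)
theta0 = rng.normal(size=5)   # initial weight vector

# Orthogonal projector onto the row space of X (the training data subspace).
P = np.linalg.pinv(X) @ X

theta = theta0.copy()
for _ in range(500):
    # Gradient descent on 0.5 * ||X @ theta - y||^2
    theta -= 0.01 * X.T @ (X @ theta - y)

delta = theta - theta0
in_subspace_change = float(np.linalg.norm(P @ delta))
orthogonal_change = float(np.linalg.norm((np.eye(5) - P) @ delta))
```

Here `in_subspace_change` is clearly non-zero while `orthogonal_change` stays at the level of floating-point noise. In this linear setting the orthogonal component is exactly untouched; for deep networks the statement holds only approximately.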

Ramasesh *et al.* offer two descriptive CL setups to support their hypothesis. In setup 1, where the model is trained to classify ship-truck as the first task and then either cat-horse or plane-car as the second task, we see that learning cat-horse causes more forgetting of the first task. In setup 2, where the model is first trained to recognize deer-dog-ship-truck, followed by plane-car recognition, the performance degrades the most for ship-truck.

*Two incremental learning setups of Ramasesh et al. (2020): (a) Setup 1 trains the model first on the ship-truck classification problem followed by Task 2, which can be either cat-horse or plane-car classification, (b) Setup 2 trains the model to recognize deer, dog, ship and truck first, followed by plane-car recognition.*

The authors make the point that in setup 1, the model builds its representations only for vehicles, and thus the increasingly dissimilar representations for animals (cat-horse) in the second task cause more forgetting of the previously learned vehicle representations. Setup 2, however, involves training the model simultaneously on vehicles as well as animals, and thus the representations of animals now occupy a different region in the latent space than those of the vehicles. As a result, when the model is presented with the later task, the learned representations for animals are orthogonal to those for plane-car and suffer less degradation.

The rest of this section tries to explain this from a *transfer-interference* point of view. Riemer *et al.* (2019) were the first to look at continual learning from a transfer-interference trade-off viewpoint. To grasp this, let us first look at the limitations of the *stability-plasticity* dilemma. As we saw before, the dilemma states that the stability of the learned model can be improved by reducing forgetting, *i.e.,* it keeps a check on the transfer of weights due to learning the current task while minimizing interference with the shared weights that are important to the previous tasks.

However, since we have limited knowledge of what the future tasks may look like, minimizing the weight sharing for previous tasks tackles only half the problem — a future task that is quite related to one of the previously learned tasks might demand further sharing of these weights and the model must be able to do so without disrupting the performance on the previous tasks. We notice that there is an obvious need to extend the temporal limitations of the stability-plasticity dilemma so as to account for the uncertainty from the future tasks.

The transfer-interference trade-off accounts for the backward interference caused by learning an incremental task while also keeping a check on the transfer of representations among weights so that they do not harm future learning. Riemer *et al.* thus show that tasks learned using the same weight components have a high potential for both interference and transfer between examples, while those learned using dissimilar components experience less transfer and less interference.
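Riemer *et al.* formalise this through the dot product of the loss gradients of two examples: a positive dot product means that a step on one example also helps the other (transfer), and a negative one means that it hurts (interference). The sketch below illustrates this for a hypothetical two-weight linear model with a squared-error loss; the helper names and the four examples are my own assumptions chosen for illustration.

```python
import numpy as np

def example_grad(theta, x, y):
    # Gradient of 0.5 * (theta @ x - y)^2 with respect to theta
    return (theta @ x - y) * x

def relation(theta, ex_i, ex_j):
    """Classify a pair of examples by the sign of their gradient dot product."""
    dot = float(example_grad(theta, *ex_i) @ example_grad(theta, *ex_j))
    if dot > 0:
        return "transfer"
    if dot < 0:
        return "interference"
    return "neither"

theta = np.zeros(2)

# Three examples that rely on the same weight component (the first one) ...
shared_a = (np.array([1.0, 0.0]), 1.0)
shared_b = (np.array([1.0, 0.0]), -1.0)   # conflicting target
shared_c = (np.array([2.0, 0.0]), 3.0)    # compatible target
# ... and one example that relies on a different component (the second one).
separate = (np.array([0.0, 1.0]), 1.0)

same_conflicting = relation(theta, shared_a, shared_b)
same_compatible = relation(theta, shared_a, shared_c)
different_component = relation(theta, shared_a, separate)
```

Examples that share a weight component land in either the transfer or the interference case depending on whether their targets agree, while the example that uses an orthogonal component produces a zero dot product and is affected by neither.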

Keeping the above point of view in mind, let us now look at the two CL setups of Ramasesh *et al.* In setup 1, the ship-truck classification task is dissimilar to the incremental cat-horse task and since the model tries to learn them using the same weight component, the high interference causes larger forgetting of the previous task.

In setup 2, however, we see that the model is forced to have a dissimilar representation for deer-dog than ship-truck. Since the representations for plane-car are more similar to the ship-truck classification task and are to be learned using the same weight component, this catalyzes the transfer of weights between them thus resulting in larger forgetting. On the other hand, the representations for deer and dog have components orthogonal to those of plane and car, and thus are unaffected by the inhibited transfer of weights between these.

**Conclusion:** In short, we saw how a continually learning agent faces a credit assignment problem at each training step and how experience replay reinforces the credibility of each task at hand. Further, the semantics of the tasks play an important role in the amount of forgetting that an agent suffers, and this can be explained from the transfer-interference point of view. As the field continues to move towards large-scale and domain-independent learning, a better understanding of these trade-offs is key to more advanced training strategies such as meta-learners [3].

## References

- Knoblauch, J., Husain, H., & Diethe, T. (2020). Optimal Continual Learning has Perfect Memory and is NP-hard. *ArXiv, abs/2006.05188*.
- Jha, S., Schiemer, M., & Ye, J. (2020). Continual Learning in Human Activity Recognition: an Empirical Analysis of Regularization. *ArXiv, abs/2007.03032*.
- Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., & Tesauro, G. (2019). Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference. *ArXiv, abs/1810.11910*.
- Ramasesh, V., Dyer, E., & Raghu, M. (2020). Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics. *ArXiv, abs/2007.07400*.
