HUN-REN Wigner Research Centre for Physics, Department of Computational Sciences, Computational Systems Neuroscience Lab, Budapest, Hungary
Sequential neural learning faces two challenges with conflicting goals: separating and sharing. First, networks forget previously learned tasks as their globally stored memories are overwritten during training on the current task. However, various mechanisms, such as targeted inhibition or orthogonal weight updates, explicitly separate representations into orthogonal subspaces within a population of neurons, thereby preserving earlier memories. Second, features shared between tasks should be stored once and reused, a primitive form of transfer learning that also minimizes energy and wiring costs. Although these two conflicting learning principles appear to be implemented by similar computations in animal and artificial networks, a deeper understanding of the relation between continual and transfer learning is lacking. To understand how networks jointly develop orthogonal and shared representations, we examined hierarchically structured composite tasks, such as classifying objects in a visual scene or solving a complex cognitive problem. Such tasks can be tackled by a hierarchical neural network that builds up computations and features step by step, a principle famously exploited in deep learning. Hierarchical networks, a characteristic element of mammalian brain architecture, are particularly well suited for examining local feature orthogonalization and feature sharing in a stepwise, controlled manner on realistic data. Here we first show that, when tasks are repeated sequentially, orthogonalized representations gradually develop while currently irrelevant memories are preserved. In particular, lower layers of the hierarchy orthogonalize feature representations early in training, which in turn helps orthogonalize the higher-level category layers. Catastrophic forgetting is thus overcome in hierarchical networks by a spontaneous cascade of orthogonalization, unfolding through the hierarchy in the order in which its features are built. Second, features common to both tasks align and collapse into reusable shared abstractions for the higher layers. Using varying class combinations and tunable overlap between tasks constructed from the handwritten MNIST dataset, we show that, when the complexity of the data is matched to the computational capacity of the network, a combination of complementary orthogonalization and sharing of representations spontaneously resolves catastrophic forgetting within a behaviourally relevant training duration. These results should contribute to understanding early visual neurodevelopment and cognitive computations in the prefrontal cortex.
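To make the notion of layer-wise orthogonalization concrete, the following minimal sketch (not the authors' code) illustrates one way to quantify how orthogonal a layer's representations of two sequentially learned tasks become. It assumes a small two-layer ReLU network trained with full-batch gradient descent, uses scikit-learn's bundled digits dataset as a lightweight stand-in for MNIST, builds two tasks with a tunable class overlap (here one shared class), and measures orthogonality via the cosines of the principal angles between the top principal subspaces of the hidden-layer activations; cosines near zero indicate orthogonalized representations.

```python
# Minimal sketch under the assumptions stated above; all names and
# hyperparameters here are illustrative, not the authors' setup.
import numpy as np
from sklearn.datasets import load_digits

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X = X / 16.0  # digits pixel intensities are 0..16; scale to [0, 1]

def make_task(classes):
    """Small classification task: one-hot targets over a chosen class subset."""
    mask = np.isin(y, classes)
    Xt, yt = X[mask], y[mask]
    T = np.eye(len(classes))[np.searchsorted(classes, yt)]
    return Xt, T

def train(Xt, T, W1, W2, epochs=200, lr=0.1):
    """Full-batch gradient descent on MSE for a 64-32-k ReLU network."""
    for _ in range(epochs):
        H = np.maximum(Xt @ W1, 0.0)   # shared hidden layer
        Yhat = H @ W2                  # task-specific linear readout
        dY = (Yhat - T) / len(Xt)
        W2 -= lr * H.T @ dY
        dH = (dY @ W2.T) * (H > 0)
        W1 -= lr * Xt.T @ dH
    return W1, W2

def principal_angles(H1, H2, k=5):
    """Cosines of principal angles between the top-k principal subspaces
    of two activation matrices; values near 0 mean near-orthogonal."""
    U1 = np.linalg.svd(H1 - H1.mean(0), full_matrices=False)[2][:k].T
    U2 = np.linalg.svd(H2 - H2.mean(0), full_matrices=False)[2][:k].T
    return np.linalg.svd(U1.T @ U2, compute_uv=False)

# Two tasks with tunable class overlap (here the digit 3 is shared).
Xa, Ta = make_task(np.array([0, 1, 3]))
Xb, Tb = make_task(np.array([3, 7, 8]))

W1 = rng.normal(0, 0.1, (64, 32))
W1, _ = train(Xa, Ta, W1, rng.normal(0, 0.1, (32, 3)))  # task A first,
W1, _ = train(Xb, Tb, W1, rng.normal(0, 0.1, (32, 3)))  # then task B

Ha = np.maximum(Xa @ W1, 0.0)
Hb = np.maximum(Xb @ W1, 0.0)
print("cosines of principal angles:", np.round(principal_angles(Ha, Hb), 3))
```

Tracking these cosines per layer and per epoch, rather than only at the end of training, is what would reveal a cascade: in the hypothesis described above, lower-layer subspaces should decorrelate earlier in training than higher-level category layers.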