This post was originally published by Iulia Turc at Towards Data Science
The traditional von Neumann architecture distinguishes between a CPU (Central Processing Unit) and three levels of memory: registers — very fast, but limited to storing a handful of values; main memory (e.g. RAM) — slower, but with enough storage to hold the instructions and data of a running program; and external memory (e.g. a hard drive) — slow, but with room for virtually all the data used by a computer.
Memory-Augmented Neural Networks (MANNs) are differentiable versions of the von Neumann architecture (more on this in the next section). The bulk of the neural network can be thought of as the CPU. Certain architectures like RNNs (Recurrent Neural Networks) have built-in memory that is analogous to the registers, storing short-term information. The neural memory is separate from the rest of the model parameters and, similar to RAM, stores long-term information. It consists of an array of memory slots (i.e., a matrix) and, most commonly, stores continuous representations of information (text, images, etc.).
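To make this concrete, here is a minimal sketch of such a memory as a plain matrix; the names num_slots and slot_dim are illustrative, not taken from any particular paper:

```python
import numpy as np

# A toy neural memory: num_slots rows, each holding a slot_dim-dimensional
# continuous representation (an embedding).
num_slots, slot_dim = 128, 64
memory = np.zeros((num_slots, slot_dim), dtype=np.float32)  # shape (N, D)
```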
Figure: The interaction of the neural memory with the external world is mediated by a controller. Figure 1 in Graves et al. (2014).
The component that directly interacts with the neural memory via read and write operations is called a controller. In early work, the controller coincided with the rest of the model (i.e. all the parameters outside the memory), so it acted as the interface between the memory and the “external world”. It was often implemented as a recurrent neural network. More recently, with the advent of massive Transformer-based architectures, the controller is only a small subset of the model, and mediates the communication between the memory and the rest of the network.
What is a differentiable architecture?
MANNs are differentiable; von Neumann architectures are not — but what exactly does this mean? You might recall the following definitions:
A differentiable function of a real variable is a function whose derivative exists at each point in its domain — Wikipedia.
The derivative of a function of a real variable measures the sensitivity of the function value (output value) with respect to a change in its argument (input value) — Wikipedia.
Figure: The derivative of a function at a specific point. Image from Wikipedia.
The von Neumann architecture performs non-differentiable operations. For instance, consider a read operation: when the CPU fetches its next instruction from RAM, it specifies an address (the input) and receives back an instruction (the output). The input domain is thus unsigned integers, so the operation is not defined over a real variable. According to the definition above, differentiability is out of the question.
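For intuition, a hard read is essentially integer indexing; in the toy setup above, it might look like this (a sketch, not actual RAM semantics):

```python
import numpy as np

def hard_read(memory: np.ndarray, address: int) -> np.ndarray:
    # The address is a discrete integer, so the returned value is not a
    # differentiable function of the input.
    return memory[address]
```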
Making operations differentiable: soft reads and writes
The core reason why RAM reads are not differentiable is that they operate over a discrete space of addresses. Neural memories propose an adjustment:
Instead of reading from a single entry, perform a weighted read from all entries.
For each memory slot i, the controller specifies a real-valued weight wᵢ such that all weights sum up to 1. This changes the input of the read operation from a single integer value (the address) to a vector of real values (the per-slot weights), which is the first requirement for differentiability. Note that this revised operation is strictly more general: when a single weight is set to 1.0 and all others to 0.0, we are effectively reading from a single entry. The same reasoning applies to writes: instead of writing a value x to a single memory slot, we update each entry i by a weighted value wᵢ * x.
These operations are called soft reads and writes, due to the continuous nature of the weights wᵢ. This contrasts with the hard reads and writes to RAM, where the controller makes a hard decision regarding the memory slot to operate on.
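Using the same toy memory matrix of shape (N, D) as above, a soft read and a simplified additive soft write could be sketched as follows (real architectures such as the Neural Turing Machine also include an erase step in the write, omitted here):

```python
import numpy as np

def soft_read(memory: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # weights: shape (N,), non-negative, summing to 1.
    # A weighted combination of all slots; a one-hot weight vector
    # recovers a hard read of a single slot.
    return weights @ memory

def soft_write(memory: np.ndarray, weights: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Add w_i * x to every slot i (simplified additive write).
    return memory + np.outer(weights, x)
```

Both functions are differentiable with respect to the weights and the memory contents, which is exactly what gradient-based training needs.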
Computing the soft weights: content- vs location-based addressing
RAM is accessed based on location — read operations specify the exact address to read from. Neural memories are typically accessed based on content — queries specify what to read, not where to read from.
How does the controller compute the per-slot weights wᵢ?
First, a note on terminology: the mechanism to compute weights wᵢ is often referred to as memory addressing, since it determines which memory slots are addressed, and how much attention is paid to each. Addressing neural memories can be done based on content or location.
Figure: Illustration by the author.
With content-based addressing, the weights wᵢ reflect how relevant the content of slot i is in resolving an incoming query. For instance, for a question answering task, the memory query could be an embedding of the actual question. The controller must then upweight the memory slots that are good answer candidates. Most commonly, wᵢ is the dot product or cosine similarity between the embeddings of the content in slot i and the query. Finally, all weights are normalized via softmax so that they sum up to 1.
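As a minimal sketch (assuming the query embedding has the same dimensionality as a memory slot), content-based weights can be computed with a dot product followed by a softmax; using cosine similarity would simply mean normalizing the vectors first:

```python
import numpy as np

def content_weights(memory: np.ndarray, query: np.ndarray) -> np.ndarray:
    # memory: (N, D), query: (D,). Score every slot against the query ...
    scores = memory @ query
    # ... then normalize the scores into weights that sum to 1 (softmax).
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```

A content-based soft read is then just soft_read(memory, content_weights(memory, query)).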
With location-based addressing, the weights wᵢ reflect how much attention should be paid to location i, irrespective of its content. Although less common, neural memories are designed to support location-based addressing when the task at hand involves simple logical or arithmetic operations, like adding two variables x and y — in this case, the controller should be able to retrieve the operands from memory regardless of their exact values. In the following episodes, I will dive deeper into the specifics of computing location-based weights wᵢ.
Reducing the computational cost: sparse reads and writes
Making reads and writes differentiable comes with a computational cost. Each query is now resolved in linear time O(N), where N is the number of memory slots (in contrast, hard reads and writes take constant O(1) time). When the input to a network is a sequence (e.g. a text document) of length L, it is common to make one query for each element in the sequence — which brings the cost up to O(N*L). During training, soft reads and writes are also memory-inefficient; computing gradients for the entire memory requires making a copy of it.
Follow-up research focused on reducing the O(N) cost to either O(log N) (Rae et al., 2016) or O(sqrt N) (Lample et al., 2019). While the two approaches are quite different, their common ground is to operate on a subset of the memory, as opposed to all entries. In other words, they limit the number of non-zero weights wᵢ to a small constant K (somewhere between 2 and 8), and apply gradient descent only on the slots with non-zero weight, as in the sketch below.
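Here is a naive sketch of the idea, using the same toy memory as above: only the top K slots receive non-zero weights, so gradients flow into just those K rows. Note that this sketch still scores all N slots; the cited papers rely on extra structure (approximate nearest-neighbour search in Rae et al., product keys in Lample et al.) to avoid touching every entry.

```python
import numpy as np

def sparse_read(memory: np.ndarray, query: np.ndarray, k: int = 4) -> np.ndarray:
    scores = memory @ query
    top_k = np.argpartition(scores, -k)[-k:]           # indices of the K highest scores
    exp = np.exp(scores[top_k] - scores[top_k].max())  # softmax over the K selected slots only
    weights = exp / exp.sum()
    return weights @ memory[top_k]                     # only these K rows receive gradients
```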
Memory-Augmented Neural Networks have shown promising results on artificial tasks (e.g. learning to copy a sequence a given number of times), some natural language tasks (question answering, machine translation), and some computer vision tasks (character recognition). However, they have yet to become mainstream. There are interesting research opportunities in multiple directions: reducing their computational cost, speeding up training, understanding under what circumstances they are most useful, and integrating them with state-of-the-art Transformers.
The rest of this series will reflect upon what’s been done and where to go next. If you would like me to cover a particular paper, feel free to leave a comment below.
- Graves et al., Neural Turing Machines (2014)
- Rae et al., Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes (2016)
- Lample et al., Large Memory Layers with Product Keys (2019)