Synaptic Theory of Chunking in Working Memory

Working memory (WM) is hypothesized to be a distinct capacity for holding and manipulating multiple pieces of information, which is crucial for human cognitive abilities such as verbal communication, reading comprehension, and abstract reasoning [1–3]. Paradoxically, however, people typically cannot simultaneously hold more than four items in WM [4]. For example, repeating several words or digits is practically effortless and mistake-free, but for lists of five random words, people begin making mistakes [5–7]. How, then, are people able to process much larger streams of inputs, such as long passages of text or movies? One attractive idea is chunking, i.e., organizing several items into higher-level units [8–13]. Sometimes chunks are stored in long-term memory due to previous experience [14, 15]; e.g., familiar expressions like “Oh my God” or “Easier said than done” can be processed as coherent units rather than individual words. These pre-existing chunks can be thought of as having stable memory representations learned and consolidated over time, and can therefore be encoded and processed as single items. Conceptually more challenging, however, is the phenomenon of spontaneous chunking, where novel combinations of items are grouped into separate units “on the fly”, as when a phone number is divided into chunks of 2-3 digits each, or words in a sentence are combined into units based on their syntactic role, such as “a little boy – was dressed – in a green shirt”. Indeed, this sentence is much easier to remember than a random sequence of nine words. Surprisingly, a minor manipulation like introducing slight pauses between presentations of consecutive groups of items is enough to trigger chunking and the corresponding increase in capacity [16–19]. In this study, we addressed two interrelated questions inspired by the above considerations: how spontaneous chunking might emerge in the brain, and what limit, if any, there is on the number of items that can be held in WM when spontaneous chunking is engaged.

Neuronal mechanisms of WM and the origin of WM capacity are still under debate. While the most accepted theory assumes that WM is carried by persistent activity of item-specific neurons [20–23], we propose that a more economical and robust mechanism is to rely on short-term plasticity (STP) in item-specific synapses [24] (see [25] for a recent review of activity-silent WM). When several items are loaded into WM, rather than having all of the neurons persistently active, information could be maintained by periodic reactivations of the corresponding clusters in the form of population spikes [24, 26]. After each reactivation of a certain cluster, the recurrent self-connections in this cluster remain facilitated, allowing it to bounce back after a period of silence when other clusters activate. The largest possible number of co-active clusters, i.e., the WM capacity, is determined in this theory by the longest possible time between consecutive reactivations of each cluster, which in turn depends on the STP time constants [26]. In the current contribution, we extend the STP theory of WM by including longer-lasting forms of facilitation, such as synaptic augmentation (SA) [27]. In [28], it was shown that due to its slow build-up, the SA level in recurrent self-connections encodes the order of presentation of stimuli in WM. While SA does not significantly change the maximal possible number of coactivating clusters, i.e., the basic WM capacity, it allows the network to selectively switch some of the clusters off for a longer period of time, without fully erasing information about their prior activity from the recurrent self-connections [28]. Here, we will show that SA enables consecutive chunks to be activated one after another by switching on and off specialized chunking clusters that serve as controllers, thereby enhancing the effective WM capacity. In the next section, we demonstrate this mechanism in a simplified neural network model of WM and show how much WM capacity can be increased by chunking compared to the basic regime.

Network model of working memory and chunking

Following our previous work on the synaptic theory of working memory [24, 26, 28], we consider a recurrent neural network (RNN) model where memory items are represented by specific clusters of excitatory neurons coupled to a global inhibitory neural pool; see Fig. 1(a) and Methods Sec. A. The feedback inhibition is assumed to be strong enough that, at any given moment, only one excitatory cluster can be active. To simplify the model, we neglect the overlaps between the stimulus-specific clusters, such that each cluster µ can be described by a single activity variable Rµ(t), corresponding to the average firing rate of its neurons at a given moment. Furthermore, we assume that all the recurrent self-connections are dynamic [29, 30], i.e., the instantaneous synaptic efficacy depends on the pre-synaptic activity within a certain time window due to a combination of short-term synaptic depression and facilitation: J^Self(t) = u(t)x(t)A [29], where A is the amplitude of the recurrent strength, u(t) is the current value of the release probability, and x(t) is the current fraction of the maximal amount of neurotransmitter that is available for release.
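
To make these synaptic dynamics concrete, the following minimal Python sketch integrates the facilitation and depression variables for a single cluster, assuming the standard Tsodyks-Markram form [29, 30]; the parameter values are illustrative placeholders rather than the values used in our simulations (Table I).

```python
import numpy as np

# Illustrative placeholder parameters (not the values of Table I)
U, tau_f, tau_d = 0.3, 1.5, 0.3  # baseline release prob.; facilitation/depression constants (s)
A = 1.0                          # amplitude of the recurrent strength (held fixed here)
dt = 1e-3                        # Euler integration step (s)

def step_stp(u, x, R):
    """One Euler step of facilitation (u) and depression (x) driven by the firing rate R (Hz)."""
    du = (U - u) / tau_f + U * (1.0 - u) * R
    dx = (1.0 - x) / tau_d - u * x * R
    return u + dt * du, x + dt * dx

# Drive the cluster with a brief 50-Hz burst: the efficacy J_self = u*x*A
# dips during the burst (depression dominates) and then transiently exceeds
# its baseline while u remains facilitated and x has already recovered.
u, x = U, 1.0
for t in np.arange(0.0, 2.0, dt):
    R = 50.0 if 0.1 <= t < 0.15 else 0.0
    u, x = step_stp(u, x, R)
    J_self = u * x * A
```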

Illustration of the hierarchical working memory model.

(a) Network architecture. Stimulus clusters and chunking clusters both have recurrent self-excitation (thick sharp arrows) and reciprocal connections to the global inhibitory pool (not shown). Chunking clusters have dense but weak connections to the stimulus clusters (thin blunt arrows in the background). (b) Effective network architecture after presentation. Activity in the network selectively augments connections between stimuli within chunks and the corresponding chunking clusters, effectively forming a hierarchical structure. (c) Dynamics of the recurrent self-connections. Upon the arrival of pre-synaptic inputs (top panel), the release probability u increases, and the fraction of available neurotransmitters x decreases (left axis of the middle panel). The amplitude of the recurrent strength A gradually increases with each reactivation of the cluster (right axis of the middle panel). As a result, the total synaptic efficacy of the recurrent self-connection, J^Self = uxA, oscillates (bottom panel). Activity traces are taken from the first stimulus cluster of the top panel of (d) below. (d) Network simulation. The first three memories are colored in blue, and the other three memories are colored in green. Shades represent external input to the clusters. Top: Memories are loaded at a uniform speed; chunking clusters are not activated. Only four of the six memories remain active in WM. Bottom: Slight pauses after chunks activate the chunking clusters, which inhibit the stimulus clusters presented before the pause. All memories are retrieved chunk-by-chunk in the retrieval stage. The full activity traces of the synaptic variables are presented in Fig. S1.

When the cluster’s activity is high, the release probability in the corresponding recurrent connections (u) increases above its baseline level U, constituting short-term facilitation, and the fraction of available neurotransmitters (x) decreases, representing short-term depression (Fig. 1(c)). When the cluster activity is low, both u and x relax towards their baseline values with time constants τf and τd, respectively (Methods Sec. A). Such transient changes in the synapses are well documented experimentally and are reported to last on the order of hundreds of milliseconds to seconds [27, 29, 31, 32].

The RNN detailed in Methods Sec. A exhibits different dynamical regimes, depending on the STP parameters and external background input. In particular, as shown in [24, 26], at high background input level, there exists a persistent activity regime where clusters have sustained elevated firing rates corresponding to loaded memory items. As the background input is lowered, there exists a low-activity regime with cyclic behavior where items that were loaded into the network via external stimuli are maintained in WM in the form of sequential brief reactivations called population spikes [33]. As the number of loaded memories increases, the network eventually fails to maintain some of them, i.e., there is a maximal number of items that can be maintained in the WM, C, which depends on the synaptic-level parameters of the RNN [26].

In addition to short-term facilitation and depression, experiments have observed a longer-lasting form of synaptic facilitation in cortical synapses, called synaptic augmentation (SA), characterized by a build-up that is slow compared to STP and a decay over tens of seconds [31, 32, 34–36]. We introduce SA as a small transient change in the synaptic strength A, which increases from its baseline value due to cluster activity, similar to u, but with a much longer time constant τA ≫ τf [28] (see Fig. 1(c)).
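
Within the same sketch, SA can be added as a slow dynamical amplitude; the update below follows the augmentation equation we assume in Methods Sec. A (Eq. (5)), again with placeholder constants.

```python
# Synaptic augmentation: the amplitude A slowly builds up with activity and
# relaxes back to its baseline A_min over tens of seconds (tau_A >> tau_f).
# All constants are illustrative placeholders.
A_min, A_max = 1.0, 1.4
tau_A, kappa_A = 20.0, 0.01
dt = 1e-3

def step_augmentation(A, R):
    """One Euler step of the augmentation variable A driven by the firing rate R (Hz)."""
    dA = (A_min - A) / tau_A + kappa_A * (A_max - A) * R
    return A + dt * dA
```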

The main modification of the current model compared to our earlier work is the introduction of distinct excitatory/inhibitory “chunking” clusters, which serve to control the stimulus clusters. Both stimulus clusters and chunking clusters have recurrent excitatory self-connections. Each time the system receives a chunking cue (e.g., when there is a temporal pause in stimulus presentations), one of the chunking clusters is activated and quickly suppresses the currently active stimulus clusters, effectively grouping them into a chunk (Fig. 1(b)). Subsequent stimulus clusters are then free to be loaded into the network until the next chunking cue is received and another chunking cluster is activated. At the end of the presentation, only chunking clusters reactivate cyclically while all the stimulus clusters are inhibited (Fig. 1(d)).

Chunking increases working memory capacity

The main idea of the proposed chunking mechanism is that the chunking clusters can selectively activate and suppress the stimulus clusters, so that at no point in time do more than a small number of stimulus clusters reactivate as population spikes, thus not exceeding the basic WM capacity. Due to synaptic augmentation, stimulus clusters that are currently suppressed by the chunking clusters still have stronger recurrent self-connections than the ones that were not active at a given trial as long as augmentation has not disappeared. Therefore, the network can retrieve temporarily suppressed items by sequentially switching off the chunking clusters, releasing the suppressed stimulus clusters within the corresponding chunk from inhibition.

To demonstrate the chunking mechanism, we simulate a network of 16 clusters (both stimulus and chunking), 6 of which are activated consecutively with transient external input (presentation stage, the shades in Fig. 1(d)). We first consider continuous presentation of the 6 inputs to the stimulus clusters with no chunking activated. At the end of the presentation, 4 of the corresponding clusters remain active in the form of periodic population spikes while the two other clusters drop out of WM, corresponding to a WM capacity of 4 for the chosen parameter values, similar to [26] (the top panel of Fig. 1(d)). Now consider presenting the same six memory items, but with a slightly longer interval between the presentation of the 3rd and 4th items, during which a chunking cluster is activated (shown in red in the bottom panel of Fig. 1(d)). We assume that the chunking cluster quickly inhibits the three stimulus clusters that were presented before it (the three blue colors) and remains the only active cluster until the items of the next chunk are presented to the network (shown in green). A second chunking cluster is then activated, shown in purple. This way, the network effectively binds the stimulus clusters in each chunk to their corresponding chunking cluster (Fig. 1(b)). Such group-specific binding is akin to gating [37], where the activity of each chunking cluster gates the entire chunk of stimulus clusters via inhibition.

We assume that the fast inhibition between chunking clusters and corresponding stimulus clusters happens through strengthening the existing dense but weak inhibitory synapses between them (Fig. 1(a)). After all stimuli are presented, the network maintains reactivations of two chunking clusters while the synaptic variables of the stimulus clusters slowly decay to their baseline values. However, if the chunking cluster is suppressed within the augmentation time-window τA, the items that were inhibited by it will bounce back (Fig. 1(d) bottom panel, the blue colors in the retrieval stage). At this point in time, four clusters are active: the second chunking cluster and three stimulus clusters from the first chunk, with all items from the first chunk being successfully retrieved. When the second chunk is to be retrieved, the first chunking cluster is again activated by control input while the second chunking cluster is suppressed, allowing the stimulus clusters from the second chunk to activate. This chunking scenario allows the retrieval of all six memory items while at any given moment in time, the network maintains no more than four active clusters, not exceeding basic WM capacity. In this way, chunking increases effective working memory capacity by reducing the concurrent load on working memory, at the expense of activating higher-level representations (chunking clusters).

Above, we chose to illustrate the chunking mechanism in the periodic activity regime because the mechanistic effects of chunking clusters are most apparent with regular firing traces. Nevertheless, our proposed chunking mechanism applies to both the persistent-activity and periodic-activity regimes, with chunking clusters serving the same function in each. Note that although we model the chunking cues here as slight pauses between presentations, in general chunking can be triggered by other cues, such as variations in intonation or semantic content. The idea that chunking reduces the load on working memory was first introduced in the psychology literature [9, 14, 38]. Subsequently, neuroimaging studies observed that chunking reduces neural activity in upstream brain regions that process raw stimuli but increases activity in downstream regions associated with higher-level representations [18, 39, 40], which is consistent with our proposed mechanism.

Hierarchical chunking predicts a new capacity

Our model assumes that several stimulus clusters are grouped into chunks by chunking clusters. A natural question then arises: Can chunks also form meta-chunks? If so, is there a limit to how many levels of such hierarchical representations can be formed in working memory? Here we argue that the answers to both questions are affirmative, and moreover, one can derive a surprisingly simple formula for the largest possible number of items in WM (Methods Sec. B):

M = 2^{C−1},    (1)

where C is the basic WM capacity in the absence of chunking. As mentioned above, C corresponds to the number of active clusters that can be maintained in the RNN model (the top panel of Fig. 1(d) illustrates the case of C = 4), and it depends on all the synaptic-level parameters (Methods Sec. A) [26, 28].

Eq. (1) is a direct consequence of the limited amount of activity that the working memory network can sustain (Methods Sec. B) and does not depend on specific STP mechanisms. Therefore, we expect Eq. (1) to hold in working memory models with a similar architecture but a possibly different microscopic implementation than that of Methods Sec. A. Eq. (1) defines a new capacity for working memory that accounts for hierarchical chunking. Thus, we refer to M as the new magic number, in the original spirit of Miller [14].

Below we illustrate how the limit of C co-active clusters in the network constrains the total number of memory items that can be maintained and retrieved in WM. Let us consider the example corresponding to C = 4, with a capacity of M = 2^{4−1} = 8. In this case, the optimal chunking structure is a binary tree with three levels (Fig. 2(a)).

Memory retrieval from a hierarchical structure.

(a) Top: Schematic of an emergent hierarchy of three levels. The top node (black) denotes the global inhibitory neural pool. The first two levels represent chunking clusters, and the lowest level represents stimulus clusters. Grey stripes denote the clusters that need to be suppressed to retrieve the 1st chunk. Blue dashed circles represent clusters that are active during the retrieval of the 1st chunk. Bottom: Architecture of the underlying recurrent neural network. (b) Simulation of the network in (a). R^{(k)}: activity traces of firing rates, color-coded to match the corresponding clusters in (a). The time-course of the traces is labeled as chunks (stimulus clusters), pauses (chunking clusters), and long pauses (meta-chunking clusters). I_b^{(k)}: traces of the background input currents. Decreasing the background input to a cluster at level k suppresses its reactivation and removes the inhibition on its child clusters at level k + 1.

Eight memories are loaded as four chunks of two into the working memory network (the third panel in Fig. 2(b)): a slight pause between items of different colors (such as the 2nd and 3rd items) serves as the chunking cue that activates the chunking clusters (the second panel in Fig. 2(b)), binding item clusters in pairs, similar to the chunks in the bottom panel of Fig. 1(d). However, here we introduce a slightly longer pause between the 4th and 5th items, during which a chunking cluster binding items 3 and 4 into a chunk is first activated, quickly followed by the activation of another chunking cluster that groups the first two chunks into a meta-chunk (the first panel in Fig. 2(b)). In this way, after the presentation of the eight items, we have two meta-chunks, giving rise to a tree-like hierarchical structure of three levels (Fig. 2(a)).

To differentiate clusters at different levels of the hierarchy, we denote the ith stimulus cluster at the k = 3 level as S_i^{(3)} and its activity as R_i^{(3)}, where i is the order within the level during presentation. Similarly, the mth chunking cluster at levels k = 1, 2 is denoted as S_m^{(k)}, with activity R_m^{(k)}. The process of meta-chunking introduced in the previous paragraph can be described more precisely as follows. Immediately after presenting S_4^{(3)} (item 4), chunking cluster S_1^{(2)} is active and suppresses S_1^{(3)} and S_2^{(3)} (items 1 and 2, first two blue colors). Clusters S_3^{(3)} and S_4^{(3)} (items 3 and 4, first two green colors) are also active. Once the chunking signal is received, cluster S_2^{(2)} is activated and suppresses clusters S_3^{(3)} and S_4^{(3)}. Furthermore, when this suppression is established, another meta-chunking cluster S_1^{(1)} is activated and suppresses all the k = 2, 3 clusters that were active before it: S_1^{(2)} and S_2^{(2)}. When the presentation is finished (around t ~ 7.5 s in Fig. 2(b)), only the k = 1 level clusters (S_1^{(1)} in orange and S_2^{(1)} in yellow) remain active in the working memory.

As the retrieval begins (at t ≳ 8 s in Fig. 2(b)), S_1^{(1)} is suppressed by a drop in its background input I_b^{(1)}, indicated by grey stripes in Fig. 2(a). Subsequently, the k = 2 clusters become active, and the network now maintains three clusters: S_2^{(1)}, S_1^{(2)}, and S_2^{(2)}. In the next step, S_1^{(2)} (darker red) is suppressed, which reactivates the k = 3 stimulus clusters that were inhibited by it. Now, the first chunk has been retrieved, and the working memory maintains four clusters: S_1^{(3)}, S_2^{(3)}, S_2^{(2)}, and S_2^{(1)} (illustrated by the blue dashed circles in Fig. 2(a)-(b)). Other chunks can be retrieved in a similar manner by suppressing the corresponding higher-level chunking clusters. Note that even though hierarchical chunking reduces the load on working memory from eight to two, upon unpacking a chunk one still needs to maintain all the intermediate chunks that were not unpacked, in this case S_2^{(2)} and S_2^{(1)} (lighter red and yellow, respectively). Similar considerations result in the above expression for M in Eq. (1) (Methods Sec. B). In the following sections, we first examine the neuroscientific evidence for chunking clusters and then test the prediction of M against memory experiments.
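
Before turning to the experimental evidence, the bookkeeping of this retrieval scheme can be summarized abstractly, independently of the rate dynamics. The Python sketch below (an illustration of the scheme, not the RNN of Methods Sec. A) walks the balanced binary tree of Fig. 2(a), suppressing one ancestor of the target chunk at a time, and counts the co-active clusters after each step, confirming that the load never exceeds C = 4.

```python
K = 3  # levels below the inhibitory root: k=1 meta-chunks, k=2 chunks, k=3 stimuli

def children(node):
    """Children of node (k, m) in a balanced binary hierarchy (zero-indexed)."""
    k, m = node
    return [(k + 1, 2 * m), (k + 1, 2 * m + 1)] if k < K else []

def retrieval_loads(chunk):
    """Suppress the ancestors of bottom-level chunk 0..3 top-down and
    record how many clusters are co-active after each suppression."""
    path = [(k, chunk >> (K - 1 - k)) for k in range(1, K)]  # ancestors, top-down
    active = {(1, 0), (1, 1)}         # after presentation, only k=1 clusters reactivate
    loads = [len(active)]
    for node in path:
        active.remove(node)            # suppress via a drop in background input
        active.update(children(node))  # its children bounce back from inhibition
        loads.append(len(active))
    return loads

print(retrieval_loads(0))  # -> [2, 3, 4]: the load never exceeds C = 4
```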

Experimental evidence for the existence of chunking clusters

Segmentation of sensory stimuli in human memory has been extensively studied in behavioral experiments from the early days of cognitive neuroscience and psychology [14, 15], but its neural correlates have not been explored until recently [18, 41, 43–45]. The key assumption in the hierarchical working memory model is the existence of chunking clusters that segment stimuli into chunks. Our model predicts that chunking reduces the load on working memory through inhibition. Upon the firing of the chunking clusters, we expect to see a decrease in the average firing rate of the stimulus clusters. Furthermore, as stimuli continue to be presented after chunking, the average firing rate should gradually increase after the drop. Overall, the hierarchical working memory model predicts two qualitative features in the firing rates of the cluster of neurons that encode stimuli (such as in the bottom panel of Fig. 1(d)): (1) there should be a “dip” in the activities of stimulus clusters upon the firing of the chunking clusters; (2) there should be a continuous “ramping-up” of activities following the dip.

Thanks to advances in single-neuron recording technologies, we can now test our hypothesis using data collected from drug-resistant epilepsy patients [41]. Consider the experiment reported in [41], where subjects are asked to watch a series of movie clips, each consisting of two episodes separated by a “cut” in the middle of the movie. Such movie cuts serve to induce cognitive boundaries for event segmentation in episodic memory. The authors of [41] identified a group of neurons in the medial temporal lobe that fire selectively at these boundaries and termed them “cognitive boundary” neurons. If these neurons segment episodic memories in a manner similar to how chunking clusters segment working memory in our model, then we should also observe a decrease in the firing rates of the stimulus neurons upon the firing of the cognitive boundary neurons. In [41], although the boundary neurons can be unambiguously identified by aligning their responses to movie cuts, it is difficult to pinpoint stimulus neurons due to the continuous nature of the visual stimulus. Therefore, we study the putative effect of boundary neurons on the rest of the system by aggregating neurons that were detected but not classified as boundary neurons. We align all the neurons to the movie cuts. Upon averaging over subjects and trials, we find that about 130 ms after the peak in the firing rate of boundary neurons (top panel of Fig. 3(a)), there is a dip in the average activity of the rest of the recorded neurons (bottom panel of Fig. 3(a)). Furthermore, there is a continuous ramp-up of activity following the dip. This trend is also evident at the level of individual subjects (Fig. 3(b)), and qualitatively agrees with the prediction of our hierarchical working memory model. As a control, within the same recorded population we label the subset of neurons that respond to the onset of the movie clip as “onset” neurons. Aligning firing rates to the movie onset, we observe a peak in the onset neurons as reported in [41]; however, unlike Fig. 3(a), the remaining neurons (including boundary neurons) do not exhibit the dip-then-ramp pattern (Fig. 3(c)). This indicates that the dip and ramp-up are specific to boundary neurons and suggests an internal network mechanism rather than simple inhibitory feedback or statistical artifacts.
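
The analysis behind Fig. 3 is simple enough to state in a few lines. The Python sketch below shows the peri-event averaging we describe, with hypothetical variable names (the actual data layout is that of the DANDI dataset referenced in Methods Sec. E); as a simplification, the z-scoring here uses the peri-event window itself as the reference.

```python
import numpy as np

def peri_event_average(spikes_per_neuron, event_times, window=(-1.0, 2.0), bin_size=0.05):
    """Average z-scored firing rate around events (e.g., movie cuts).

    spikes_per_neuron: list of 1-D arrays of spike times (s), one per neuron
    event_times: 1-D array of event times (s)
    Returns (bin_centers, mean z-scored rate across neurons).
    """
    edges = np.arange(window[0], window[1] + bin_size, bin_size)
    centers = 0.5 * (edges[:-1] + edges[1:])
    traces = []
    for spikes in spikes_per_neuron:
        # Trial-averaged peri-event rate for one neuron
        trials = [np.histogram(spikes - t0, bins=edges)[0] / bin_size
                  for t0 in event_times]
        rate = np.mean(trials, axis=0)
        z = (rate - rate.mean()) / (rate.std() + 1e-12)  # z-score per neuron
        traces.append(z)
    return centers, np.mean(traces, axis=0)
```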

Cognitive boundary neurons in the medial temporal lobe.

(a) Average firing rates from the single-neuron recording data in [41]. The mean z-scored firing rates are plotted as solid lines, with one standard deviation shown as shading. Firing rates are averaged over all subjects and trials, and the relative time zero is chosen to be the location of the movie cut. Two qualitative features in the firing rates of the non-boundary neurons, a dip followed by a ramp, are predicted by the hierarchical working memory model. Top: Boundary neurons. Bottom: Non-boundary neurons. (b) Average firing rates of non-boundary neurons over all trials for individual subjects. Subjects are sorted based on the location of the dip. A trend similar to panel (a) is observed for each subject. For individual 2D plots, see Fig. S3. (c) Average firing rates of neurons aligned to the onset of the movie (relative time zero). After the peak in onset-specific neurons, the non-onset-specific neurons do not exhibit the dip-then-ramp pattern seen in panel (a). Top: Onset-specific neurons. Bottom: Non-onset-specific neurons.

Experimental tests of the new magic number

An important prediction of the hierarchical working memory model is the existence of an absolute limit M, beyond which perfect retrieval is impossible (Eq. (1)). One of the earliest studies to quantify this transition is the experiment performed by Miller and Selfridge [42] on the statistical approximation of language. In this experiment, the authors constructed n-gram approximations to English, in which each word occurs coherently with the previous n − 1 words. For example, a 1-gram approximation consists of words randomly chosen from a corpus. In a 2-gram approximation, each word appears coherently with the previous word, but coherence within any sliding window of three words is not required. As n increases, the constructed text gradually approaches natural text. In [42], subjects were presented with verbal materials constructed from such n-gram approximations and asked to recall the words. The fraction of recalled words f decreases with the length L of the material and increases with the degree of approximation n (Fig. 4(a) inset). Here, we are interested in the critical length Lc beyond which retrieval begins to be imperfect, i.e., f(Lc) = 1. Since the defining feature of working memory is the ability to perfectly retrieve items that are sustained in memory, Lc is a measure of working memory capacity.
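
For concreteness, a toy construction of an n-gram approximation is sketched below. Note that Miller and Selfridge generated their materials by having people supply a continuation given the preceding n − 1 words; the sketch instead samples from an n-gram table built from a corpus, which captures the same statistical idea.

```python
import random
from collections import defaultdict

def ngram_text(corpus_words, n, length, seed=0):
    """Sample a word sequence in which each word is conditioned on the
    previous n-1 words (n = 1 reduces to independent word sampling)."""
    rng = random.Random(seed)
    if n == 1:
        return [rng.choice(corpus_words) for _ in range(length)]
    table = defaultdict(list)  # (n-1)-word context -> observed continuations
    for i in range(len(corpus_words) - n + 1):
        table[tuple(corpus_words[i:i + n - 1])].append(corpus_words[i + n - 1])
    out = list(corpus_words[:n - 1])  # seed the chain with the first n-1 words
    while len(out) < length:
        key = tuple(out[-(n - 1):])
        out.append(rng.choice(table[key]) if key in table else rng.choice(corpus_words))
    return out[:length]
```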

The new magic number bounds perfect-recall performance on verbal memory.

(a) Fraction of recalled words as a function of the length of the presented text. Different shades of blue correspond to different n-gram approximations. Black represents natural text. Inset: Original data as presented in [42]. Main: The different n-gram approximation curves become straight lines in a semi-log plot and can be collapsed onto a single universal curve (red dashed line) by adjusting the individual intercepts. (b) Critical length of perfect recall as a function of the n-gram approximation. The location of the critical length Lc is determined by extrapolating the individual n-gram approximation curves to where f(Lc) = 1 using the universal slope. Different colored lines represent experiments in different languages. The grey dashed line corresponds to M = 2^{C−1} for C = 4.

In [42], the fraction of recalled words was less than one even for the smallest stimulus length. To estimate Lc, we replotted the data from [42] in a semi-log plot (with f as a function of log2 L) and observed that all the different n-gram curves are well approximated by straight lines. We hence collapsed all the curves onto a common line by adjusting the individual intercepts (red dashed line in Fig. 4(a)). We then used the slope of this line to extrapolate each n-gram approximation curve to its critical length Lc. We plot Lc as a function of n in Fig. 4(b). Lc increases with n as expected but starts to plateau around n = 4, saturating at roughly the predicted value of 8. Note that n = 0 corresponds to words randomly chosen from a dictionary and is dominated by rare words, many of which may not be familiar to the subjects. Therefore, the capacity for n = 0 is expected to be lower than that for common words, as in the case of n = 1. The same analysis of two replications of the Miller-Selfridge experiment in Danish and Hindi [46, 47] reveals similar trends. As n increases, the verbal material becomes more structured, which allows for the construction of hierarchical representations. Naively, one might expect the number of perfectly recalled items Lc to continue to increase with n, as more structured materials are generally easier to remember. However, we observe that the performance plateaus around n ~ 4. This may be because longer sentences need to be broken into smaller chunks to be stored in working memory, and there exists an optimal chunk size beyond which storage becomes inefficient and no longer improves memory. This observation qualitatively agrees with our theory in Eq. (1), and the value n ~ 4 at which the capacity saturates could correspond to the size of a meta-chunk in the optimal hierarchical scheme illustrated above. Furthermore, our prediction that natural texts are chunked into pairs of meaningful words resembles the empirical observation of collocations in language, such as adjective-noun, verb-noun, and subject-verb pairs [48–51].
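
The extrapolation amounts to a least-squares fit of f = a_n + b log2 L with one shared slope b and per-curve intercepts a_n, followed by solving f(Lc) = 1 for each curve. A minimal sketch, assuming the recall fractions have been read off as arrays:

```python
import numpy as np

def critical_lengths(lengths, recall_by_n):
    """Shared-slope fit f = a_n + b*log2(L); extrapolate each curve to f = 1.

    lengths: list lengths L at which recall was measured
    recall_by_n: dict mapping n -> array of recall fractions f at those lengths
    Returns dict mapping n -> critical length Lc.
    """
    logL = np.log2(np.asarray(lengths, dtype=float))
    ns = sorted(recall_by_n)
    rows, targets = [], []
    for j, n in enumerate(ns):                   # one intercept column per curve,
        for x, f in zip(logL, recall_by_n[n]):   # plus one shared slope column
            row = np.zeros(len(ns) + 1)
            row[j], row[-1] = 1.0, x
            rows.append(row)
            targets.append(f)
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    intercepts, slope = coef[:-1], coef[-1]
    return {n: 2.0 ** ((1.0 - a) / slope) for n, a in zip(ns, intercepts)}
```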

Notably, in Fig. 4(b), for all three languages, Lc saturates within the region predicted by M = 2^{C−1} with C = 4 [4]. Therefore, we conclude that the recall performance of verbal materials from working memory agrees with the prediction of our new magic number.

Chunking is classically believed to be a crucial process for overcoming the extremely limited working memory capacity. In the current contribution, we suggest a simple mechanism of chunking in the context of the synaptic theory of working memory. The proposed mechanism relies on the ability of the system to temporarily suppress groups of items without permanently erasing them from WM, which is enabled by a longer-term form of synaptic facilitation called synaptic augmentation. For chunking to work properly in the model, the system has to utilize separate neuronal clusters, which we call “chunking clusters”, each effectively combining a group of several items into a distinct chunk. Moreover, the activity of the chunking clusters has to be controlled in order to allow the suppression and reactivation of subsequent chunks at the right times, so as to avoid saturating working memory capacity at any given moment. In particular, each chunking cluster has to be activated right after all of the corresponding stimuli are presented, and later suppressed for them to be retrieved. Our model has no explicit mechanism for this hypothesized control of chunking clusters; we speculate that it could be triggered by corresponding cues, e.g., chunking clusters could be activated by extra temporal pauses or intonational accentuation, and suppressed by internally generated retrieval signals. While further experimental and theoretical studies are needed to elucidate these suggestions, the existence of specialized chunking neurons has some recent neurophysiological support from electrical recordings in epileptic patients, where neurons responding to cuts in video clips were identified. We analyzed the data collected in these experiments and found that the activity of these and other neurons during clip watching is broadly consistent with our model predictions.

Apart from proposing a biological mechanism of chunking in working memory, we considered the question of whether a hierarchical organization of items in working memory could emerge from the subsequent chunking of chunks. Indeed, we demonstrated that the model allows for such a hierarchical scheme; however, due to the limited working memory capacity, the overall number of items that can be retrieved is still constrained, even for the optimal chunking scheme. We derived a universal relation between the capacity and the maximal number of retrievable items, which we call a magic number following the classical Miller paper [14]. In particular, this relation predicts a new magic number of 8 for a working memory capacity of 4, which is currently accepted as the best estimate of capacity. The chunking scheme achieving this limit corresponds to dividing the inputs into 4 chunks of 2, with two “meta-chunks”, each consisting of two chunks. We reanalyzed the results of a memory study in which subjects were presented with progressively higher-order approximations of meaningful passages for recall, and found that the average maximal number of words that could be fully recalled was indeed close to the predicted value of 8, and that this number saturated for a 4th-order approximation of meaningful passages, corresponding to the size of a “meta-chunk” in the optimal chunking scheme predicted by the model. While encouraging, more studies should be performed to elaborate on this issue, in particular to more directly demonstrate the ability of subjects to form chunks of chunks during working memory tasks.

Our theory and the proposed neural network mechanism attempt to bridge the microscopic level of neural activities and the macroscopic level of behaviors in the context of hierarchically-structured memories. Our analytical results and data analysis methods offer new perspectives on classical results in cognitive neuroscience and psychology. The proposal of a hierarchical structure in working memory can open many new directions. For instance, long-term memory is usually organized in a hierarchical manner, as reflected in our ability to gradually zoom into increasingly fine details of an event during recall [52]. While working memory underlies our ability to construct such hierarchical representations, little is known about how the transient tree-like structure in working memory is related to the hierarchy in long-term memory. Furthermore, one of the hallmarks of fluid intelligence — the ability to compress and summarize information — is also related to re-coding information in a hierarchical manner [53]. Understanding how our mind is capable of making use of hierarchical structures for complex cognitive functions such as summarization and comprehension remains an important open question.

A. RNN model for hierarchical working memory

As illustrated in Fig. 1(a), the recurrent network that implements WM has three functionally distinct types of neuronal populations: stimulus clusters that encode different items (indexed by i below), chunking clusters (indexed by m), and a single inhibitory neural pool indexed by I. The WM implementation is based on the previously introduced synaptic theory of working memory [24, 26, 28]. All stimulus and chunking clusters exhibit short-term synaptic plasticity in their recurrent self-connections, such that the instantaneous strength of the self-connection of cluster µ (where µ runs over both i and m) is given by

Jµ^Self(t) = uµ(t) xµ(t) Aµ(t),    (2)
where A is the amplitude of the recurrent strength, u is the probability of release, and x is the fraction of available neurotransmitters; all three factors depend on time via the following dynamical equations reflecting different STP processes:

duµ/dt = (U − uµ)/τf + U (1 − uµ) Rµ,    (3)
dxµ/dt = (1 − xµ)/τd − uµ xµ Rµ,    (4)
dAµ/dt = (Amin − Aµ)/τA + κA (Amax − Aµ) Rµ,    (5)
where Rµ is the activity of cluster µ; U is the baseline value of the release probability; τf, τd and τA are the time constants of synaptic facilitation, depression and augmentation, respectively; Amin, Amax and κA are parameters of synaptic augmentation that distinguish this model from earlier versions. Apart from the self-connections, each stimulus and chunking cluster is reciprocally connected to the inhibitory pool, and some of the chunking clusters develop quick inhibition onto groups of stimulus clusters, as explained below. The activity of each cluster is determined as a non-linear gain function of its input, and all inputs satisfy the following standard dynamics:

τ dhµ/dt = −hµ + Jµ^Self Rµ − wEI RI − Σν Jµν Rν + Ib + Ie,    Rµ = R(hµ),    (6)
τ dhI/dt = −hI + wIE Σµ Rµ,    RI = R(hI),    (7)

where τ is the time constant of the input dynamics and R(h) = α ln(1 + exp(h/α)) is a soft threshold-linear gain function. Ib stands for external background inputs from other regions of the brain that reflect the general level of activity in the network, and Ie is the external input used to load memory stimuli. wEI and wIE define the strength of the feedback inhibition between the stimulus and chunking clusters and the global inhibitory pool, and Jµν denotes the (initially weak) inhibitory connections from the chunking clusters onto the stimulus clusters. Furthermore, we assume that when a chunking cluster m gets activated by a chunking cue at tc during the presentation, the weak inhibitory synapses are selectively strengthened between the chunking cluster and the stimulus clusters i of the same chunk presented before it:

Jim(t) = Jmax for t > tc, for all stimulus clusters i presented between the previous chunking cue and tc (and Jim = 0 otherwise).    (8)
See Fig. S1 for an illustration of the synaptic matrix before and after chunking. For the hierarchical structure in Fig. 2(b), we generalize Eq. (8) to higher-level chunking clusters, such that the kth level chunking clusters inhibit all the lower-level clusters presented before them (both chunking and stimulus).

The detailed synaptic mechanism for behavioral-timescale plasticity such as Eq. (8) is the subject of much active research [54–58]. Here, in the RNN model, we do not attempt to explain its mechanism but rather assume that it takes place via external control. The microscopic implementation of Eq. (8) is not crucial to the proposed chunking mechanism, and in Methods Sec. D, we present additional RNN simulations that adopt a possible implementation of Eq. (8) and achieve activity traces similar to those in Fig. 1(d) and Fig. 2(b).
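
For completeness, one Euler step of the full rate dynamics, Eqs. (2)-(7), can be written compactly as below. The sketch is illustrative (placeholder constants, and the inhibitory matrix J is assumed to have been set by the chunking rule of Eq. (8)); it is not the exact code behind Figs. 1 and 2.

```python
import numpy as np

# Illustrative placeholder constants (not the values of Table I)
N = 16                                  # number of stimulus + chunking clusters
U, tau_f, tau_d, tau_A = 0.3, 1.5, 0.3, 20.0
A_min, A_max, kappa_A = 1.0, 1.4, 0.01
tau, alpha = 0.01, 1.5                  # input time constant; gain smoothness
w_EI, w_IE = 1.0, 1.0                   # feedback inhibition strengths
dt = 1e-3

def gain(h):
    """Soft threshold-linear gain R(h) = alpha * ln(1 + exp(h/alpha))."""
    return alpha * np.logaddexp(0.0, h / alpha)

def euler_step(state, J, I_b, I_e):
    """One Euler step of Eqs. (2)-(7). J[i, m]: inhibition from cluster m onto i."""
    h, h_I, u, x, A = (state[k] for k in ("h", "h_I", "u", "x", "A"))
    R, R_I = gain(h), gain(h_I)
    J_self = u * x * A                                        # Eq. (2)
    inp = J_self * R - w_EI * R_I - J @ R + I_b + I_e         # total recurrent input
    state["h"] = h + dt * (-h + inp) / tau                    # Eq. (6)
    state["h_I"] = h_I + dt * (-h_I + w_IE * R.sum()) / tau   # Eq. (7)
    state["u"] = u + dt * ((U - u) / tau_f + U * (1 - u) * R)                # Eq. (3)
    state["x"] = x + dt * ((1 - x) / tau_d - u * x * R)                      # Eq. (4)
    state["A"] = A + dt * ((A_min - A) / tau_A + kappa_A * (A_max - A) * R)  # Eq. (5)
    return state

state = dict(h=np.zeros(N), h_I=0.0, u=np.full(N, U), x=np.ones(N), A=np.full(N, A_min))
```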

B. The new magic number

At any given moment, the network cannot maintain more than C active clusters (Fig. 1(d) top panel illustrates the case of C = 4), and we refer to C as the basic working memory capacity. Even though we can potentially encode an arbitrarily deep hierarchical representation, C nevertheless constrains how many stimulus clusters can be retrieved. To understand the consequence of this constraint, we abstract away from the recurrent neural network and consider the effective hierarchical representation entailed by its activity (Fig. 2(a)).

Let us denote the size of the mth chunk at the kth level (1 ≤ k ≤ K) as c_{k,m}, which is the same as the branching ratio of its parent node. For example, the effective tree-like hierarchical structure in Fig. 2(a) has four chunks of two stimulus clusters at the k = 3 level. It proves instructive to first consider a slightly simplified setting where, at a given level k, all the chunk sizes are equal, c_{k,m} = c_k for all chunks m (e.g., all four chunks of stimulus clusters at the bottom level of Fig. 2(a) have size 2). Later, we will relax this assumption and show that the result derived below still holds.

To retrieve a chunk from the bottom of the hierarchy, i.e., the stimulus clusters that encode actual memories, we need to suppress the nodes upstream of the desired chunk. As a result, the children of each suppressed node become reactivated. A series of suppressions from the top to the bottom of the hierarchy requires the working memory to simultaneously maintain the c_K stimulus clusters of the bottom-level chunk, as well as c_k − 1 chunking clusters from each level k above it (1 ≤ k < K) that were not suppressed but became active due to the suppression of their parent. However, the total number of clusters that can be maintained must not exceed C (e.g., the total number of clusters enclosed by the blue dashed circles in Fig. 2(a) should not exceed 4),

c_K + Σ_{k=1}^{K−1} (c_k − 1) ≤ C.    (9)
Meanwhile, the total number of stimulus clusters encoded in the hierarchical structure is

M = Π_{k=1}^{K} c_k.    (10)
To achieve maximum capacity, we maximize Eq. (10) subject to the constraint in Eq. (9). Using the arithmetic and geometric mean inequality, we arrive at

M = Π_{k=1}^{K} c_k ≤ ((C + K − 1)/K)^K ≡ Mc(K),    (11)
where the equality is saturated when the branching ratios (chunk sizes) c_k at all levels are equal,

c_k = (C + K − 1)/K.    (12)
We notice that Mc(K) monotonically increases with K. However, the chunk size c_k needs to be an integer no smaller than 2, which by Eq. (12) requires K ≤ C − 1. We therefore have the optimal level K and optimal branching ratio c:

K = C − 1,    c = 2.    (13)
Substituting Eq. (13) into Eq. (11), we arrive at the capacity

M = 2^{C−1}.    (14)
Next, let us consider relaxing the simplifying assumption c_{k,m} = c_k. Without loss of generality, suppose that at the kth level, c_{k,m} > c_{k,m+1}. In order to retrieve the mth chunk at this level, the WM needs to maintain at least c_{k,m} clusters, which implies that when trying to retrieve the (m + 1)th chunk the WM is not saturated, because all the levels above the kth are identical for the mth and (m + 1)th chunks. This is sub-optimal, since our goal is to maximize M. Therefore, c_{k,m+1} can be increased to be at least as large as c_{k,m}. The same logic can be applied recursively to all levels of the hierarchy, which demands that the optimal hierarchical structure for maximum M has c_{k,m} = c_k, so we again arrive at M in Eq. (14).
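
Eq. (14) can also be checked numerically by brute force, enumerating all level-wise chunk sizes c_k ≥ 2 that satisfy Eq. (9) and maximizing Eq. (10):

```python
from itertools import product
from math import prod

def max_items(C):
    """Maximize prod(c_k) subject to c_K + sum_{k<K}(c_k - 1) <= C, c_k >= 2."""
    best = C  # K = 1: a single flat list of C items
    for K in range(2, C):  # K <= C - 1, since every c_k >= 2
        for cs in product(range(2, C + 1), repeat=K):
            if cs[-1] + sum(c - 1 for c in cs[:-1]) <= C:
                best = max(best, prod(cs))
    return best

assert all(max_items(C) == 2 ** (C - 1) for C in range(2, 8))
print([max_items(C) for C in range(2, 8)])  # [2, 4, 8, 16, 32, 64]
```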

C. RNN simulations

Activity traces of all the dynamical variables in Eqs. (3)-(8) are shown in Fig. S1. In particular, the synaptic matrix Jµν before and after chunking in Fig. 1(d) is shown for comparison. All simulation parameters are reported in Table I. All the external inputs Ie used for loading the memories are rectangular functions with support only at the presentation time and an amplitude of 750 Hz, and the background input Ib has an amplitude of |Ib| = 10 Hz. The timing of the external control signals is summarized below; a sketch of the loading pulses follows the list.

Fig. 1(d) top panel: Stimuli start to load at t = 1 s, each for a duration of 0.025 s, with an interval of 0.45 s. Background input Ib has a constant value of 10 Hz.

Fig. 1(d) bottom panel: Stimuli start to load at t = 1 s, each for a duration of 0.025 s, with an interval of 0.45 s. Chunking clusters are loaded for a duration of 0.025 s with an interval of 0.3 s. Background input Ib has a constant value of 10 Hz during the presentation stage and switches between 10 Hz and −10 Hz for a duration of 1.35 s during the retrieval stage.

Fig. 2(b): The k = 3 level stimulus clusters start to load at t = 1 s, each for a duration of 0.15 s, with an interval of 0.45 s. The k = 1, 2 level chunking clusters load for a duration of 0.01 s with an interval of 0.2 s. Background input Ib has a constant value of 10 Hz during the presentation stage and switches between 10 Hz and −10 Hz for a duration of 0.8 s during the retrieval stage.
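
For illustration, the rectangular loading inputs quoted above can be generated as follows (a sketch in which the interval is taken to be onset-to-onset and the cluster indexing is our own):

```python
def rect_pulse(t, onset, duration, amplitude=750.0):
    """External loading input Ie(t): a rectangular pulse of the given amplitude (Hz)."""
    return amplitude if onset <= t < onset + duration else 0.0

# Fig. 1(d) top panel: six stimuli loaded from t = 1 s,
# each for 0.025 s, with onsets spaced 0.45 s apart
onsets = [1.0 + 0.45 * i for i in range(6)]

def I_e(t):
    """Loading input to each of the six stimulus clusters at time t (s)."""
    return [rect_pulse(t, onset, 0.025) for onset in onsets]
```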

D. Additional simulations

Eq. (8) assumes that chunking clusters can quickly bind with the stimulus clusters from the same chunk. For such binding to be selective, the synapses of the stimulus clusters need to maintain a memory trace of their past activities. In this section, we attempt to provide a possible mechanism. We assume that there is a time-delayed Hebbian-like strengthening of the inhibitory synapses from the chunking clusters to the stimulus clusters. Such strengthening integrates back in time over a window τs (τf ≪ τs ≪ τA) for stimulus clusters that were presented before the activation of the chunking cluster, and strengthens the originally present but weak synapses between them. Given a stimulus cluster i presented within τs before the chunking cluster m, the strength of the inhibitory synapse Jim between them gets strengthened according to

dJim/dt = (Jmin − Jim)/τJ + κJ (Jmax − Jim) σ(R̄i − θ0) Rm,    (15)

where R̄i represents a time-delayed synaptic trace that maintains a memory of cluster i's activity over a window τs, σ(·) is a non-linear function chosen to be the same as the gain function for the firing rates, and θ0 is a threshold that filters out reactivations.

We expect Eq. (15) to work in the regime where the external input to the network during presentation is much stronger than the subsequent reactivations, which is typically the case. Here, the reactivations are filtered out so that they do not contribute to the binding process and form cross-linking between different chunks. Eq. (15) only strengthens the binding between the chunking cluster m and the stimulus cluster i that were presented within the τs time window, but not the stimuli that were presented outside of τs but reactivate during τs, which have much weaker amplitudes. As a result, the time-delayed augmentation effectively binds the chunking cluster with the stimulus clusters presented before it within τs. Time-delayed synapses were first introduced in the context of memory sequences [59–61], and are found to be related to behavioral time scale synaptic plasticity through dendritic computation [56, 57, 62].
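
A minimal sketch of this update, following our reconstruction of Eq. (15), is given below; the definition of the delayed trace and all the constants are illustrative and would need to be calibrated against the presentation protocol, as discussed in the next paragraph.

```python
import numpy as np

# Illustrative placeholder constants (cf. the values quoted below)
tau_J, tau_s = 75.0, 1.8     # synaptic decay and integration window (s)
J_min, J_max, kappa_J = 0.0, 10.0, 1.0
theta0, alpha, dt = 100.0, 1.5, 1e-3  # threshold must be calibrated to the trace normalization

def sigma(h):
    """Non-linearity, chosen with the same form as the firing-rate gain function."""
    return alpha * np.logaddexp(0.0, h / alpha)

def step_binding(J_im, trace_i, R_i, R_m):
    """One Euler step of the time-delayed Hebbian strengthening (Eq. (15)).
    trace_i integrates the stimulus cluster's past activity over tau_s;
    sigma(trace_i - theta0) gates out the weaker reactivations."""
    trace_i = trace_i + dt * (-trace_i / tau_s + R_i)
    dJ = (J_min - J_im) / tau_J + kappa_J * (J_max - J_im) * sigma(trace_i - theta0) * R_m
    return J_im + dt * dJ, trace_i
```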

As a potential detailed mechanism for Eq. (8), we perform additional RNN simulations with Eq. (15). We find that Eqs. (3)-(7) combined with Eq. (15), instead of Eq. (8), are able to approximate the activity traces in Fig. 1(d) and Fig. 2(b) (see Fig. S2). However, this requires fine-tuning between the presentation time and the integration window τs, as well as the threshold θ0. We report the additional parameters used in Eq. (15) below.

Parameters that are independent of the presentation times: τJ = 75 s, Jmin = 0, Jmax = 10, κJ = 1 Hz. Parameters that depend on the presentation times: Fig. S2(a)-(b): τs = 1.8 s and θ0 = 7000 Hz. Fig. S2(c): the threshold θ0 is chosen to be proportional to the duration of the loading time with the external input: θ0 = 25600 Hz for J^{(2)(3)} and J^{(1)(3)}, but reduced by a factor of five for J^{(1)(2)}, where we use J^{(k)(l)} to denote the synaptic matrix components that correspond to the inhibition from level k to level l. The integration window τs is chosen to be shorter for adjacent levels than for skip levels: τs = 1.9 s for adjacent levels (k = 1 to k = 2 and k = 2 to k = 3) and τs = 3.1 s for the skip level (k = 1 to k = 3).

E. Cognitive boundary neurons

Two types of boundary neurons are reported in [41]: neurons that code for soft boundaries (change of camera position after the cut) and neurons that code for hard boundaries (change of movie content after the cut). In the present study, we do not distinguish between the two types and classify both as boundary neurons. In Fig. 3, we pool together the raw firing rates of all the boundary (or non-boundary) neurons from a subject, then perform the z-score averaging across different subjects. We excluded four of the eighteen subjects in [41] from our analysis because, in those subjects, either no neurons responding to the onset of the movie or no neurons responding to the cut were detected. The z-scored firing rates of non-boundary neurons from individual subjects are shown in Fig. S3. The data analyzed in Fig. 3 were downloaded from the DANDI Archive at https://dandiarchive.org/dandiset/000207/0.220216.0323.

Full activity trace of the bottom panel in Fig. 1(d).

(a) Activity traces of all variables. From top to bottom: firing rates Rµ, background input currents, release probability uµ, fraction of available neurotransmitters xµ, amplitude of the recurrent strength Aµ, and effective recurrent self-connection strength Jµ^Self. (b) Snapshots of the synaptic matrix Jµν before and after chunking. Clusters 1-14 are stimulus clusters and 15-16 are chunking clusters. At t = 2 s, only the first chunk (the blue colors) is presented, chunking clusters are not activated, and only the recurrent self-connections are nonzero in the synaptic matrix. At t = 4.5 s, both chunks are formed by the chunking connections (dark blue colors in the top right corner).

Additional RNN simulations with delayed Hebbian plasticity.

(a) Approximating the chunking dynamics in Fig. 1(d) using Eq. (15) instead of Eq. (8). Top: activity traces of the firing rates. Bottom: activity traces of the inhibitory connections from chunking clusters to stimulus clusters, J_SC. (b) Snapshot of the synaptic matrix after chunking, resulting from the dynamics described in Eq. (15). (c) Approximating the chunking dynamics in Fig. 2(b) using Eq. (15) instead of Eq. (8). Synaptic matrix components that correspond to the inhibition from level k to level l are collectively denoted as J^{(k)(l)}. First three panels: firing rate activity traces of the clusters in Fig. 2(a). Fourth and fifth panels: inhibitory connections between adjacent levels, J^{(1)(2)} and J^{(2)(3)}, and inhibitory connections between skip levels, J^{(1)(3)}, resulting from the dynamics described in Eq. (15).

Individual 2D plots of Fig. 3(b).

Individual subjects’ z-score firing rates of the non-boundary neurons are shown in blue, with one standard deviation included as shades. Black dashed lines denote t = 0 s where the movie cut occurs. Red dashed lines denote the location of the maximum firing rate of the boundary neurons. Results are pooled from the raw firing rates of all non-boundary neurons from that subject. Subject IDs are presented according to the data in [41]. While some subjects do not exhibit the qualitative trend as predicted (e.g., the firing rate of subject P64CS does not have a ramp, and TWH120 does not have a dip), most of the subjects’ firing rates follow the same qualitative trend as observed in the average plot in Fig. 3(a).
