Researchers at Sakana AI, a Tokyo-based AI startup, have introduced a novel memory optimization system that enhances the efficiency of Transformer-based models, including large language models (LLMs).
The method, called Neural Attention Memory Models (NAMMs), with full training code available on GitHub, reduces memory usage by up to 75% while improving overall performance. By focusing on essential tokens and removing redundant information, NAMMs address one of the most resource-intensive challenges in modern AI: managing long context windows.
Transformer models, the backbone of LLMs, rely on “context windows” to process input data. These context windows store “key-value pairs” (KV cache) for every token in the input sequence.
As the window length grows—now reaching hundreds of thousands of tokens—the computational cost skyrockets. Prior solutions attempted to reduce this cost through manual token pruning or heuristic strategies but often degraded performance. NAMMs, however, use neural networks trained through evolutionary optimization to automate and refine the memory management process.
Memory Optimization with NAMMs
NAMMs analyze the attention values generated by Transformers to determine token importance. They process these values into spectrograms—frequency-based representations commonly used in audio and signal processing—to compress and extract key features of the attention patterns.
This information is then passed through a lightweight neural network that assigns a score to each token, deciding whether it should be retained or discarded.
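To make the idea concrete, here is a minimal Python sketch of that retention step, assuming per-token scores have already been produced by the scoring network (function and variable names are illustrative, not Sakana AI’s code):

```python
import numpy as np

def prune_kv_cache(keys, values, scores, threshold=0.0):
    """Keep only the KV-cache entries whose score clears the threshold.

    keys, values: arrays of shape (num_tokens, head_dim)
    scores:       one scalar per token (higher = more worth remembering)
    """
    keep = scores > threshold  # binary "remember" / "forget" decision
    return keys[keep], values[keep]

# Toy usage: a 6-token cache in which half the tokens score below the cutoff.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
scores = np.array([1.2, -0.5, 0.3, -1.1, 0.8, -0.2])
pruned_keys, pruned_values = prune_kv_cache(keys, values, scores)
print(pruned_keys.shape)  # (3, 4): the cache shrinks to the retained tokens
```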
Sakana AI highlights how evolutionary algorithms drive NAMMs’ success. Unlike traditional gradient-based methods, which are incompatible with binary decisions like “remember” or “forget,” evolutionary optimization iteratively tests and refines memory strategies to maximize downstream performance.
“Evolution inherently overcomes the non-differentiability of our memory management operations, which involve binary ‘remember’ or ‘forget’ outcomes,” the researchers explain.
Proven Results Across Benchmarks
To validate the performance and efficiency of Neural Attention Memory Models (NAMMs), Sakana AI conducted extensive testing on multiple industry-leading benchmarks designed to assess long-context processing and multi-task capabilities. The results underscored NAMMs’ ability to significantly improve performance while reducing memory requirements, proving their effectiveness across diverse evaluation frameworks.
On LongBench, a benchmark specifically created to measure the performance of models on long-context tasks, NAMMs achieved an 11% improvement in accuracy compared to the full-context baseline model. This improvement was achieved while reducing memory usage by 75%, highlighting the method’s efficiency in managing the key-value (KV) cache.
By intelligently pruning less relevant tokens, NAMMs allowed the model to focus on critical context without sacrificing results, making it ideal for scenarios requiring extended inputs, such as document analysis or long-form question-answering.
For InfiniteBench, a benchmark that pushes models to their limits with extremely long sequences—some exceeding 200,000 tokens—NAMMs demonstrated their ability to scale effectively.
While baseline models struggled with the computational demands of such lengthy inputs, NAMMs achieved a dramatic performance boost, increasing accuracy from 1.05% to 11.00%.
This result is particularly notable because it showcases NAMMs’ capacity to handle ultra-long contexts, a capability increasingly essential for applications like processing scientific literature, legal documents, or large code repositories where token input sizes are immense.
On Sakana AI’s own ChouBun benchmark, which evaluates long-context reasoning for Japanese-language tasks, NAMMs delivered a 15% improvement over the baseline. ChouBun addresses a gap in existing benchmarks, which tend to focus on English and Chinese languages, by testing models on extended Japanese text inputs.
The success of NAMMs on ChouBun highlights their versatility across languages and proves their robustness in handling non-English inputs—a key feature for global AI applications. NAMMs were able to efficiently retain context-specific content while discarding grammatical redundancies and less meaningful tokens, enabling the model to perform more effectively on tasks such as long-form summarization and comprehension in Japanese.
The results collectively demonstrate that NAMMs excel at optimizing memory usage without compromising accuracy. Whether evaluated on tasks requiring extremely long sequences or across non-English language contexts, NAMMs consistently outperform baseline models, achieving both computational efficiency and improved results.
This combination of memory savings and accuracy gains positions NAMMs as a significant advance for enterprise AI systems tasked with handling vast and complex inputs.
The results are particularly noteworthy compared to prior methods like H₂O and L2, which sacrificed performance for efficiency. NAMMs, on the other hand, achieve both.
“Our results demonstrate that NAMMs successfully provide consistent improvements across both performance and efficiency axes relative to baseline Transformers,” the researchers state.
Cross-Modal Applications: Beyond Language
One of the most remarkable findings is NAMMs’ ability to transfer zero-shot across different tasks and input modalities, well beyond traditional language-based applications.
Unlike other memory optimization methods, which often require retraining or fine-tuning for each domain, NAMMs maintain their efficiency and performance benefits with no additional adjustments. Sakana AI’s experiments showcased this versatility in two key domains: computer vision and reinforcement learning, both of which present unique challenges for Transformer-based models.
In computer vision, NAMMs were evaluated using the LLaVA-NeXT-Video model, a Transformer designed for processing long video sequences. Videos inherently contain vast amounts of redundant data, such as repeated frames or minor variations that provide little additional information.
NAMMs automatically identified and discarded these redundant frames during inference, effectively compressing the context window without compromising the model’s ability to interpret the video content.
For instance, NAMMs retained frames with key visual details—such as action changes, object interactions, or critical events—while removing repetitive or static frames. This resulted in improved processing efficiency, allowing the model to focus on the most relevant visual elements, thus maintaining accuracy while reducing computational costs.
In reinforcement learning, NAMMs were applied to the Decision Transformer, a model designed to process sequences of actions, observations, and rewards to optimize decision-making tasks. Reinforcement learning tasks often involve long sequences of inputs with varying levels of relevance, where suboptimal or redundant actions can hinder performance.
NAMMs addressed this challenge by selectively removing tokens corresponding to inefficient actions and low-value information while retaining those critical to achieving better outcomes.
For example, in tasks like Hopper and Walker2d—which involve controlling virtual agents in continuous motion—NAMMs improved performance by over 9%. By filtering out suboptimal movements or unnecessary details, the Decision Transformer achieved more efficient and effective learning, focusing its computational power on decisions that maximized success in the task.
These results highlight NAMMs’ adaptability across vastly different domains. Whether processing video frames in vision models or optimizing action sequences in reinforcement learning, NAMMs demonstrated their ability to enhance performance, reduce resource usage, and maintain model accuracy—all without retraining.
The paper notes that NAMMs learn to forget almost exclusively parts of redundant video frames, rather than the language tokens describing the final prompt, further highlighting their adaptability.
Technical Underpinnings of NAMMs
The efficiency and effectiveness of Neural Attention Memory Models (NAMMs) lie in their streamlined and systematic execution process, which enables precise token pruning without manual intervention. This process is built on three core components: attention spectrograms, feature compression, and automated scoring.
NAMMs dynamically adjust their behavior depending on task requirements and Transformer layer depth. Early layers prioritize “global” context like task descriptions, while deeper layers retain “local” task-specific details. In coding tasks, for instance, NAMMs discarded comments and boilerplate code; in natural language tasks, they eliminated grammatical redundancies while retaining key content.
This adaptive token retention ensures that models remain focused on relevant information throughout processing, improving speed and accuracy.
The first step involves generating Attention Spectrograms. Transformers calculate “attention values” at every layer to determine the relative importance of each token within the context window. NAMMs transform these attention values into frequency-based representations using the Short-Time Fourier Transform (STFT).
STFT is a widely used signal processing technique that breaks down a sequence into localized frequency components over time, providing a compact yet detailed representation of token importance. By applying STFT, NAMMs convert raw attention sequences into spectrogram-like data, enabling a clearer analysis of which tokens contribute meaningfully to the model’s output.
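As a rough illustration, the toy snippet below sketches this step under simplifying assumptions: it takes a single token’s attention history and applies a hand-rolled windowed FFT rather than the exact STFT configuration used in the paper (all names are illustrative):

```python
import numpy as np

def attention_spectrogram(attn_series, win=8, hop=4):
    """Toy STFT over one token's attention history.

    attn_series: 1-D array of the attention a token received at successive steps.
    Returns a (num_windows, win // 2 + 1) magnitude spectrogram.
    """
    window = np.hanning(win)
    frames = []
    for start in range(0, len(attn_series) - win + 1, hop):
        segment = attn_series[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(segment)))  # magnitude of local frequency content
    return np.stack(frames)

# Example: a token whose attention oscillates produces energy away from the DC bin.
t = np.arange(64)
spec = attention_spectrogram(0.5 + 0.5 * np.sin(0.4 * t))
print(spec.shape)  # (15, 5)
```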
Next, Feature Compression is applied to reduce the dimensionality of the spectrogram data while preserving its essential characteristics. This is achieved using an exponential moving average (EMA), a mathematical method that compresses historical attention patterns into a compact, fixed-size summary. EMA ensures that the representations remain lightweight and manageable, allowing NAMMs to analyze long attention sequences efficiently while minimizing computational overhead.
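A minimal sketch of such an EMA summary, continuing the toy spectrogram above (again illustrative rather than the paper’s implementation):

```python
import numpy as np

def ema_compress(spectrogram_frames, decay=0.9):
    """Collapse a variable-length sequence of spectrogram frames into a
    fixed-size summary vector using an exponential moving average."""
    summary = np.zeros(spectrogram_frames.shape[1])
    for frame in spectrogram_frames:
        # Old evidence decays geometrically while new evidence is folded in.
        summary = decay * summary + (1.0 - decay) * frame
    return summary

# features = ema_compress(spec)  # shape (win // 2 + 1,), regardless of sequence length
```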
The final step is Scoring and Pruning, where NAMMs use a lightweight neural network classifier to evaluate the compressed token representations and assign scores based on their importance. Tokens with scores below a defined threshold are pruned from the context window, effectively “forgetting” unhelpful or redundant details. This scoring mechanism enables NAMMs to prioritize critical tokens that contribute to the model’s decision-making process while discarding less relevant data.
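The scoring stage can be pictured as a small feed-forward network followed by a threshold. The sketch below is a toy stand-in, not the trained classifier described in the paper:

```python
import numpy as np

class TokenScorer:
    """Tiny network mapping a compressed token feature vector to a scalar score."""
    def __init__(self, dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(dim, hidden))
        self.w2 = rng.normal(scale=0.1, size=(hidden, 1))

    def score(self, features):
        h = np.tanh(features @ self.w1)   # (num_tokens, hidden)
        return (h @ self.w2).squeeze(-1)  # one score per token

def select_tokens(features, scorer, threshold=0.0):
    return scorer.score(features) > threshold  # boolean "remember" mask

# features: (num_tokens, dim) compressed representations from the EMA step
features = np.random.default_rng(1).normal(size=(10, 5))
mask = select_tokens(features, TokenScorer(dim=5))
print(int(mask.sum()), "of", len(mask), "tokens retained")
```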
What makes NAMMs particularly effective is their reliance on evolutionary optimization to refine this process. Traditional optimization methods like gradient descent struggle with non-differentiable tasks—such as deciding whether a token should be retained or discarded.
Instead, NAMMs use an iterative evolutionary algorithm, inspired by natural selection, to “mutate” and “select” the most efficient memory management strategies over time. Through repeated trials, the system evolves to prioritize essential tokens automatically, achieving a balance between performance and memory efficiency without requiring manual fine-tuning.
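The paper’s exact evolution strategy is not reproduced here, but a generic mutate-and-select loop conveys the idea: candidate memory-model parameters are perturbed, scored on downstream task performance, and the best survive. In the hypothetical sketch below, `fitness` stands in for running the frozen Transformer with a candidate memory model and measuring task accuracy:

```python
import numpy as np

def evolve(fitness, dim, population=16, generations=50, sigma=0.05, seed=0):
    """Generic mutate-and-select loop for a non-differentiable objective."""
    rng = np.random.default_rng(seed)
    best = rng.normal(scale=0.1, size=dim)
    best_fit = fitness(best)
    for _ in range(generations):
        # Perturb the current best parameters and keep whichever candidate wins.
        candidates = best + sigma * rng.normal(size=(population, dim))
        fits = np.array([fitness(c) for c in candidates])
        if fits.max() > best_fit:
            best, best_fit = candidates[fits.argmax()], fits.max()
    return best

# Toy stand-in objective: prefer parameters close to a hidden target vector.
target = np.linspace(-1.0, 1.0, 8)
theta = evolve(lambda p: -np.sum((p - target) ** 2), dim=8)
```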
This streamlined execution—combining spectrogram-based token analysis, efficient compression, and automated pruning—allows NAMMs to deliver both significant memory savings and performance gains across diverse Transformer-based tasks. By reducing computational requirements while maintaining or improving accuracy, NAMMs set a new benchmark for efficient memory management in modern AI models.
What Comes Next for Transformers?
Sakana AI believes NAMMs are only the beginning. While current work focuses on optimizing pre-trained models at inference, future research may integrate NAMMs into the training process itself. This could enable models to learn memory management strategies natively, further extending the length of context windows and boosting efficiency across domains.
“This work has only begun to explore the design space of our memory models, which we anticipate may offer many new opportunities to advance future generations of transformers,” the team concludes.
NAMMs’ proven ability to scale performance, reduce costs, and adapt across modalities sets a new standard for the efficiency of large-scale AI models.