tilde-research open-sources Wall Attention for long-context reasoning

tilde-research releases Wall Attention kernels for long-context models

tilde-research has published Wall Attention, an open-source attention variant designed to help models handle long context more efficiently. The release includes code for both training and decoding, along with tests and documentation in a GitHub repository.

At the core of the project is a new way of computing attention scores. Instead of applying the same decay or positional effect across the whole hidden dimension, Wall Attention assigns each channel its own learned multiplicative decay across time. The idea is to give every query channel an independent forgetting rate that depends on content, while still preserving the basic softmax attention framework. According to the project documentation, setting the decay term to zero reduces the method to standard attention.
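As a rough illustration of the idea, the following pure-Python sketch computes causal attention with a per-channel decay. It assumes one plausible formulation that the article does not confirm: each step r carries a non-positive log-decay g[r][c] for channel c, and the contribution of channel c to the score between query t and key s is multiplied by the exponential of the summed log-decays over steps s+1 through t. The function names and the 1/sqrt(D) scaling are illustrative choices, not the project's API.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def decayed_scores(q, k, g):
    """Causal score rows under an ASSUMED per-channel decay formulation.

    q, k, g are lists of T vectors of length D. g[r][c] <= 0 is a
    hypothetical log-decay applied at step r on channel c.
    """
    T, D = len(q), len(q[0])
    scores = []
    for t in range(T):
        row = []
        for s in range(t + 1):
            acc = 0.0
            for c in range(D):
                # Decay accumulated between key position s and query position t.
                log_decay = sum(g[r][c] for r in range(s + 1, t + 1))
                acc += q[t][c] * k[s][c] * math.exp(log_decay)
            row.append(acc / math.sqrt(D))
        scores.append(row)
    return scores

def attend(q, k, v, g):
    # Softmax over each causal row, then a weighted sum of values.
    out = []
    for row in decayed_scores(q, k, g):
        w = softmax(row)
        out.append([sum(w[s] * v[s][c] for s in range(len(w)))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.5], [0.2, -1.0], [0.7, 0.3]]
k = [[0.3, 1.0], [-0.4, 0.8], [0.9, 0.1]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
zero_decay = [[0.0, 0.0] for _ in q]    # reduces to plain scaled dot products
fast_slow = [[-0.1, -2.0] for _ in q]   # channel 1 forgets much faster than channel 0

plain = decayed_scores(q, k, zero_decay)
decayed = decayed_scores(q, k, fast_slow)
```

With all log-decays set to zero, every exponential factor is 1 and the scores are ordinary scaled dot products, which matches the documentation's statement that a zero decay term recovers standard attention. The softmax step is untouched either way.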

The repository packages two main kernels. The first is a fused Triton kernel for training and prefill, where the model processes a full prompt; it computes both the forward and backward passes. The second is a decode kernel for step-by-step generation. It works with a pre-rescaled key-value cache, so the model does not need to recompute the full prefix at every step of autoregressive inference.
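One way to see why a pre-rescaled cache can avoid prefix recomputation is sketched below. It assumes the decay between key s and query t factors as exp(G_t - G_s), where G is a running per-channel sum of log-decays, so each key can be stored once as k * exp(-G_s) and each new query only needs to be scaled by exp(G_t). The DecodeCache class and its methods are hypothetical and are not the repository's decode API.

```python
import math

class DecodeCache:
    """Hypothetical decode-time cache holding pre-rescaled keys (an assumption)."""

    def __init__(self, dim):
        self.dim = dim
        self.cum = [0.0] * dim   # running cumulative log-decay G_t per channel
        self.keys = []           # keys stored as k_s * exp(-G_s)
        self.values = []

    def append(self, k, v, g):
        # Advance the cumulative decay by this step's log-decay, then cache.
        self.cum = [G + gc for G, gc in zip(self.cum, g)]
        self.keys.append([kc * math.exp(-G) for kc, G in zip(k, self.cum)])
        self.values.append(list(v))

    def scores(self, q):
        # Scale the query once; each cached key then needs only a dot product.
        q_scaled = [qc * math.exp(G) for qc, G in zip(q, self.cum)]
        scale = 1.0 / math.sqrt(self.dim)
        return [scale * sum(a * b for a, b in zip(q_scaled, ks))
                for ks in self.keys]

    def attend(self, q):
        # Stable softmax over the cached positions, then a weighted value sum.
        s = self.scores(q)
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        return [sum(wi * vs[c] for wi, vs in zip(w, self.values)) / z
                for c in range(len(self.values[0]))]
```

In this sketch each generated token costs one pass over the cached keys and values. The unanchored exp(-G_s) factor grows without bound as the sequence lengthens, so it is only suitable for short sequences.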

The project’s README says the decode path is intended to make per-token generation cheaper by using a small pass over cached data. It also describes a chunked caching scheme that keeps the decay values numerically stable over long sequences by anchoring them per chunk. The release notes say this design is meant to support long-context reasoning while controlling the cost of inference.
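The numerical problem that per-chunk anchoring addresses can be shown with a small sketch. It assumes keys would otherwise be rescaled by exp(-G_s), where G_s is a cumulative log-decay that keeps falling as the sequence grows, and it assumes the anchor is the cumulative value at the first position of each chunk. The chunk length, anchor choice, and function names are illustrative guesses rather than details taken from the README.

```python
import math

CHUNK = 4  # hypothetical chunk length, chosen only for illustration

def naive_key_scale(cum_at_s):
    # Unanchored rescale: grows without bound as cum_at_s becomes more negative.
    return math.exp(-cum_at_s)

def anchored_scales(cum, s, t, chunk=CHUNK):
    """Return (key_scale, query_scale) for key s and query t >= s.

    cum[i] is the cumulative log-decay through position i and is assumed
    non-increasing. The anchor is the value at the start of s's chunk.
    """
    anchor = cum[(s // chunk) * chunk]
    key_scale = math.exp(anchor - cum[s])     # limited by decay inside one chunk
    query_scale = math.exp(cum[t] - anchor)   # at most 1 because cum is non-increasing
    return key_scale, query_scale

# A constant log-decay of -1.0 per step over 2000 positions.
cum = [-(i + 1.0) for i in range(2000)]

try:
    naive_key_scale(cum[1500])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

key_scale, query_scale = anchored_scales(cum, s=1498, t=1500)
```

The product of the two anchored factors equals exp(G_t - G_s), the same decay as the unanchored form, but the stored key factor never exceeds what can accumulate within a single chunk. The trade-off in this sketch is that the query has to be rescaled once per chunk of keys rather than once overall.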

What the release includes

The codebase is organized into a public API, training kernels, a decode implementation, and a PyTorch reference version for correctness checks. The documentation also points to a blog post for more background on the method.

Wall Attention supports grouped-query attention, where the number of query heads can exceed the number of key-value heads. It also includes optional features such as a scalar gate, an attention sink bias, a sliding window, and variable-length sequence packing. The project says the kernels accept bf16 and fp32 inputs and use autotuned block sizes on supported NVIDIA GPU architectures.
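Two of these features are simple to describe in code. The sketch below shows the usual grouped-query mapping from a query head to the key-value head it shares, and a causal sliding-window range. Both helpers are hypothetical, and the window convention (the current position counts toward the window) is an assumption that may differ from the project's.

```python
def kv_head_for(query_head, num_q_heads, num_kv_heads):
    # Consecutive query heads share one key-value head.
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return query_head // group_size

def visible_keys(t, window=None):
    # Causal positions a query at position t may attend to.
    # ASSUMPTION: a window of w covers positions t - w + 1 through t.
    start = 0 if window is None else max(0, t - window + 1)
    return list(range(start, t + 1))

# Eight query heads sharing two key-value heads: heads 0-3 use KV head 0.
mapping = [kv_head_for(h, 8, 2) for h in range(8)]
recent = visible_keys(5, window=3)
```

Grouped-query attention shrinks the key-value cache by the ratio of query heads to key-value heads, which is one reason it often appears alongside long-context decoding.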

The repository is relatively compact but appears to be built for practical use. It includes installation instructions, example code for both training and cached generation, and a test suite. The tests are described as checking kernel outputs against an eager reference implementation, as well as validating gradients with finite-difference checks. The decode kernel is also tested against the training forward path in a streaming generation setup.
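A finite-difference gradient check compares a hand-derived or kernel-computed gradient against a numerical estimate. The sketch below applies the technique to a toy function, a weighted sum of softmax outputs, rather than to the project's kernels; the function names and tolerance are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_softmax(x, c):
    # Scalar toy objective: sum_i c_i * softmax(x)_i.
    return sum(ci * pi for ci, pi in zip(c, softmax(x)))

def analytic_grad(x, c):
    # d/dx_j of the objective is p_j * (c_j - sum_i c_i * p_i).
    p = softmax(x)
    mean = sum(ci * pi for ci, pi in zip(c, p))
    return [pj * (cj - mean) for pj, cj in zip(p, c)]

def numeric_grad(fn, x, eps=1e-5):
    # Central differences: perturb one coordinate at a time.
    grad = []
    for j in range(len(x)):
        hi = x[:j] + [x[j] + eps] + x[j + 1:]
        lo = x[:j] + [x[j] - eps] + x[j + 1:]
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad

x = [0.3, -1.2, 0.8, 0.0]
c = [1.0, -0.5, 2.0, 0.25]
exact = analytic_grad(x, c)
approx = numeric_grad(lambda z: weighted_softmax(z, c), x)
max_error = max(abs(a - b) for a, b in zip(exact, approx))
```

A small max_error suggests the analytic gradient is consistent with the function it claims to differentiate. Checks like this are typically run in higher precision, since low-precision arithmetic can swamp the finite-difference estimate.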

The latest commit listed in the repository history was made on June 3, 2026, and is described as an improvement to kernel stability. The commit before it packaged the kernels for release.

Broader significance

Attention mechanisms remain central to large language models, but long-context workloads continue to create pressure on memory and compute. By introducing per-channel decay and a cache strategy aimed at avoiding repeated prefix work, tilde-research is targeting one of the main bottlenecks in long-sequence inference.

The project builds on prior work from flash-linear-attention, which the maintainers acknowledge as an influence on the parallel-attention machinery used in the kernels. The release is published under the MIT license.

For researchers and engineers working on sequence modeling, the new repository offers a concrete implementation to inspect, benchmark, and potentially adapt. The larger question is whether the method will translate into meaningful gains across real-world long-context tasks. The release provides the tooling, but adoption will depend on performance, stability, and compatibility with existing model architectures.