FlashMemory opens its DeepSeek-V4 retriever to the public

FlashMemory has released an open-source retriever designed to reduce the memory footprint of DeepSeek-V4’s compressed-sparse-attention key-value cache. The project, published on GitHub under the repository FlashMemory-Deepseek-V4, is meant to decide which cache chunks should stay on the GPU during decoding and which can be moved off device.

The retriever is described as a lightweight model that looks at the hidden state of a decode token and predicts which cache chunks roughly the next 64 tokens are likely to use. Based on that prediction, it keeps only the most relevant chunks resident on the GPU. The rest can be shifted to CPU or disk storage, lowering on-device memory requirements.
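The selection step can be pictured as a simple top-k cut over per-chunk scores. The sketch below is illustrative only and is not taken from the repository: the function name select_resident_chunks and the keep_fraction parameter are hypothetical, and the scores are made up.

```python
import math


def select_resident_chunks(scores, keep_fraction=0.125):
    """Return indices of the chunks to keep on the GPU.

    Hypothetical helper: keeps the highest-scoring fraction of chunks.
    Everything not returned would be a candidate for CPU or disk storage.
    """
    if not scores:
        return []
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Always keep at least one chunk, and round the budget up.
    budget = max(1, math.ceil(len(scores) * keep_fraction))
    # Rank chunk indices from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in cache order.
    return sorted(ranked[:budget])


# Made-up relevance scores for eight cache chunks.
example_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6]
resident = select_resident_chunks(example_scores, keep_fraction=0.25)
offloaded = [i for i in range(len(example_scores)) if i not in resident]
```

In a real serving stack the selection would presumably be refreshed as decoding advances, since each prediction is meant to cover only a window of upcoming tokens.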

According to the repository, the approach can preserve performance while holding only about 10% to 15% of the KV cache on device. In downstream evaluation, FlashMemory says the system matches or outperforms a full-attention baseline on several long-context tasks.

What the release includes

The open-source package includes the retriever model itself, a small demo with mock inputs, and a toy sparse-decode reference script that illustrates how the retrieval process fits into decoding. Model weights are also available on Hugging Face.

The repository clarifies that the published checkpoint contains retriever weights, not the full DeepSeek-V4 model. The code is intended as a standalone retriever release and a reference implementation, rather than a complete production inference stack.

The project uses three internal compressed-sparse-attention (CSA) layers, labeled l10, l12 and l20, each with its own weights. During inference, their scores are combined per chunk using either a max or a mean ensemble. The default is max, which effectively keeps the union of the chunks favored by the individual layers.
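A minimal sketch of the per-chunk ensemble follows. The layer labels come from the release, but the function name ensemble_scores, its signature and the example scores are assumptions made for illustration, not the repository's API.

```python
def ensemble_scores(per_layer_scores, method="max"):
    """Combine per-layer chunk scores into one score per chunk.

    per_layer_scores maps a layer label to a list of scores, one per chunk.
    Hypothetical helper; only the max and mean modes are modeled.
    """
    layers = list(per_layer_scores.values())
    if not layers:
        raise ValueError("need at least one layer")
    if len({len(layer) for layer in layers}) != 1:
        raise ValueError("all layers must score the same number of chunks")
    columns = zip(*layers)  # one tuple of layer scores per chunk
    if method == "max":
        return [max(col) for col in columns]
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown ensemble method: {method}")


# Made-up scores: l10 favors chunk 0, l12 favors chunk 1.
layer_scores = {
    "l10": [0.9, 0.1, 0.2],
    "l12": [0.1, 0.8, 0.2],
    "l20": [0.2, 0.1, 0.3],
}
combined_max = ensemble_scores(layer_scores, method="max")
combined_mean = ensemble_scores(layer_scores, method="mean")
```

With max, a chunk that any single layer rates highly keeps a high combined score, which is why that mode behaves like a union. With mean, the same chunk is pulled down by the layers that ignore it.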

How the retriever works

FlashMemory’s retriever processes a decode token’s hidden state, then projects it through several layers before comparing it with compressed key chunks. The model applies a scoring function that produces a value between 0 and 1 for each chunk, indicating how likely it is to matter for future attention.
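The article does not specify the exact scoring function, so the sketch below uses a sigmoid over a scaled dot product purely as a stand-in that yields values between 0 and 1. The learned projections, RoPE and Hadamard steps are omitted, and the names sigmoid and score_chunks are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def score_chunks(query, chunk_keys):
    """Score each chunk key against an already-projected query vector.

    Assumption: similarity is a scaled dot product squashed by a sigmoid.
    The real retriever's scoring head may differ.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = []
    for key in chunk_keys:
        if len(key) != len(query):
            raise ValueError("key and query dimensions must match")
        logit = scale * sum(q * k for q, k in zip(query, key))
        scores.append(sigmoid(logit))
    return scores


# Toy two-dimensional example: aligned, orthogonal and opposed keys.
toy_scores = score_chunks([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [-2.0, 0.0]])
```

An aligned key scores above 0.5, an orthogonal key scores exactly 0.5, and an opposed key scores below 0.5.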

The compressed cache format uses 132 bytes per chunk. The first 128 bytes store quantized key values, while the remaining four bytes store a scale factor used for dequantization. The repository describes the system as using fp8-style values and a dequantization step to recover usable key vectors.
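To make the 132-byte layout concrete, here is a hedged decoding sketch. Several details are assumptions not confirmed by the article: that the 128 key bytes are FP8 values in the E4M3 format (one per key dimension), that the 4-byte scale is a little-endian float32, and that dequantization is a plain multiply. The function names are hypothetical.

```python
import math
import struct

CHUNK_BYTES = 132  # total record size, from the release description
KEY_BYTES = 128    # quantized key values; the last 4 bytes hold the scale


def fp8_e4m3_to_float(byte):
    """Decode one FP8 E4M3 byte (assumed format: 1 sign, 4 exp, 3 mantissa, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN encoding in E4M3
    if exponent == 0:
        # Subnormal values have no implicit leading one.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_chunk(record):
    """Turn one 132-byte record into a list of 128 float key values."""
    if len(record) != CHUNK_BYTES:
        raise ValueError(f"expected {CHUNK_BYTES} bytes, got {len(record)}")
    # Assumption: the scale is a little-endian float32 in the last 4 bytes.
    (scale,) = struct.unpack_from("<f", record, KEY_BYTES)
    return [fp8_e4m3_to_float(b) * scale for b in record[:KEY_BYTES]]


# Build a synthetic record: every key byte encodes 1.0, with a scale of 0.5.
demo_record = bytes([0x38] * KEY_BYTES) + struct.pack("<f", 0.5)
demo_keys = dequantize_chunk(demo_record)
```

Under these assumptions each chunk costs 132 bytes instead of the 256 bytes that 128 values would take at 16-bit precision, before counting any additional metadata.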

The release also documents the architecture in some detail, including the use of RoPE, a Hadamard transform and a small ensemble across three layers. Hyperparameters such as head count, head dimension and rotary embedding settings are listed in the repository for reproducibility.

Long-context results and limitations

FlashMemory reports that the retriever performs well on long-context workloads, including RULER, LongMemEval and LongBench V2. The repository says some tasks show parity with the full-attention baseline, while others show small gains. It also notes that needle-in-a-haystack-style retrieval, which demands especially high precision, can require a threshold fallback in the serving layer.

The toy inference script is presented as a pedagogical example of the decode-time control flow. It does not implement the full production swap engine or real CPU-to-GPU cache movement. Instead, it simulates sparse attention by masking out unselected chunks.
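The masking idea can be sketched in a few lines. This is not the repository's script: it assumes one key and one value vector per chunk, uses plain softmax attention, and the name masked_attention is hypothetical.

```python
import math


def masked_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected chunk indices.

    Unselected chunks get a logit of negative infinity, so their
    attention weight is exactly zero, as if they were not on device.
    """
    keep = set(selected)
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) if i in keep else -math.inf
        for i, key in enumerate(keys)
    ]
    peak = max(logits)
    if peak == -math.inf:
        raise ValueError("at least one chunk must be selected")
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy data: chunk 2 is masked out even though its key matches the query.
toy_query = [1.0, 0.0]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_output, toy_weights = masked_attention(toy_query, toy_keys, toy_values, selected=[0, 1])
```

Because the masked chunk contributes nothing to the output, the result matches what attention would produce if that chunk had been offloaded, without moving any data.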

FlashMemory is distributing the project under an MIT license. The GitHub page also includes a citation request for researchers who use the work in their own projects.