MiniMax open-sources sparse attention kernels for million-token context workloads

MiniMax has published a new open-source project that brings sparse attention to long-context inference on NVIDIA SM100 GPUs. The repository, called MiniMax Sparse Attention, combines dense FlashAttention support with sparse top-k attention kernels designed to handle very long contexts more efficiently.

The release is centered on a Python package named fmha_sm100, which exposes two JIT-compiled stacks. One stack provides dense attention operations, including a FlashAttention-style path and a block selector for sparse layouts. The other uses CuTe-DSL to compile full sparse attention routines at runtime, covering both forward attention and paged decode paths. According to the project documentation, the code is intended for million-token-class contexts.
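To give a rough sense of scale at a million tokens, the back-of-the-envelope sketch below counts how many query-key pairs a causal dense pass scores, compared with a pass in which each query attends to a fixed number of key-value blocks. The block size of 128 and the top-k of 16 are illustrative assumptions, not values taken from the repository, and the count covers only the final sparse pass; it ignores the cost of the proxy pass and the selector.

```python
def dense_pairs(n_tokens):
    """Query-key pairs scored by causal dense attention: n * (n + 1) / 2."""
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens, block_size, top_k):
    """Upper bound on pairs scored when each query reads at most top_k blocks."""
    return n_tokens * min(top_k * block_size, n_tokens)


n = 1_000_000
dense = dense_pairs(n)
# block_size and top_k are hypothetical, chosen only for illustration.
sparse = sparse_pairs(n, block_size=128, top_k=16)
ratio = dense / sparse

print(f"dense pairs:  {dense:,}")
print(f"sparse pairs: {sparse:,}")
print(f"dense / sparse ratio: {ratio:.1f}")
```

Under these assumptions the dense pass scores a couple of hundred times more pairs than the sparse pass, which is the gap that block selection is meant to exploit.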

MiniMax describes the system as supporting a sparse prefill workflow. In that path, the model first runs a dense proxy attention pass to produce per-block maximum scores. Those scores are then fed into a top-k selector that chooses the most relevant key-value blocks, which are used in a second sparse attention pass. The repository also includes an adapter that can route the existing fmha_sm100 API into sparse attention functions when needed.
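The first two stages of that workflow can be illustrated with a minimal pure-Python sketch. It is not the fmha_sm100 API: the function names, the toy data, and the choice to score a single query without a causal mask are all assumptions made for illustration.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_max_scores(q, keys, block_size):
    """Proxy pass: for one query, the maximum q.k score inside each key block."""
    scores = [dot(q, k) for k in keys]
    return [max(scores[i:i + block_size])
            for i in range(0, len(scores), block_size)]


def select_top_k_blocks(block_scores, k):
    """Top-k selector: indices of the k highest-scoring blocks, ascending."""
    top = heapq.nlargest(k, range(len(block_scores)),
                         key=block_scores.__getitem__)
    return sorted(top)


# Toy data: 8 keys of dimension 2, grouped into 4 blocks of 2 keys each.
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.5],
        [-1.0, 0.0], [0.0, -1.0], [3.0, 0.0], [0.1, 0.1]]
query = [1.0, 0.5]

block_scores = block_max_scores(query, keys, block_size=2)
selected = select_top_k_blocks(block_scores, k=2)

print(block_scores)  # [1.0, 0.75, -0.5, 3.0]
print(selected)      # [0, 3]
```

Only the blocks listed in selected would be read by the second, sparse pass. A production kernel would compute the block maxima on the GPU without materializing the full score list, which this sketch does for clarity.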

The codebase is organized around a few main components. The csrc directory contains JIT-compiled kernels generated from templates at runtime, while the cute directory holds the CuTe-DSL implementation. The project also ships benchmarks, tests, scripts, and documentation, including a paper PDF. MiniMax says the work is released under the MIT license, with notices for any bundled third-party code.

The repository currently targets a narrow hardware and software stack. Requirements listed in the docs include NVIDIA SM100 GPUs, CUDA with nvcc, Python 3.10 or newer, and Linux on x86_64. The README notes that the kernels can also be accessed through Hugging Face's kernels library, which allows users to load the MiniMax kernel package directly.

Installation is available through a standard pip install after cloning the repository, and the project recommends pulling submodules recursively so the CUTLASS headers are present for compilation. Because the kernels are JIT-compiled, the first import or first test run may take longer than later executions. The documentation says a smoke test can take from about 30 seconds to several minutes the first time, depending on cache state.

MiniMax's example usage walks through the sparse path in two steps. First, a dense proxy pass computes per-block scores. Then a top-k selection step picks the most relevant blocks, and the sparse attention kernel runs on the chosen indices. The project also indicates support for multiple data types in its sparse attention stack, including BF16, FP8, NVFP4, and FP4 for certain paths.
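The final step, attention restricted to the chosen blocks, can be sketched in plain Python as follows. This is a conceptual illustration for a single query and a single head, not the project's kernel or its API; the helper names, the toy tensors, and the hard-coded block indices are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def sparse_attend(q, keys, values, block_indices, block_size):
    """Attention over only the key/value rows inside the selected blocks."""
    rows = [r for b in block_indices
            for r in range(b * block_size,
                           min((b + 1) * block_size, len(keys)))]
    return attend(q, [keys[r] for r in rows], [values[r] for r in rows])


# Toy data: 8 key/value rows, grouped into 4 blocks of 2 rows each.
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.5],
        [-1.0, 0.0], [0.0, -1.0], [3.0, 0.0], [0.1, 0.1]]
values = [[float(i), float(-i)] for i in range(8)]
query = [1.0, 0.5]

dense_out = attend(query, keys, values)
# Assume a selector has already picked blocks 0 and 3 for this query.
sparse_out = sparse_attend(query, keys, values, [0, 3], block_size=2)
full_out = sparse_attend(query, keys, values, [0, 1, 2, 3], block_size=2)

print(dense_out, sparse_out)
```

Selecting every block reproduces the dense result, while selecting a subset yields an approximation whose quality depends on how much attention weight falls inside the chosen blocks.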

The launch adds another open-source option for teams working on long-context model performance, an area that has become increasingly important as developers push models toward longer prompts and larger retrieval windows. While the project is tightly coupled to NVIDIA's latest SM100 hardware, it gives researchers and engineers a concrete implementation to study, test, and potentially adapt for sparse long-context workloads.