Google adds diffusion-style text generation to Gemma with DiffusionGemma

Google has unveiled DiffusionGemma, an experimental model that brings diffusion-style generation to the Gemma family and aims to speed up text output by changing how tokens are produced. The company says the model is built on Gemma 4 and is intended to help developers understand, serve, and customize a new approach to language generation.

According to Google, DiffusionGemma is designed to reduce one of the main constraints in traditional large language models: memory bandwidth. Rather than generating text one token at a time from left to right, the model works on a 256-token canvas in parallel. Google says this shifts the bottleneck toward compute and can deliver much faster token generation on GPUs, including more than 700 tokens per second on an NVIDIA GeForce RTX 5090 and more than 1,000 tokens per second on a single NVIDIA H100.

How the model works

The company describes DiffusionGemma as using bidirectional attention during generation, which allows the model to evaluate the entire canvas at once instead of only looking backward. Google says this setup gives the model a form of self-correction, since earlier outputs can be refined across multiple denoising passes. For longer outputs, the model uses a block autoregressive method. Once a 256-token block is fully refined, it is committed to the key-value cache and the model moves on to the next block.

Google says the architecture keeps some advantages of autoregressive models while adding the parallelism of diffusion. It also notes that the same structure as Gemma 4 26B A4B should make deployment simpler, since developers only need to add a denoising step when integrating it into serving frameworks.

The model is presented as a 26 billion parameter mixture-of-experts system that activates 3.8 billion parameters during inference. Google says that makes quantized deployment possible within 18 GB of VRAM.

Developer tools and deployment

Alongside the model announcement, Google published a developer guide and said it worked with the vLLM team to support DiffusionGemma in that serving framework. The company says the model can be deployed through vLLM's standard OpenAI-compatible local server.

Google is also making the experimental weights available on Hugging Face under the Apache 2.0 license. In addition, the company points developers to training recipes, documentation, and integration options across tools including Hugging Face Transformers, SGLang, MLX, Unsloth, and NVIDIA NeMo. Deployment options also include Google Cloud's Model Garden and NVIDIA NIM.

Sudoku as a proof of concept

To show how the model can be adapted, Google highlighted a Sudoku example. The company says traditional autoregressive models can struggle with puzzles that require global constraint checking, since they generate output sequentially. DiffusionGemma, by contrast, can examine the whole board during denoising.

Google said it released a fine-tuning recipe using Hackable Diffusion, a JAX-based research toolkit, and reported that a simple supervised fine-tuning setup improved Sudoku accuracy from roughly zero success with the base model to 80% while also reducing the number of inference steps.

The release signals Google's continued push to experiment with non-autoregressive text generation methods within the Gemma line. For developers, the pitch is not only faster output, but also a different way to handle long-context generation and tasks that benefit from global consistency.