Google unveils DiffusionGemma, an experimental open model aimed at faster text generation

Google has introduced DiffusionGemma, an experimental open model that it says can generate text up to four times faster than conventional token-by-token generation when run on dedicated GPUs. The company released the model under an Apache 2.0 license and positioned it as a research-focused system for developers who need low-latency, interactive local AI workflows.

A different approach to text generation

Unlike standard large language models that produce output one token at a time, DiffusionGemma uses a text diffusion method that drafts blocks of text in parallel. Google says that approach lets the model generate 256 tokens at once. Because sequential decoding has to read the model's weights from memory for each new token, it is typically limited by memory bandwidth rather than arithmetic; producing a whole block per pass helps shift the bottleneck to compute. In the company’s view, that makes better use of local hardware than sequential decoding when only a single user is making requests.

The model is a 26-billion-parameter mixture-of-experts system, but Google says only 3.8 billion parameters are active during inference. That design, combined with quantization, is meant to keep the model within the memory limits of higher-end consumer GPUs. Google says the model can run within 18GB of VRAM when quantized.
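To see why quantization matters for that 18GB figure, the weight storage can be estimated with simple arithmetic. The sketch below uses the parameter counts reported above; the bit widths are assumptions chosen for illustration, since the article does not say which quantization format Google used, and the estimate ignores activations, caches and runtime overhead.

```python
# Rough weight-memory estimate for a mixture-of-experts model.
# Parameter counts come from the article; the bit widths are assumptions.

TOTAL_PARAMS = 26e9    # every expert has to be resident in memory
ACTIVE_PARAMS = 3.8e9  # parameters actually used in one forward pass
GB = 1e9               # decimal gigabytes


def weight_gb(params: float, bits_per_param: float) -> float:
    """Return the storage needed for `params` weights at the given precision."""
    return params * bits_per_param / 8 / GB


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: total {weight_gb(TOTAL_PARAMS, bits):5.1f} GB, "
          f"active {weight_gb(ACTIVE_PARAMS, bits):4.1f} GB")
```

Under these assumptions, 16-bit weights alone would need about 52 GB, while 4-bit weights would need about 13 GB, which is consistent with a quantized model fitting within 18GB. The smaller active count reduces the work done per pass, not the amount of memory needed to hold the model.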

Google described performance figures that include more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090. The company also said it worked with NVIDIA to optimize the model across consumer and enterprise hardware, including systems based on Hopper and Blackwell architectures.
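For a sense of what those throughput figures mean for interactive use, the time to fill one 256-token block can be worked out directly. This is a back-of-the-envelope sketch that assumes the quoted rates are sustained end-to-end generation speeds; because Google reported them as "more than" figures, the results are upper bounds on latency.

```python
# Time to produce one 256-token block at the reported throughput floors.
# Assumes the quoted tokens-per-second figures are sustained generation rates.

BLOCK_TOKENS = 256
THROUGHPUT_FLOOR = {"NVIDIA H100": 1000, "NVIDIA GeForce RTX 5090": 700}


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Return how long it takes to generate `tokens` at a given rate."""
    return tokens / tokens_per_second


for gpu, rate in THROUGHPUT_FLOOR.items():
    ms = seconds_for(BLOCK_TOKENS, rate) * 1000
    print(f"{gpu}: at most about {ms:.0f} ms per {BLOCK_TOKENS}-token block")
```

On these assumptions a full block would arrive in roughly a quarter of a second on the H100 and in a little over a third of a second on the RTX 5090.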

Built for speed-critical workflows

Google said DiffusionGemma is intended for developers and researchers exploring use cases where responsiveness matters more than maximum output quality. Examples include inline editing, rapid iteration, code infilling, amino acid sequence generation and mathematical graph tasks. Because each forward pass considers the whole block of text, the model can also make use of bidirectional attention, which Google says can be helpful in non-linear text problems.

The company also emphasized that DiffusionGemma iteratively revises its own output. In practice, that means it starts with placeholder tokens and refines them over several passes until the text converges. Google said this can help with self-correction and with formatting-heavy outputs such as markdown, where the model may be better able to close structures cleanly.
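The general idea of starting from placeholders and filling them in over several passes can be illustrated with a toy loop. This is a minimal sketch and not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the answer and assigns random confidences, the schedule (commit an even share of the most confident positions per pass) is an assumption, and the toy never revisits a token once it is committed.

```python
import random

MASK = "_"  # placeholder token


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: propose a token and a confidence
    for every position that is still masked. A real model would predict
    these from the whole block rather than from a known target."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}


def refine(target, passes=4, seed=0):
    """Fill a fully masked block over several passes, most confident first."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(passes):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break  # nothing left to fill in
        # Commit an even share of the remaining positions (ceiling division),
        # so the last pass always fills whatever is left.
        k = -(-len(proposals) // (passes - step))
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


words = "the cat sat on the mat".split()
final, history = refine(words, passes=3)
for row in history:
    print(" ".join(row))
```

Each printed row shows the block after one pass, with fewer placeholders remaining each time. A real diffusion model differs in the important respect described above: it can also change tokens it produced in earlier passes.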

Still, Google is not presenting the model as a replacement for its standard Gemma 4 lineup. It says the autoregressive Gemma 4 models remain the preferred option for production use cases that prioritize output quality. DiffusionGemma, by contrast, is framed as an experimental system for situations where speed is the main concern.

Open release with developer tooling

Alongside the model release, Google said the weights are available on Hugging Face and that developers can work with DiffusionGemma through tools including MLX, vLLM and Hugging Face Transformers. The company also pointed to fine-tuning options using Hackable Diffusion, Unsloth and NVIDIA NeMo, with support for llama.cpp expected later.

Google highlighted one example of fine-tuning in which a version of the model was adapted to solve Sudoku, a task that can be difficult for autoregressive models because future tokens influence earlier decisions. The company also noted that the model’s speed advantages are most useful at low to medium batch sizes on a single accelerator, while cloud serving with high concurrency may reduce the benefit.

The release continues Google’s recent push around the Gemma family, but with a different technical emphasis. Instead of chasing the highest-quality responses, DiffusionGemma is an attempt to show how diffusion techniques can be used to make text generation much faster on local hardware.