Google adds quantization-aware training checkpoints to Gemma 4 for smaller, faster local deployment

Google expands Gemma 4 with efficiency-focused checkpoints

Google has released new quantization-aware training, or QAT, checkpoints for its Gemma 4 model family, aiming to make the models easier to run on phones, laptops, and consumer GPUs. The company said the update is designed to reduce memory requirements while preserving model quality and improving on-device performance.

The new checkpoints arrive about two months after Gemma 4 first launched, and follow a series of recent additions to the lineup. Google has already introduced multi-token prediction to speed up inference and added a 12B model to fill the gap between its smaller and larger variants. The latest release focuses on making the models more compact and practical for local deployment.

QAT differs from standard post-training quantization (PTQ) by simulating quantization during training rather than applying it afterward, so the model's weights can adapt to the reduced precision. Google said this approach helps reduce the quality loss that can appear when models are compressed. In its own testing, the company said, QAT produced better results than standard PTQ baselines.
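
To make the distinction concrete, the following is a minimal sketch of the general QAT mechanism, not Google's training recipe. It assumes a single scalar weight, a fixed quantization step (`scale`), and a straight-through estimator for the gradient; the function names `fake_quantize` and `qat_step` are hypothetical and chosen only for illustration.

```python
def fake_quantize(w, scale, bits=4):
    """Snap w to a signed integer grid, then map it back to a float.

    This "quantize then dequantize" step is what a QAT forward pass uses,
    so the loss is computed on the values the deployed model would see.
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def qat_step(w, x, target, scale, lr=0.1):
    """One gradient step on loss = (fake_quantize(w) * x - target) ** 2.

    Rounding has zero gradient almost everywhere, so this sketch uses the
    straight-through estimator: d(fake_quantize(w)) / dw is treated as 1.
    """
    w_q = fake_quantize(w, scale)
    error = w_q * x - target
    grad = 2 * error * x  # straight-through: gradient passes through rounding
    return w - lr * grad


# Toy run: the full-precision weight keeps moving until its *quantized*
# value produces the target output.
scale = 0.25
w = 0.33
ptq_value = fake_quantize(w, scale)  # quantizing the starting weight directly
for _ in range(50):
    w = qat_step(w, x=1.0, target=0.5, scale=scale)
qat_value = fake_quantize(w, scale)
```

In this toy run, training stops adjusting the weight once its quantized value reproduces the target. A post-training approach would simply round whatever weight training had already produced. Real QAT applies the same idea per tensor or per block across an entire network.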

The release includes checkpoints for the widely used Q4_0 quantization format, along with a new format tailored for mobile use cases. Google said the mobile-oriented version reduces the memory footprint of the Gemma 4 E2B model to 1GB. For the text-only version of that model, the company said, the footprint falls below 1GB, depending on the configuration.
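
For a rough sense of what Q4_0 implies for storage, the sketch below works through the arithmetic. It assumes the Q4_0 block layout used by llama.cpp (32 weights per block, stored as one 16-bit scale plus 32 four-bit values). The parameter count is a placeholder chosen for easy arithmetic, not a Gemma 4 figure, and the helper names are hypothetical.

```python
Q4_0_WEIGHTS_PER_BLOCK = 32
Q4_0_BYTES_PER_BLOCK = 2 + 16  # one fp16 scale + 32 packed 4-bit values


def q4_0_bits_per_weight():
    """Effective storage cost per weight, including the per-block scale."""
    return Q4_0_BYTES_PER_BLOCK * 8 / Q4_0_WEIGHTS_PER_BLOCK


def q4_0_bytes(n_weights):
    """Bytes needed to store n_weights in Q4_0 blocks (last block padded)."""
    blocks = -(-n_weights // Q4_0_WEIGHTS_PER_BLOCK)  # ceiling division
    return blocks * Q4_0_BYTES_PER_BLOCK


def fp16_bytes(n_weights):
    """Bytes needed to store the same weights as 16-bit floats."""
    return n_weights * 2


# Placeholder parameter count, NOT the actual size of any Gemma 4 model.
n_params = 1_000_000_000
quantized = q4_0_bytes(n_params)
baseline = fp16_bytes(n_params)
print(f"{q4_0_bits_per_weight()} bits/weight, "
      f"{quantized / baseline:.1%} of the fp16 size")
```

Actual file sizes will differ from this estimate because model files commonly keep some tensors, such as embeddings or normalization weights, at other precisions.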

A custom approach for mobile hardware

Google said it designed the mobile format for edge devices, using techniques intended to better match mobile processors and accelerators. The company highlighted several elements of the system: precomputed activation scaling to reduce on-device overhead, channel-wise quantization for better hardware alignment, and targeted 2-bit compression for the token-generation parts of the model.
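
As an illustration of what channel-wise quantization means in general, the sketch below compares one shared scale for a whole weight matrix against one scale per output channel (row). It is a generic example under simple assumptions (symmetric 4-bit rounding, made-up weights) and does not describe the internals of Google's mobile format; all function names are hypothetical.

```python
def symmetric_scale(values, bits=4):
    """Scale that maps the largest magnitude onto the top of the integer grid."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize_dequantize(values, scale, bits=4):
    """Round each value to the integer grid and map it back to a float."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(matrix, bits=4):
    """One scale shared by every weight in the matrix."""
    scale = symmetric_scale([v for row in matrix for v in row], bits)
    return [quantize_dequantize(row, scale, bits) for row in matrix]


def per_channel(matrix, bits=4):
    """A separate scale for each row (output channel)."""
    return [quantize_dequantize(row, symmetric_scale(row, bits), bits)
            for row in matrix]


def row_error(original_row, quantized_row):
    """Largest absolute rounding error within one row."""
    return max(abs(a - b) for a, b in zip(original_row, quantized_row))


# Made-up weights: one small-magnitude channel, one large-magnitude channel.
weights = [
    [0.02, -0.01, 0.03, -0.02],
    [1.50, -2.80, 0.70, 2.10],
]
small_err_tensor = row_error(weights[0], per_tensor(weights)[0])
small_err_channel = row_error(weights[0], per_channel(weights)[0])
```

With a shared scale, the large channel dictates the step size and the small channel's weights all round to zero. With per-row scales, the small channel keeps most of its detail. Because each row's scale is fixed ahead of time, it can be stored alongside the weights rather than computed on the device.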

It also said it optimized the embedding and key-value cache components, which can play an important role in keeping long conversations efficient without exhausting memory. According to Google, these changes can significantly lower the active memory needed during use.
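
To see why the key-value cache matters for long conversations, the sketch below estimates its size with the standard formula for a transformer with full attention: two tensors (keys and values) per layer, each holding one vector per cached token. The layer count, head count, head size, and context length are hypothetical placeholders, not Gemma 4's configuration, and the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_element):
    """Estimate KV cache size for one sequence under full attention.

    The factor of 2 covers the separate key and value tensors.
    """
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8


# Hypothetical model shape, NOT the actual Gemma 4 configuration.
shape = dict(layers=30, kv_heads=4, head_dim=128, seq_len=8192)

cache_16bit = kv_cache_bytes(**shape, bits_per_element=16)
cache_8bit = kv_cache_bytes(**shape, bits_per_element=8)

print(f"16-bit cache: {cache_16bit / 2**20:.0f} MiB")
print(f" 8-bit cache: {cache_8bit / 2**20:.0f} MiB")
```

The cache grows linearly with the number of tokens kept in context, so storing it at lower precision cuts the per-token cost proportionally. Models that use sliding-window attention or share cache entries across layers will need less than this estimate suggests.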

The company also noted that users can reduce memory use further by omitting modalities they do not need. Developers can deploy only the text components, or add the audio and vision encoders individually, depending on the application.

Broad support across developer tools

Google said the Gemma 4 QAT checkpoints are supported across a range of tools and runtimes to make the models easier to use. The company pointed developers to model weights on Hugging Face in Q4_0 and mobile collections, with formats prepared for llama.cpp and vLLM. It also said unquantized checkpoints are available for workflows that require conversion before quantization.

Google listed several deployment options, including llama.cpp, Ollama, LM Studio, LiteRT-LM for edge devices, and Transformers.js for running models in the browser. It also mentioned support for larger-model serving through SGLang and vLLM, Apple Silicon optimization with MLX, and fine-tuning through Hugging Face Transformers and Unsloth.

The release extends Google’s push to position Gemma 4 as a family that can run across a wider range of hardware, not just server-class systems. For developers building local or edge-focused applications, the new QAT checkpoints are intended to make that easier without a major tradeoff in output quality.