Google has introduced Gemma 4 12B, a new multimodal model the company says is designed to run locally on laptops with 16GB of memory. The release is positioned as a middle ground between smaller edge-focused models and Google’s larger 26B mixture-of-experts system.

The company says Gemma 4 12B is built to bring multimodal intelligence to consumer hardware without requiring cloud access. It can handle vision and audio inputs natively, and Google says it is the first mid-sized Gemma model with native audio support. The model is also intended for agentic workflows, meaning it can support more complex, multi-step tasks.

A unified architecture without separate encoders

A central feature of Gemma 4 12B is its encoder-free design. In many multimodal systems, images and audio are processed by separate encoder components before being passed to the language model. Google says it removed those extra steps in this model, routing vision and audio directly into the LLM backbone instead.

For visual input, the company says it replaced the usual vision encoder with a much simpler embedding module. For audio, it removed the encoder entirely and projects the raw signal into the same embedding space used by text tokens. Google says this approach reduces latency and memory use while preserving performance.
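
As a rough illustration of the encoder-free idea, the Python sketch below turns a toy image and a toy waveform into token-sized vectors with a single linear projection each, then places them in one sequence alongside text embeddings. Google has not published these details in the announcement, so the patch size, frame length, embedding width, and the use of a plain linear layer are all assumptions, and every name in the code is hypothetical.

```python
import random

D_MODEL = 8  # hypothetical embedding width shared with text tokens


def make_projection(in_dim, out_dim, seed):
    """Random weight matrix standing in for a learned linear projection."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vectors, weights):
    """Multiply each input vector by the weight matrix (one linear layer)."""
    out_dim = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]


def image_patches(image, patch):
    """Flatten non-overlapping patch x patch tiles of a 2D grayscale image."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, rows - patch + 1, patch)
        for c in range(0, cols - patch + 1, patch)
    ]


def audio_frames(samples, frame_len):
    """Cut a raw waveform into non-overlapping fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# Toy inputs: an 8x8 "image" and a 64-sample "waveform" (assumed sizes).
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
waveform = [((i % 16) - 8) / 8 for i in range(64)]

# No separate encoder: each modality goes through one projection only.
vision_tokens = project(image_patches(image, 4), make_projection(16, D_MODEL, seed=0))
audio_tokens = project(audio_frames(waveform, 16), make_projection(16, D_MODEL, seed=1))
text_tokens = [[0.0] * D_MODEL for _ in range(3)]  # stand-in for embedded text

# All three modalities end up in one sequence for the language model backbone.
sequence = vision_tokens + audio_tokens + text_tokens
print(len(sequence), "tokens of width", len(sequence[0]))
```

The point of the sketch is only that each modality reaches the shared sequence through a lightweight mapping rather than a dedicated encoder network; the real model's embedding module is presumably more involved.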

The company claims the model’s benchmark results are close to those of its larger 26B model while using less than half the memory. That makes it suitable for local use on machines with limited resources, according to Google.
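
For a sense of scale, the back-of-the-envelope Python calculation below estimates the memory needed for model weights alone at a few common precisions. It is a sketch under stated assumptions: the announcement does not say which precision Google measured, the parameter counts are taken at face value from the model names, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts assumed from the model names (12B and 26B).
for bits in (16, 8, 4):
    small = weight_memory_gb(12e9, bits)
    large = weight_memory_gb(26e9, bits)
    print(f"{bits}-bit weights: 12B ~ {small:.1f} GB, 26B ~ {large:.1f} GB, "
          f"ratio {small / large:.2f}")
```

Under these assumptions, 16-bit weights for a 12B model would take about 24 GB and would not fit in 16GB of memory, while 8-bit (about 12 GB) or 4-bit (about 6 GB) weights would. At equal precision, the 12B model's weights come to roughly 46% of the 26B model's, which is consistent with the "less than half" claim.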

Built for local use and developer access

Google says Gemma 4 12B can run on consumer laptops with 16GB of RAM or unified memory, making it easier for developers to test multimodal applications offline. The company also highlighted support for faster inference through Multi-Token Prediction drafters, which are designed to lower latency.
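
The announcement does not explain how the drafters work. One common way a small drafter speeds up a larger model is a draft-and-verify loop, often called speculative decoding, and the Python sketch below shows a greedy version of that general technique. Whether Gemma's Multi-Token Prediction drafters follow this exact scheme is an assumption, and the two toy "models" are hypothetical stand-ins.

```python
def target_next(context):
    """Toy 'large' model: next token is the sum of the last two, mod 10."""
    return (context[-1] + context[-2]) % 10


def draft_next(context):
    """Toy drafter: agrees with the target except right after a 5."""
    if context[-1] == 5:
        return 0
    return (context[-1] + context[-2]) % 10


def greedy_generate(prompt, num_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt, num_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The drafter cheaply proposes k tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every drafted position. A real system would
        #    do this in one batched forward pass, so it is counted once here.
        target_passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed == expected:
                accepted.append(proposed)
            else:
                accepted.append(expected)  # fix the first mismatch, then stop
                break
        else:
            # Every draft was accepted, so the same pass yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new], target_passes


prompt = [1, 1]
baseline = greedy_generate(prompt, 12)
fast, passes = speculative_generate(prompt, 12)
print("same output:", fast == baseline)
print("target passes:", passes, "vs", 12, "for plain greedy decoding")
```

Because the target model has the final say at every position, the output matches plain greedy decoding; the latency gain comes from needing fewer passes through the large model when the drafter guesses well.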

The model is being released under the Apache 2.0 license, a permissive license that allows commercial use, modification, and redistribution, which makes it broadly accessible to developers and businesses that want to build with it. Google also said the Gemma 4 family has now passed 150 million downloads, a sign of continued interest in the open model line.

To help adoption, Google is making Gemma 4 12B available through a range of tools and platforms, including LM Studio, Ollama, Google AI Edge Gallery, the Google AI Edge Eloquent app, LiteRT-LM CLI, Hugging Face, Kaggle, llama.cpp, MLX, SGLang, and vLLM. Developers can also fine-tune the model using Unsloth.

Focus on agents and offline audio tasks

Google is also promoting Gemma 4 12B for agent development. Alongside the model, the company is launching a Gemma Skills repository meant to help agents use the latest Gemma capabilities.

In a demonstration, Google showed the model transcribing, formatting, and translating voice inputs entirely offline using the AI Edge Eloquent app. That example underscores the company’s push to make advanced AI work on-device rather than relying on remote servers.

The release comes as Google continues to expand the Gemma lineup for developers who want lighter-weight models that still offer modern multimodal features. With Gemma 4 12B, the company is aiming to make local AI systems more capable, especially for users who want to run vision and audio workloads directly on a laptop.