NVIDIA Brings NeMo AutoModel to Hugging Face for Faster MoE Fine-Tuning

NVIDIA has introduced NeMo AutoModel on Hugging Face, a library designed to speed up fine-tuning for Mixture-of-Experts models while keeping the familiar Transformers interface intact.

The company says the new approach builds on Hugging Face Transformers v5, which recently added first-class support for MoE architectures. NVIDIA's contribution adds its own performance layer through expert parallelism, fused all-to-all dispatch, and TransformerEngine kernels. According to NVIDIA, the result is substantially faster training and lower GPU memory use without requiring users to rewrite their existing code.

At the center of the release is a simple promise: users can keep the same from_pretrained() workflow and swap in a different import to access NeMo AutoModel. NVIDIA says that change alone can deliver 3.4x to 3.7x higher fine-tuning throughput and reduce GPU memory use by 29% to 32% compared with native Transformers v5 in its benchmarks.

MoE models have become increasingly important in frontier AI because they route each token to a small subset of expert networks, so only a fraction of a model's parameters is active for any given input. But that architecture also brings complexity. Routing tokens across experts, distributing weights across GPUs, and coordinating communication with computation all place heavy demands on training infrastructure. NVIDIA says NeMo AutoModel is built to meet those demands by providing optimized building blocks on top of the standard Hugging Face ecosystem.
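For readers unfamiliar with the routing step, the idea can be illustrated in a few lines of plain Python. This is a minimal sketch of generic top-k routing, not NVIDIA's or Hugging Face's implementation; the function names, the choice of k=2, and the example logits are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs whose weights are
    renormalized to sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def dispatch(batch_logits, k=2):
    """Group token indices by the expert that has to process them."""
    num_experts = len(batch_logits[0])
    per_expert = {e: [] for e in range(num_experts)}
    for token_idx, logits in enumerate(batch_logits):
        for expert, _weight in route_top_k(logits, k):
            per_expert[expert].append(token_idx)
    return per_expert

# Hypothetical router scores: 3 tokens, 4 experts.
batch = [
    [2.0, 0.1, 0.3, 1.5],
    [0.0, 3.0, 0.2, 0.1],
    [0.5, 0.4, 2.5, 2.4],
]
assignments = dispatch(batch, k=2)
print(assignments)
```

Each token ends up on a different pair of experts, and the per-expert token lists are uneven. When the experts live on different GPUs, that grouping step becomes the all-to-all exchange that makes MoE training communication-heavy.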

The company says the library subclasses AutoModelForCausalLM and supports common MoE model families, including Qwen3, NVIDIA Nemotron, GPT-OSS and DeepSeek V3. For those models, NVIDIA has added hand-tuned implementations that use custom attention, fused linear layers and specialized expert kernels. For other models, the system falls back to standard Hugging Face behavior while still applying some optimizations.

NVIDIA also highlighted compatibility with distributed training. Users can pass a device mesh or distributed setup into the model load call, allowing multi-GPU training without major code changes. The company said this design helps the library scale from single-node 30B-class models to much larger systems.

In one benchmark, NVIDIA tested a 550B-parameter Nemotron model across 16 nodes, or 128 H100 GPUs, and said NeMo AutoModel fit the training workload within memory limits where Transformers v5 could not. For smaller single-node tests on 8 H100 GPUs, NVIDIA compared NeMo AutoModel with Transformers v4 and v5 using Qwen3-30B-A3B and Nemotron 3 Nano 30B A3B. In those tests, the company reported higher tokens-per-second performance and lower peak memory use with its library.
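A rough back-of-envelope calculation shows why a model of that size has to be spread across many nodes. The sketch below is illustrative arithmetic only: the 2 bytes per parameter (bf16 weights) and 80 GB of memory per H100 are assumptions not stated in NVIDIA's announcement, and the estimate covers weights alone, ignoring gradients, optimizer state and activations.

```python
# Figures taken from the article.
NODES = 16
GPUS_PER_NODE = 8              # 128 GPUs / 16 nodes
PARAMS = 550_000_000_000       # 550B parameters

# Assumptions for illustration (not from the announcement).
BYTES_PER_PARAM = 2            # bf16 weights
HBM_GB_PER_GPU = 80            # 80 GB H100 variant

total_gpus = NODES * GPUS_PER_NODE
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
single_node_hbm_gb = GPUS_PER_NODE * HBM_GB_PER_GPU
cluster_hbm_gb = total_gpus * HBM_GB_PER_GPU

# Idealized case: weights split evenly across every GPU.
weights_per_gpu_gb = weights_gb / total_gpus

print(f"GPUs in cluster:        {total_gpus}")
print(f"Weights (bf16):         {weights_gb:.0f} GB")
print(f"One node of GPU memory: {single_node_hbm_gb} GB")
print(f"Cluster GPU memory:     {cluster_hbm_gb} GB")
print(f"Weights per GPU, even:  {weights_per_gpu_gb:.2f} GB")
```

Under these assumptions the weights alone exceed the memory of a single 8-GPU node, so whether training fits depends on how well the weights and training state are sharded across the cluster.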

NVIDIA attributes the gains to three main factors. Expert parallelism spreads expert weights across GPUs to reduce memory pressure. DeepEP fuses the all-to-all token dispatch so that routing communication overlaps with computation. TransformerEngine supplies fused kernels for operations such as attention, linear layers and normalization.
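The memory effect of expert parallelism can be sketched without any GPU code. The example below is a simplified illustration, not NeMo AutoModel's sharding logic: the helper names, the contiguous placement scheme, and the expert count and size are hypothetical.

```python
def shard_experts(num_experts, ep_size):
    """Assign a contiguous block of experts to each expert-parallel rank."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }

def expert_bytes_per_rank(num_experts, params_per_expert, ep_size,
                          bytes_per_param=2):
    """Expert-weight memory held by one rank when experts are sharded."""
    experts_on_rank = num_experts // ep_size
    return experts_on_rank * params_per_expert * bytes_per_param

# Hypothetical layer: 128 experts of 10M parameters each, bf16 weights.
NUM_EXPERTS = 128
PARAMS_PER_EXPERT = 10_000_000

placement = shard_experts(NUM_EXPERTS, ep_size=8)
replicated = expert_bytes_per_rank(NUM_EXPERTS, PARAMS_PER_EXPERT, ep_size=1)
sharded = expert_bytes_per_rank(NUM_EXPERTS, PARAMS_PER_EXPERT, ep_size=8)

print(f"experts on rank 0: {placement[0][0]}..{placement[0][-1]}")
print(f"replicated: {replicated / 1e9:.2f} GB per GPU")
print(f"sharded:    {sharded / 1e9:.2f} GB per GPU")
```

With eight ranks, each GPU holds one eighth of the expert weights in this toy setup. The trade-off is that tokens must now travel to whichever GPU owns their expert, which is where a fused dispatch step comes in.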

The release reflects a broader push to make large-model training more practical on open-source tooling. By releasing NeMo AutoModel through Hugging Face and keeping the same loading pattern, NVIDIA is positioning the library as a performance-focused extension rather than a separate workflow. For teams already using Transformers, the appeal is that the upgrade path appears to be minimal, at least from the code side.