VibeThinker-3B debuts as a compact model with strong reasoning benchmark results

VibeThinker-3B aims at verifiable reasoning

WeiboAI has released VibeThinker-3B, a 3-billion-parameter dense model aimed at tasks whose answers can be checked, including mathematics, coding and STEM reasoning. The model is part of the VibeThinker line and is presented as a test of how far smaller language models can go when the target output has clear verification signals.

The model card says VibeThinker-3B was not trained on tool-use, agent-workflow or function-calling data. For that reason, the developers advise against using it for API orchestration or autonomous coding agents. They instead recommend it for competitive programming-style tasks, where outputs can be evaluated more directly.

Benchmark results against larger systems

According to the model release, VibeThinker-3B performs strongly on a range of reasoning benchmarks, including AIME, HMMT, IMO-AnswerBench, LiveCodeBench and recent LeetCode contests. The developers say these results place it in the range of several much larger frontier models on verifiable reasoning tasks.

One highlighted figure is an IMO-AnswerBench score of 76.4, which the model card says rises to 80.6 when using a test-time strategy called Claim-Level Reliability Assessment, or CLR. The release compares those scores with substantially larger models, including DeepSeek V3.2, GLM-5 and Kimi K2.5, though the benchmark sets and testing methods may differ across systems.

The team also points to out-of-distribution testing on recent LeetCode weekly and biweekly contests, with solutions written in Python. In that evaluation, 123 of VibeThinker-3B's 128 first-attempt submissions were reportedly accepted, a 96.1% acceptance rate.
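
The reported rate follows directly from the two counts. As a quick check, a minimal Python sketch (the function name `acceptance_rate` is illustrative and does not come from the release):

```python
def acceptance_rate(accepted: int, total: int) -> float:
    """Return the share of accepted submissions as a percentage."""
    if total <= 0:
        raise ValueError("total must be a positive number of submissions")
    return 100.0 * accepted / total


# Counts as reported in the release: 123 accepted out of 128 first attempts.
rate = acceptance_rate(123, 128)
print(f"{rate:.1f}%")  # prints 96.1%
```

The unrounded value is 96.09375%, which rounds to the 96.1% quoted in the release.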

How the model was trained

The developers say VibeThinker-3B builds on the earlier VibeThinker-1.5B work and uses a training approach they call the Spectrum-to-Signal Principle. In broad terms, the process combines supervised fine-tuning with reinforcement learning, self-distillation and an instruction-tuning stage.

The first stage of supervised fine-tuning is designed to cover a wide spread of tasks, including math, coding, STEM reasoning, general conversation and instruction following. A second stage narrows in on more difficult, longer-horizon examples. The team says it also uses a distillation method intended to preserve multiple valid solution paths.

Reinforcement learning is then applied across math, code and STEM tasks, using a 64K context window to retain complete reasoning traces. Later steps feed high-quality trajectories back into a unified student model, which is then tuned further for user-facing prompts with rule-based and rubric-based rewards.
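
To illustrate what a rule-based reward on a verifiable task can look like, here is a minimal Python sketch. It is not the team's implementation: the `\boxed{...}` answer convention, the function names and the 0/1 scoring are all assumptions made for illustration, and the release does not describe its reward rules at this level of detail.

```python
import re
from typing import Optional

# Assumption: the model marks its final answer as \boxed{...}.
# This pattern handles only answers without nested braces.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last boxed answer, or None if there is none."""
    matches = BOXED_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference: str) -> float:
    """Hypothetical 0/1 reward: 1.0 only if the final answer matches exactly."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # no checkable answer was produced
    return 1.0 if answer == reference.strip() else 0.0


print(rule_based_reward(r"Adding the cases gives \boxed{42}.", "42"))  # 1.0
print(rule_based_reward(r"Adding the cases gives \boxed{41}.", "42"))  # 0.0
print(rule_based_reward("I am not sure.", "42"))                       # 0.0
```

A check of this kind needs no human judgment, which is what makes math and coding tasks convenient targets for reinforcement learning. A rubric-based reward for open-ended prompts would need a scoring model or grader that this sketch does not cover.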

Where the model fits

The model card argues that compact models should not be seen only as cost-saving substitutes for larger systems. Instead, the team says small models may become especially competitive when the task structure is strong and the feedback loop is reliable.

At the same time, the release draws a line between verifiable reasoning and broader open-domain work. It says larger general-purpose models may still be better suited to tasks involving wide knowledge coverage, general dialogue and long-tail understanding.

VibeThinker-3B is available on Hugging Face under the MIT License, with supporting links to GitHub, ModelScope and a technical report. The base model listed is Qwen2.5-3B, and the repository also notes a coder variant in its lineage.