Morph says coding-focused engineering speeds code generation with speculative decoding

Morph says code generation can be sped up by tailoring the full inference stack

Morph has outlined a set of engineering changes it says can materially accelerate code generation for AI coding agents, arguing that generic inference stacks leave performance on the table. In a recent post, the company said it focuses on open models such as Qwen, GLM, DeepSeek and MiniMax, but optimizes them for a single workload: the coding agent.

The central idea is that code generation has unusual patterns that make it well suited to specialized optimization. Morph argues that code edits often resemble the surrounding file, prompts are frequently repeated across turns, and many tokens in a coding session are highly predictable. That creates room for systems designed specifically around code rather than broad-purpose text generation.

One of the main techniques Morph highlighted is speculative decoding, where a smaller draft model predicts several upcoming tokens and the larger target model verifies them together, so the output requires fewer passes through the large model. The company said the key metric is acceptance rate, or how often the target model keeps the draft model’s guesses. According to the post, a draft model trained on the target model’s own coding output can outperform a generic draft model. Morph cited one example in which an off-the-shelf 68-million-parameter draft achieved a 1.93x speedup, while a draft trained on the target’s code output reached 3.07x.
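The draft-then-verify loop can be illustrated with a minimal Python sketch. It is not Morph's implementation: it uses a simplified greedy rule (a drafted token is kept only if it matches the target's own next token), whereas production systems typically verify all drafted positions in one batched forward pass and may use a probabilistic accept/reject rule. The names `target_model`, `tuned_draft` and `generic_draft` are hypothetical toy stand-ins that depend only on position.

```python
from typing import Callable, List, Tuple

Token = str
Model = Callable[[List[Token]], Token]  # greedy next-token function (toy assumption)

def speculative_step(target: Model, draft: Model,
                     context: List[Token], k: int) -> Tuple[List[Token], int]:
    """One draft-then-verify step: returns (new tokens, drafted tokens accepted)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target checks each position. A real system would do this in a
    #    single batched forward pass; here it is simulated position by position.
    accepted = 0
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            # First mismatch: keep the accepted prefix plus the target's token.
            return proposed[:i] + [expected], accepted
        accepted += 1
    # Every guess was accepted, so the target contributes one extra token.
    return proposed + [target(context + proposed)], accepted

def generate(target: Model, draft: Model, prompt: List[Token],
             n_tokens: int, k: int = 4) -> Tuple[List[Token], int, float]:
    """Returns (generated tokens, target passes used, acceptance rate)."""
    out = list(prompt)
    passes = drafted = accepted = 0
    while len(out) - len(prompt) < n_tokens:
        new_tokens, n_ok = speculative_step(target, draft, out, k)
        out.extend(new_tokens)
        passes += 1
        drafted += k
        accepted += n_ok
    return out[len(prompt):len(prompt) + n_tokens], passes, accepted / drafted

# Hypothetical toy models: the "target" repeats a fixed code-like token cycle.
SEQ = ["def", "f", "(", "x", ")", ":", "return", "x"]

def target_model(ctx: List[Token]) -> Token:
    return SEQ[len(ctx) % len(SEQ)]

def tuned_draft(ctx: List[Token]) -> Token:
    # Mimics a draft trained on the target's output: wrong at one position in eight.
    return "pass" if len(ctx) % len(SEQ) == 6 else target_model(ctx)

def generic_draft(ctx: List[Token]) -> Token:
    # Mimics an off-the-shelf draft: wrong at every other position.
    return "?" if len(ctx) % 2 == 1 else target_model(ctx)

reference = [target_model(SEQ[:0] + ["_"] * i) for i in range(16)]
tuned_out, tuned_passes, tuned_rate = generate(target_model, tuned_draft, [], 16)
generic_out, generic_passes, generic_rate = generate(target_model, generic_draft, [], 16)
```

In this toy run both drafts produce exactly the tokens the target would have generated on its own, but the workload-matched draft has a higher acceptance rate and so needs fewer target passes, which is the effect the acceptance-rate metric is meant to capture.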

Morph also pointed to speculative decoding variants from other projects, including EAGLE-3 and DFlash, as evidence that the field is moving quickly. But the company said the main limitation is not the algorithm alone. It argued that a usable draft model still has to be trained for the intended target and workload, rather than pulled from a generic stack.

A second area Morph emphasized is kernel optimization. The company said it automatically searches for faster GPU kernels rather than hand-tuning them for specific hardware. According to the post, this approach is especially important on lower-cost GPUs that are not the primary target of frontier-model infrastructure. Morph said one of its warp-decode kernels reached 162 tokens per second on an 80-billion-parameter mixture-of-experts model running on an RTX PRO 6000, up from 97 tokens per second, or roughly 1.67 times the original throughput. The company said that result was faster than a cited H100 baseline and came without accuracy loss.

The company framed that work as part of a broader argument that default kernels are often tuned for the hardware used by large AI labs, not for the cheaper systems many deployers actually buy. To address that, Morph said it uses an automated loop that proposes kernels, checks correctness against reference outputs, measures performance on production traces, and ships only the versions that are both correct and faster.
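The propose, verify, measure and ship loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Morph's system: the "kernels" are ordinary Python functions rather than GPU code, and the names `is_correct`, `wall_clock`, `select_kernel` and the example kernels are hypothetical. The `measure` parameter is injectable so that the timing method can be swapped out.

```python
import math
import time
from typing import Callable, Dict, List, Sequence, Tuple

Kernel = Callable[[Sequence[float]], List[float]]
Measure = Callable[[Kernel, Sequence[Sequence[float]]], float]

def is_correct(candidate: Kernel, reference: Kernel,
               inputs: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """Compare a candidate against reference outputs on every input."""
    for x in inputs:
        got, want = candidate(x), reference(x)
        if len(got) != len(want):
            return False
        if not all(math.isclose(g, w, rel_tol=tol, abs_tol=tol)
                   for g, w in zip(got, want)):
            return False
    return True

def wall_clock(kernel: Kernel, traces: Sequence[Sequence[float]],
               repeats: int = 5) -> float:
    """Best-of-N wall-clock time for running the kernel over all traces."""
    best = math.inf
    for _ in range(repeats):
        start = time.perf_counter()
        for x in traces:
            kernel(x)
        best = min(best, time.perf_counter() - start)
    return best

def select_kernel(baseline: Kernel, candidates: Dict[str, Kernel],
                  traces: Sequence[Sequence[float]],
                  measure: Measure = wall_clock) -> Tuple[str, Kernel]:
    """Keep a candidate only if it is both correct and faster than the best so far."""
    best_name, best_kernel = "baseline", baseline
    best_cost = measure(baseline, traces)
    for name, candidate in candidates.items():
        if not is_correct(candidate, baseline, traces):
            continue  # wrong answers are rejected regardless of speed
        cost = measure(candidate, traces)
        if cost < best_cost:
            best_name, best_kernel, best_cost = name, candidate, cost
    return best_name, best_kernel

# Hypothetical example kernels: each should double every element.
def baseline_kernel(x: Sequence[float]) -> List[float]:
    return [v * 2.0 for v in x]

def fast_but_wrong(x: Sequence[float]) -> List[float]:
    return [v * 2.0 + 0.01 for v in x]

def fast_and_right(x: Sequence[float]) -> List[float]:
    return [v + v for v in x]

def slow_and_right(x: Sequence[float]) -> List[float]:
    return [sum([v, v]) for v in x]

candidates = {"fast_but_wrong": fast_but_wrong,
              "fast_and_right": fast_and_right,
              "slow_and_right": slow_and_right}
traces = [[0.5, 1.0, 2.0], [3.0, -4.0, 5.5]]
```

Calling `select_kernel(baseline_kernel, candidates, traces)` times each correct candidate on the supplied traces. The correctness gate runs before any timing, so a candidate that is fast but wrong can never be selected.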

Morph’s third focus is interconnect and caching. The post said many economical GPU systems lack NVLink and rely on PCIe, which has much lower bandwidth. To reduce the cost of communication between GPUs, the company said it built its own all-reduce layer (the collective operation that combines partial results across GPUs) and its own prefix-caching layer, including support for moving cached context across machines over standard TCP.

The company argued that these changes matter because coding traffic tends to reuse the same prefixes and context across turns, making cache hits especially valuable. In its view, the combination of a trained speculator, faster kernels and improved interconnects allows open models to run more quickly on hardware that is less expensive than the systems often assumed by standard inference stacks.
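Why repeated prefixes make cache hits valuable can be shown with a small Python sketch of block-level prefix caching. It rests on assumptions that are not from Morph's post: tokens are integers, the cache stores placeholder strings where a real server would hold key-value attention state, and the `PrefixCache` class and its block-hashing scheme are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence

class PrefixCache:
    """Toy prefix cache keyed by chained hashes of fixed-size token blocks."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # chain hash -> placeholder for the cached state of that block
        self._blocks: Dict[str, str] = {}

    def _chain_hashes(self, tokens: Sequence[int]) -> List[str]:
        """Hash each full block together with the hash of everything before it."""
        hashes: List[str] = []
        prev = ""
        full = len(tokens) - len(tokens) % self.block_size  # ignore partial tail
        for start in range(0, full, self.block_size):
            block = tokens[start:start + self.block_size]
            payload = prev + "|" + ",".join(str(t) for t in block)
            prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            hashes.append(prev)
        return hashes

    def insert(self, tokens: Sequence[int]) -> None:
        """Record every full block of this prompt as cached."""
        for i, h in enumerate(self._chain_hashes(tokens)):
            self._blocks.setdefault(h, f"kv-block-{i}")

    def cached_prefix_len(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose state could be reused from the cache."""
        hit = 0
        for h in self._chain_hashes(tokens):
            if h not in self._blocks:
                break
            hit += self.block_size
        return hit

# A second turn that repeats the first turn's context and appends new tokens.
cache = PrefixCache(block_size=4)
turn_1 = list(range(10))
cache.insert(turn_1)
turn_2 = turn_1 + [99, 100, 101]
reused = cache.cached_prefix_len(turn_2)     # leading tokens that hit the cache
to_compute = len(turn_2) - reused            # tokens that still need prefill
```

Because each key is derived only from token content, the same prefix produces the same key wherever it is computed. In a design like this one, that property is what would let cached blocks be looked up or transferred between machines.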

Morph’s broader message is that speed, not model weights, is the differentiator for serving coding agents. The company said the same open models are available to everyone, but a system built specifically for code generation can make them significantly faster.