A back-of-the-envelope way to price inference

A recent technical explainer lays out a practical framework for estimating how much it costs to serve AI models to users at scale. The piece argues that, with a few hardware specs and some assumptions about model size and context length, teams can make a rough estimate of per-user inference cost before building full production systems.

The method is aimed at products that run large language models as part of the core stack, where serving efficiency can directly influence subscription pricing and margins. Instead of focusing on exact measurements, the article uses napkin math to show which parts of the inference pipeline are likely to dominate spending on GPUs.

The core idea is to compare compute capacity with memory bandwidth, then use that ratio to estimate how many simultaneous conversations a single accelerator can support. The discussion centers on a high-end NVIDIA B200 GPU, along with a model assumption of roughly 32 billion parameters and a long context window of 200,000 tokens.

Why attention and caching matter

The article starts by breaking down the cost of matrix multiplication, a basic operation that underpins transformer inference. It explains that these operations involve both arithmetic work and memory movement, and that tiling can reduce how often data needs to be read from memory. From there, it connects the math to language models, which process token sequences through repeated attention layers before predicting the next token.
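The arithmetic-versus-memory trade-off can be sketched in a few lines. The example below is an illustration rather than the article's own code: it assumes an idealized matmul in which, thanks to perfect tiling, each input element is read from memory once and each output element is written once. The function names and matrix sizes are hypothetical.

```python
def matmul_cost(m, k, n, bytes_per_value=1):
    """Return (flops, bytes_moved) for an (m x k) @ (k x n) matrix product.

    Assumes ideal tiling: A and B are each read once and C is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops, bytes_moved


def arithmetic_intensity(m, k, n, bytes_per_value=1):
    """FLOPs performed per byte moved, under the same idealized assumption."""
    flops, bytes_moved = matmul_cost(m, k, n, bytes_per_value)
    return flops / bytes_moved


# One token against a 4096 x 4096 weight matrix (hypothetical sizes).
single_token = arithmetic_intensity(1, 4096, 4096)
# 256 tokens batched against the same weights.
batched = arithmetic_intensity(256, 4096, 4096)

print(f"single token: {single_token:.2f} FLOPs/byte")
print(f"batch of 256: {batched:.1f} FLOPs/byte")
```

With a single token, the weight matrix is read once to do only about two operations per byte; batching reuses the same weights across many tokens, which is why the work done per byte climbs so steeply.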

A key point is that naive inference can waste a large amount of compute by reprocessing the full prompt every time a new token is generated. To avoid that, modern inference systems cache key and value tensors from earlier steps in a region of GPU memory called the KV cache. With caching, the model only needs to process the newest token on each forward pass instead of the full conversation history.
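A rough count of how many token positions pass through the model makes the saving concrete. This is a simplified sketch: it counts positions only, ignores the per-layer attention cost, and uses a hypothetical function name and made-up prompt and output lengths.

```python
def tokens_processed(prompt_len, new_tokens, use_kv_cache):
    """Count token positions run through the model to generate new_tokens."""
    if new_tokens <= 0:
        return 0
    if use_kv_cache:
        # The prompt is processed once (prefill) and yields the first new token.
        # Each later step feeds in only the token generated just before it.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i reprocesses the prompt plus the i tokens
    # generated so far, for i = 0 .. new_tokens - 1.
    return new_tokens * prompt_len + new_tokens * (new_tokens - 1) // 2


naive = tokens_processed(1000, 100, use_kv_cache=False)
cached = tokens_processed(1000, 100, use_kv_cache=True)
print(naive, cached)  # 104950 1099
```

For a 1,000-token prompt and 100 generated tokens, the uncached count is roughly 95 times the cached one, and the gap widens as the conversation grows.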

That shift dramatically changes the resource profile. In the simplified example, the compute load drops far enough that arithmetic throughput is no longer the only constraint, and memory traffic has to be weighed alongside it. The article uses the B200’s published specifications to show that the chip can perform far more arithmetic operations per second than it can read bytes from memory, which means an ideal system does enough work on each byte it loads to keep both compute and bandwidth busy.

The limits of scale

Using the B200’s claimed 4,500 TFLOP/s compute rate and 8 TB/s memory bandwidth, the article concludes that the hardware is used most efficiently when serving hundreds of users concurrently. A simple ratio yields an idealized target of about 331 simultaneous users if the workload is perfectly balanced between compute and memory traffic.
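The ratio itself is easy to reproduce from the two quoted specifications. The sketch below is an illustration under stated assumptions, not the article's calculation: it assumes about 2 FLOPs per parameter per generated token, one byte per parameter, and memory traffic made up of model weights only. Under those assumptions the break-even batch comes out near 281, so the article's figure of about 331 evidently rests on details that this simplified model does not capture. The function names are hypothetical.

```python
PEAK_FLOPS = 4_500e12  # 4,500 TFLOP/s, as quoted in the text
MEM_BANDWIDTH = 8e12   # 8 TB/s, as quoted in the text


def ops_per_byte(peak_flops, mem_bandwidth):
    """FLOPs the chip can perform in the time it takes to load one byte."""
    return peak_flops / mem_bandwidth


def break_even_batch(peak_flops, mem_bandwidth,
                     flops_per_param=2.0, bytes_per_param=1.0):
    """Batch size at which compute time matches weight-loading time.

    Assumption: each decoding step reads every weight once (bytes_per_param
    bytes each) and spends flops_per_param FLOPs per weight per sequence.
    """
    ratio = ops_per_byte(peak_flops, mem_bandwidth)
    return ratio * bytes_per_param / flops_per_param


ratio = ops_per_byte(PEAK_FLOPS, MEM_BANDWIDTH)
batch = break_even_batch(PEAK_FLOPS, MEM_BANDWIDTH)
print(f"{ratio:.1f} FLOPs per byte, break-even batch of about {batch:.0f}")
```

Below the break-even batch the chip waits on memory; above it, arithmetic becomes the limit. Changing the assumed precision or FLOPs per parameter moves the number, but it stays in the hundreds.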

That figure is only an idealized balance point, however. Real deployments must reserve memory for model weights and the KV cache, and that is where the math becomes more constrained. The article notes that a 32B model can occupy about 32 GB of VRAM on its own, which corresponds to roughly one byte per parameter. A long-context chat session also requires a substantial KV cache, which can exceed the capacity of a single accelerator if no optimization is applied.

To reduce that burden, the explainer points to grouped-query attention, a technique that shares key and value heads across multiple query heads. In its simplified estimate, that approach cuts KV cache usage by about 8x. Even so, long-context sessions still consume significant memory, limiting how many concurrent chats can be hosted on one GPU.
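A small sizing sketch shows how the memory budget plays out. Every model dimension here is a hypothetical choice made for illustration (64 layers, 64 query heads, 8 key/value heads, head dimension 128, one byte per cached value), and the 192 GB of VRAM is an assumed capacity that does not appear in the text. Only the 32 GB of weights and the 200,000-token context come from the article.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=1):
    """Bytes needed to cache keys and values for one session."""
    # Factor of 2: one key vector and one value vector per head, layer, token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len


def max_sessions(vram_bytes, weight_bytes, per_session_bytes):
    """Whole sessions that fit in the VRAM left after loading the weights."""
    return max(0, (vram_bytes - weight_bytes) // per_session_bytes)


# Hypothetical model shape and assumed hardware capacity.
N_LAYERS, N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 64, 64, 8, 128
CONTEXT_LEN = 200_000
VRAM_BYTES = 192 * 10**9    # assumption, not from the text
WEIGHT_BYTES = 32 * 10**9   # 32 GB of weights, from the text

# Without grouped-query attention, every query head has its own K/V head.
mha = kv_cache_bytes(CONTEXT_LEN, N_LAYERS, N_Q_HEADS, HEAD_DIM)
# With grouped-query attention, 8 K/V heads are shared by 64 query heads.
gqa = kv_cache_bytes(CONTEXT_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"per session: {mha / 1e9:.1f} GB without GQA, {gqa / 1e9:.1f} GB with GQA")
print("sessions that fit:",
      max_sessions(VRAM_BYTES, WEIGHT_BYTES, mha), "without GQA,",
      max_sessions(VRAM_BYTES, WEIGHT_BYTES, gqa), "with GQA")
```

Under these assumptions, a single full-length session needs about 210 GB without grouped-query attention, more than the assumed card holds. The 8x reduction brings that to about 26 GB, which leaves room for six full-length sessions beside the weights, far short of the hundreds that the compute-to-bandwidth ratio suggests.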

A planning tool, not a precise forecast

The main value of the framework is not precision, but intuition. It helps teams reason about whether a model is likely to be compute-bound, memory-bound or constrained by VRAM footprint. That, in turn, can guide infrastructure decisions, capacity planning and pricing.

For AI companies trying to estimate serving costs before launch, the article offers a reminder that the economics of inference depend not just on model quality, but on how efficiently each token is generated and how much context the system must remember along the way.