A new review of scaling laws in deep learning examines how training loss changes as models, datasets and compute budgets grow, and why those patterns matter for practical resource planning. The central idea is straightforward: when training is scaled up, loss often falls in a predictable power-law pattern, which researchers can use to estimate how much data and compute a future model will need.
That predictability has made scaling laws one of the most useful empirical tools in modern machine learning. Rather than relying only on intuition, teams can fit curves from smaller experiments and then extrapolate to larger training runs. The review argues that this makes scaling laws especially valuable when compute is expensive and choices about model size and data mix can determine whether a project is efficient or undertrained.
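The fit-and-extrapolate workflow can be illustrated with a minimal sketch. It assumes a pure power law of the form loss ≈ a · N^(-b) with no irreducible-loss floor, which is a simplification, and it uses synthetic numbers rather than results from any real training run. The function names `fit_power_law` and `predict_loss` are hypothetical.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~= a * x**(-b) by least squares in log-log space.

    A power law is a straight line on log-log axes, so a degree-1
    polynomial fit on the logs recovers the exponent and prefactor.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, b, x):
    """Evaluate the fitted power law at a new scale x."""
    return a * np.asarray(x, dtype=float) ** (-b)

# Synthetic "small experiments" (assumption: generated from a known
# power law purely for illustration, not measured data).
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = 12.0 * sizes ** (-0.08)

a, b = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest fitted run.
extrapolated = float(predict_loss(a, b, 1e9))
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e9: {extrapolated:.3f}")
```

On real measurements the points will not sit exactly on a line, and the further the extrapolation reaches beyond the fitted range, the less the prediction should be trusted.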
The article traces the idea back to earlier learning-curve research, including work from the 1990s and later studies that found error often falls as a power law in data size. Across tasks such as translation, speech recognition, image classification and language modeling, researchers repeatedly observed that larger datasets and larger models reduce error in smooth, often similar ways.
A key point from those studies is that the slope of the curve appears to be tied more to the problem domain than to a specific architecture. In other words, changing the model may shift the curve up or down, but the overall rate at which error falls can remain similar. The review also notes that these empirical laws are more practical than broad theoretical capacity bounds, which often do not capture deep learning behavior well.
The article then turns to the influential 2020 work by Kaplan and colleagues, which helped popularize scaling laws in language models. Their experiments showed that loss fell predictably as model size, dataset size and compute each increased. They also concluded that, under a fixed compute budget, it was better to train a larger model and stop well short of convergence than to train a smaller model to convergence.
That recommendation later became a point of contention. The review says the Kaplan analysis suggested model size should grow faster than training data when compute rises, implying a relatively small token budget for bigger models. But the later Chinchilla study challenged that view.
The Chinchilla study, published by Hoffmann and colleagues in 2022, examined the same broad question, but with a different experimental design focused on compute-optimal training. Its conclusion was that many large language models had been trained on too few tokens for their size. In this view, improving performance means scaling data and model size more evenly, rather than favoring parameters so heavily that the model remains undertrained.
The practical implication is that a lab that misjudges the balance between parameters and tokens can spend a large compute budget inefficiently. A model may have the capacity to reach a lower loss, but if it does not see enough data, that capacity is wasted.
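The balance between parameters and tokens can be made concrete with a small calculation. The sketch below is illustrative only. It assumes the commonly used approximation that training costs about 6 FLOPs per parameter per token, and a ratio of roughly 20 tokens per parameter, a rule of thumb often associated with the Chinchilla results. Neither figure comes from the review, and the function name `allocate_compute` is hypothetical.

```python
import math

def allocate_compute(compute_flops, tokens_per_param=20.0,
                     flops_per_param_token=6.0):
    """Split a FLOP budget into a parameter count and a token count.

    Assumptions (not from the review):
      * compute ~= flops_per_param_token * params * tokens
      * tokens = tokens_per_param * params
    Solving the two equations gives params = sqrt(C / (k * r)).
    """
    n_params = math.sqrt(
        compute_flops / (flops_per_param_token * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget of 1e21 FLOPs (an arbitrary illustrative value).
params, tokens = allocate_compute(1e21)
print(f"params ~ {params:.3e}, tokens ~ {tokens:.3e}")

# Under a fixed tokens-per-parameter ratio, 4x the compute
# doubles both the parameter count and the token count.
params_4x, tokens_4x = allocate_compute(4e21)
```

Holding the ratio fixed means model size and data grow together as compute rises, which is the "more even" scaling the Chinchilla view favors. A rule that grew parameters faster than tokens would correspond to a ratio that shrinks as the budget increases.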
The review describes several ways researchers fit these laws in practice, including holding model size fixed while varying token budgets, plotting isoFLOP curves, and using parametric fits that estimate loss from smaller experiments. These methods are designed to identify the compute-optimal frontier, where a given FLOP budget produces the lowest expected loss.
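One of these methods, the isoFLOP curve, can be sketched in a few lines. The idea is to train several model sizes at the same FLOP budget, plot final loss against the logarithm of model size, and read off the size at the bottom of the resulting valley. The version below assumes the valley is well approximated by a parabola in log model size, and it uses synthetic losses rather than real measurements. The function name `isoflop_optimum` is hypothetical.

```python
import numpy as np

def isoflop_optimum(model_sizes, losses):
    """Estimate the loss-minimizing model size on one isoFLOP curve.

    Fits loss ~= c2*x**2 + c1*x + c0 with x = log10(model size)
    and returns the vertex of the parabola.
    """
    log_n = np.log10(np.asarray(model_sizes, dtype=float))
    c2, c1, c0 = np.polyfit(log_n, losses, 2)
    if c2 <= 0:
        # A downward-opening or flat fit has no interior minimum.
        raise ValueError("fitted curve has no minimum")
    log_n_opt = -c1 / (2.0 * c2)
    loss_opt = np.polyval([c2, c1, c0], log_n_opt)
    return float(10.0 ** log_n_opt), float(loss_opt)

# Synthetic runs at a single FLOP budget (assumption: an exact
# parabola with its minimum at 10**8.8 parameters, for illustration).
sizes = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9, 3.2e9])
losses = 2.5 + 0.3 * (np.log10(sizes) - 8.8) ** 2

n_opt, loss_opt = isoflop_optimum(sizes, losses)
print(f"optimal size ~ {n_opt:.3e} params, loss ~ {loss_opt:.3f}")
```

Repeating this for several FLOP budgets gives one optimal size per budget, and those points together trace out an estimate of the compute-optimal frontier.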
The article also emphasizes that scaling laws are not perfect. They can be harder to fit in real-world settings, especially when data is limited, architectures change, or assumptions about unique training tokens do not hold. Still, the overall message is that scaling laws have become a core planning tool for deep learning teams trying to decide how to allocate scarce compute between bigger models and more data.