Researcher Uses Cutting Planes to Find Optimal Tokenizers in Some Cases

A researcher has described an algorithm that can find provably optimal tokenizers in some settings, using a technique inspired by methods long associated with hard optimization problems like the Traveling Salesman Problem.

The work, shared in a detailed project write-up, shows that while tokenization is generally considered theoretically difficult, practical cases may still be solvable exactly. The researcher said the approach combines integer linear programming with cutting-plane methods to tighten lower and upper bounds until they meet at an optimal solution.

Tokenization as an optimization problem

Tokenizers are the systems that break text into tokens before large language models are trained. Those tokens map to byte sequences, often common words or word fragments. In common practice, teams choose a fixed-size vocabulary that compresses training text as efficiently as possible. The standard method for building such vocabularies is byte-pair encoding, a greedy compression technique that has been used for years.

The project builds on recent academic work that framed tokenization as an integer linear programming problem. In that formulation, one set of variables represents which byte sequences are included in the vocabulary, while another set represents how the text is actually encoded using those entries. The goal is to minimize the total number of tokens needed to represent the dataset.

Because the exact integer problem is hard to solve directly, prior work relaxed it into a continuous linear program and then rounded the result back to an integer tokenizer. That approach can produce a useful approximation, but it does not guarantee an optimal answer.

Using cuts to narrow the gap

The new work tries to do more than round an approximate solution. Instead, it repeatedly solves the relaxed problem, then adds extra constraints that are valid for all true integer solutions but violated by the fractional one. This approach, known as cutting planes, is widely used in optimization and has been especially successful in traveling salesman solvers.

The researcher said initial attempts relied on an AI coding tool to suggest useful constraints, but many early ideas were too weak to matter. A more effective strategy came from brute-force searches on small projections of the problem, which uncovered cuts that improved both the lower bound and the quality of the tokenizer.

Those brute-force findings were then turned into more efficient templates. The most effective family, called cycle constraints in the write-up, looks for groups of overlapping fractional edges that conflict with one another. The researcher said these can be found by building a graph of overlapping colors and running depth-first search to identify cycles.

Early results on book-length text

The experiments were constrained by hardware. The researcher said the work was done on a Mac Studio and Mac mini using CPU-based solvers, mainly HiGHS. To keep the optimization problems manageable, the text was pretokenized into words, repeated words were merged, and rare substrings were dropped from the vocabulary search.

Even with those limits, the method produced provably optimal tokenizers for some toy problems. The researcher highlighted a vocabulary-size-512 tokenizer for Pride and Prejudice that converged after about a dozen iterations and took a little over a day to compute.

When the same setup was expanded to a vocabulary size of 1024, the cycle constraints were no longer enough on their own. The researcher said other cut families were still needed, and those experiments had not yet finished.

Limits and next steps

The write-up stresses that the approach may not be broadly practical yet. Existing tokenizers were already close to optimal in many cases, the researcher said, and a tokenizer that is optimal on training data may not generalize well to held-out text. In addition, efficiency losses can sometimes be absorbed by slightly increasing vocabulary size.

Still, the project points to a possible path toward exact tokenizer search in small or medium-sized cases. The researcher said the main bottleneck now is LP solve time, with some iterations taking hours or even days. Future work could involve scaling the method to larger corpora, finding new cut families, or removing the pretokenization step entirely.

The code for the project has been published on GitHub, along with the resulting vocabulary for the Pride and Prejudice experiment.