Developer details how he built a vintage-style LLM trained on old texts

A developer has published a detailed account of building a small language model designed to reflect English usage from before 1900, using what he describes as custom data pipelines, training scripts and a tokenizer built for the project.

The model, called Vintage LLM, is based on the Llama architecture and has about 340 million parameters. The creator says it is an English-only, time-locked model trained on historical material rather than modern web data. He has released the model on Hugging Face and made the code open source on GitHub.

The project began after the developer encountered Reddit posts about other historical language models earlier this year. He says those examples prompted him to try building one himself. He says he has worked on the model every day since, including while sick, and ran multiple experiments before settling on the final approach.

A major part of the work centered on data collection and cleaning. The developer said he did not want the model to learn modern topics such as computers, atomic bombs or spacecraft, so he assembled his own corpus from older sources. Those included Project Gutenberg, Oxford Text Archive, Internet Archive books, British Library materials, Library of Congress public domain books, American stories and old newspaper collections. He said he tried to limit the training material to English texts published before 1900.
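The cutoff rule described above can be illustrated with a minimal Python sketch. It is not the developer's code: the `Record` fields and function names are hypothetical, and it assumes each document carries a publication year (or `None` when no date could be verified) and a language tag.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

CUTOFF_YEAR = 1900  # the article's stated limit: texts published before 1900


@dataclass
class Record:
    """Hypothetical corpus record; field names are assumptions, not the author's schema."""
    title: str
    year: Optional[int]  # None when no publication date could be verified
    language: str        # e.g. "en"
    text: str


def is_time_locked(rec: Record, cutoff: int = CUTOFF_YEAR) -> bool:
    """Keep only English records with a verified year strictly before the cutoff."""
    if rec.year is None:
        # An unverifiable date means the record is dropped, whatever its quality.
        return False
    return rec.language == "en" and rec.year < cutoff


def filter_corpus(records: Iterable[Record]) -> Iterator[Record]:
    """Lazily yield the records that pass the time-lock check."""
    return (r for r in records if is_time_locked(r))
```

In this sketch a book dated 1900 is rejected along with undated and non-English records, which errs on the side of losing data rather than letting later material in.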

Filtering and deduplication appear to have been among the most time-consuming pieces of the effort. The developer wrote that he discarded documents when he could not verify a date, even if the text quality was good, and that he removed badly scanned or error-ridden material. To identify low-quality text, he said he used a mix of simple heuristics, including checks on character diversity, compression ratio, entropy and a custom quality score aimed at spotting noisy OCR output.
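The heuristics named above are simple to compute. The following Python sketch shows one plausible way to combine them using only the standard library. The thresholds and the `looks_clean` rule are illustrative assumptions, not the developer's actual quality score, which the post does not spell out here.

```python
import math
import zlib
from collections import Counter


def char_diversity(text: str) -> float:
    """Fraction of distinct characters among all characters."""
    return len(set(text)) / len(text) if text else 0.0


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; very low values signal heavy repetition."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw) if raw else 0.0


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def alpha_fraction(text: str) -> float:
    """Share of characters that are letters or whitespace; noisy OCR tends to score low."""
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)


def looks_clean(text: str, min_ratio: float = 0.2, min_entropy: float = 3.0,
                max_entropy: float = 5.0, min_alpha: float = 0.85) -> bool:
    """Hypothetical combined check; every threshold here is an assumption."""
    return (compression_ratio(text) >= min_ratio
            and min_entropy <= char_entropy(text) <= max_entropy
            and alpha_fraction(text) >= min_alpha)
```

Short strings compress poorly because of zlib's fixed overhead, so checks like these are more meaningful on page-sized or longer chunks of text.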

He also said he tested a range of storage systems for the dataset, including Qdrant, Zvec, Lance, Valkey and LevelDB, before settling on LevelDB for reliability and lower resource use on his hardware. According to the post, the full dataset processing was done on his own PC, while larger training runs were moved to rented GPU services because they would have taken too long locally.

The developer estimates the project cost about $80 in GPU fees. He said the expense was kept low because his desktop handled much of the preprocessing work. His main machine, according to the post, runs Linux and has an AMD Ryzen 7 9700X CPU, 64GB of RAM and a Radeon RX 9070 graphics card with 16GB of VRAM.

The creator also said he built a custom tokenizer trained on clean English books because he did not want programming terms or non-English vocabulary taking up space in the model’s lexicon. He described the system as a hobby project and warned that it will still hallucinate. He also said the model has not been aligned or censored, and that this was intentional because such work would reduce historical accuracy.
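To see why the training corpus determines what ends up in a tokenizer's vocabulary, consider a toy byte-pair-encoding (BPE) trainer in Python. This is only a sketch: the article does not say which tokenizer algorithm the developer used, so BPE is an assumption, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace each adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from lowercased, whitespace-split words."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(w) for line in corpus for w in line.lower().split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def vocabulary(corpus: list[str], merges: list[tuple[str, str]]) -> set[str]:
    """Base characters seen in the corpus plus every merged symbol."""
    chars = {c for line in corpus for w in line.lower().split() for c in w}
    return chars | {a + b for a, b in merges}


def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word.lower())
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because every merged symbol comes from pairs that actually occur in the training text, a tokenizer trained this way on period English books would spend none of its merges on code syntax or foreign-language fragments that never appear in them.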

While he presents the model as an experiment rather than a production system, the project reflects a broader wave of interest in historically themed language models. The developer said he had already experimented with earlier, smaller versions and expects future models to benefit from the lessons learned during this run.