{"slug": "making-a-vintage-llm-from-scratch", "title": "Making a vintage LLM from scratch", "summary": "A developer built a 340-million-parameter vintage LLM from nearly scratch, training it exclusively on texts from before 1900 using custom scripts and datasets. The open-source model, available on HuggingFace, is time-locked to a knowledge cutoff of 1900 and cost approximately $80 in GPU compute. The project aims to create a historically accurate chatbot that reflects 19th-century language and norms, though the developer warns it may generate content considered toxic by modern standards.", "body_md": "# Making a vintage LLM from scratch\n\nIn this blog post, I will share the adventures I had creating my own LLM, from (almost) scratch, trained only on old texts.\n\nI made my own base-training and fine-tuning scripts, data processing pipelines and custom datasets.\n\n(\"almost from scratch\" means I did use existing programming languages and libraries, I didn't write in Assembly, just like anyone else who builds an AI \"from scratch\"...)\n\nThe model can be found on HuggingFace: [https://huggingface.co/croqaz/vintage-LLM-340m-v1-base](https://huggingface.co/croqaz/vintage-LLM-340m-v1-base) ;\n\nAll the code is open source at: [https://github.com/croqaz/vintage-LLM](https://github.com/croqaz/vintage-LLM) ;\n\nIf you want to check bigger Vintage models, see my previous post: [Vintage LLM models](/log/entries/260428-1/).\n\n### The idea\n\nThree months ago at the end of February I discovered a few Reddit posts by [Hayk Grigorian](https://huggingface.co/haykgrigorian), where he described creating his temporal gated language model. 
I was absolutely fascinated.\n\nTraining an LLM only on 1800s London texts, 90GB dataset: [https://reddit.com/r/LocalLLaMA/comments/1pkpsee/training_an_llm_only_on_1800s_london_texts_90gb](https://reddit.com/r/LocalLLaMA/comments/1pkpsee/training_an_llm_only_on_1800s_london_texts_90gb)\n\nLLM trained from scratch on only 1800s London texts brings up a real protest from 1834: [https://reddit.com/r/LocalLLaMA/comments/1mvnmjo/my_llm_trained_from_scratch_on_only_1800s_london](https://reddit.com/r/LocalLLaMA/comments/1mvnmjo/my_llm_trained_from_scratch_on_only_1800s_london)\n\nObviously I had read posts by other people who made their own LLMs, but maybe I wasn't ready to do it myself, or the model they were working on wasn't that interesting. Anyway, the thought of having my own Victorian chat bot... fuckin' epic !!\n\nSince then, I have worked on my own \"Vintage LLM\" every single day. Without exceptions. Even when I was sick.\n\nIn the meantime, a lot more historic LLMs have been released, like: [Violet-1B4-Chat](https://huggingface.co/zakarth/violet-1b4-chat), [Mr. 
Chatterbox](https://huggingface.co/tventurella/mr_chatterbox_model), [GPT-1900](https://huggingface.co/mhla/gpt1900-instruct-v3-sft), [Talkie](https://huggingface.co/talkie-lm/talkie-1930-13b-it) and [TypewriterLM-base](https://huggingface.co/typewriter-ai/typewriter-1913-7B-base).\n\n### What, why, where and how?\n\n**What?**\n\nThis is a time-locked LLM/ historical LLM, English only, and its knowledge cutoff is the year 1900.\n\n(Limiting to a specific year is error prone, but I did my best.)\n\nIt is based on the Llama architecture and has 340M (0.3B) params.\n\n**Why?**\n\nBecause I can only learn if I do it myself and it's a super fun project.\n\n**Where and how?**\n\nI made my own dataset, my own processing and training code.\n\nThe code is semi-vibe-coded with whatever LLM I had with VS-Code and PI (OpenRouter models).\n\nI checked and validated every single function and I deeply understand what every single code file is doing.\n\nThe dataset processing took the most time. I tried all sorts of things that didn't work, and I wasted a ton of time. Complicated solutions are the worst...\n\nI processed all the data on my own PC and I trained smaller versions of the LLM on my PC (*Cachy OS Linux, AMD Ryzen 7 9700X CPU, 64GB RAM, Radeon RX 9070 16GB VRAM*).\n\nAs for the larger 340M model, I trained it on RunPod, ThunderCompute and [Vast.ai](http://Vast.ai). It would have taken forever on my PC.\n\nThe total cost of this project was: ~$80, GPU costs only.\n\nThat's because I have a decent PC to process the data. If I had more RAM, I could have processed some of the data much faster, especially when it comes to de-duplicating texts in memory.\n\n**Disclaimer**: This is a toy/ hobby LLM (but I treat it very seriously).\n\nIt will hallucinate and generate historic semi-accurate content which was considered normal at the time but, by today's standards, is toxic, offensive and unsafe. This is expected, because I didn't do any alignment. 
Aligning (or censoring) the model requires significant effort and it would ruin the historic accuracy.\n\nAlso, I can't guarantee that my model is strictly limited to the year 1900 (even if I did my best), e.g. well enough to pass the \"Albert Einstein test\".\n\n### The plan\n\nI use AI every day at work and I understand how it works, but I had never built an LLM myself. I ran specific AI training and fine-tuning pipelines at work, I built tiny neural networks in C and Python in the past, but when I started this project I didn't know how people usually build LLMs.\n\nI searched for a week and I chatted with multiple bots to get different points of view (like I always do when I research a topic).\n\nIn short, to build an LLM you need 4 things:\n\n- **the data** -- an LLM has no discernment or understanding. It will learn from anything you tell it to, good or bad. This is the longest process.\n- **tokenization** -- the Tokenizer is a little program that converts words or letters into numbers (tokens). LLMs don't understand words, they only understand numbers.\n- **pre-training** -- it's a confusing expression and it means \"base-training\", where the LLM learns to autocomplete text. 
If you're going for 300M+ params, this is the most expensive process.\n- **fine-tuning** -- where the LLM learns how to chat in turns, question & answer.\n\nWell, there's a bit more to it than these simple steps, but I won't go super deep in this article.\n\nNow, let's look at each step in more detail.\n\n### Initial experiments\n\nIt's worth mentioning that I made lots of mistakes and I experimented with some datasets and model architectures before I settled on the \"big\" model.\n\n(\"big\" in quotes here, because you know, compared to larger models like Talkie-13B or TypewriterLM-7.24B, my model is just a toy)\n\nSome details about my v2 toy EleutherAI/pythia-14m that I trained on my own PC: [https://github.com/croqaz/vintage-LLM/tree/e272b94fcf96316f874babbed549d20809fe5a39/models/m-v2](https://github.com/croqaz/vintage-LLM/tree/e272b94fcf96316f874babbed549d20809fe5a39/models/m-v2)\n\nIf you look at the validation loss and perplexity SVG files, you'll see huge jumps. I didn't randomize the file chunks and the dataset files were tokenized in alphabetical order, so the clean book files happened to come first. As soon as the model was exposed to the Time-Capsule dataset, it gradually became worse and worse, because that dataset was bad: lots of weird OCR artefacts, broken words & sentences and so on.\n\nMistakes were made ... 
but I learned from them.\n\nI was stuck for a while, until I started to filter out the bad documents.\n\n### Data\n\nThe data processing was the longest and most boring process by far, and I hope you understand why...\n\nThere are plenty of modern, high quality datasets with data scraped from the internet, but I didn't want my LLM to learn about computers, atomic bombs and space-ships, so I had no choice but to make my own.\n\nLuckily there are some datasets available, but they are of pretty bad quality and most of my work was to de-duplicate, filter out the really bad texts and enhance some of the existing texts.\n\nThe historic datasets are quite limited. The old books are all we have unless we discover more old books and someone scans them, so we have to use what we have.\n\nA few datasets worth mentioning are: Project Gutenberg, Oxford Text Archive, Internet Archive books, TheBritishLibrary/blbooks, storytracer/LoC-PD-Books, dell-research-harvard/AmericanStories, dell-research-harvard/NewsWire, Heritage Made Digital Newspapers (HMD).\n\nI did my best to find out the year and language of each dataset, so I could limit everything to English before the year 1900.\n\nI completely ignored documents that don't specify a year, or where I couldn't find a date in the text, even if they had good quality, just to be safe.\n\nAs a side-project, I created the Book-Metadata HF dataset with lots of old books, title, author, book ID and source: [https://huggingface.co/datasets/croqaz/book-metadata](https://huggingface.co/datasets/croqaz/book-metadata) ; My goal was to find out the year of all Gutenberg books, but I only found 5300 books where I'm 100% sure of the year.\n\nAgain, this took absolutely forever and as I am writing this blog post, I'm still not completely finished.\n\nIf I ever train another LLM, I'll have more and better data next time.\n\nInitially I wanted several de-duplication methods, including MinHash and embeddings vector similarity. 
If you don't know what it means, don't worry, I won't go into details. It was way too slow and expensive and I had to abandon it. Just to give you a taste of how slow this was, I had a beefy DEV server calculate embeddings for my short text datasets and I did 10% of the dataset in a week with the server running day & night. The server had an RTX 3090 GPU that I shared and so on.\n\nIn the end, I de-duplicated based on the normalized text (lower-case text with all spacing removed). Basically the text \"hello world\" is identical to the text \" Hello World \" (notice the spaces and title case), and the text is only saved once in my dataset.\n\nFrom the start, I knew that the data is the most important thing: garbage in -> garbage out. I did lots and lots of experiments and iterations, and I played with DBs like Qdrant, Zvec, Lance, ValKey, and LevelDB for storing the datasets.\n\nI dropped Qdrant because the DB disk size was huge even before I started adding many entries.\n\nI dropped Zvec because I had no way to iterate over the DB entries: basically, once you save them, there's no way to explore the DB. I [created an issue for this](https://github.com/alibaba/zvec/issues/380). Zvec is also super new and I should probably give it more time to mature.\n\nI dropped Lance because of the versioning: the DB becomes slower and slower once you start adding more than a few million entries. This could be my fault, I'm sure I can find a way to do this better.\n\nI dropped ValKey because I ran out of memory. After ingesting something like 10 mil records, the server started to OOM crash on my PC, but I still had to ingest way more data. Other than that, ValKey was really great.\n\nI ended up using LevelDB (which was used by local wallet apps to store Bitcoin and Ethereum transactions), so I know it can scale on my PC. I ingested 12 mil rows without any issues and with minimal CPU or RAM usage. 
LevelDB can be slow at times, but it's consistently reliable.\n\nIf I had a better PC, or a super computer, I would probably have used ValKey all the way.\n\nTo get a sense of the quality of the texts, I first looked at the length and the number of unique characters of each document. For the first stage I decided to use short texts (up to 32k long) and for the second, long texts up to 10MB. English shouldn't use more than 30-50 symbols normally; if a chunk of text has 100 or more unique symbols, it's not English, so I discarded those. And if a text has only 8 unique symbols, it doesn't make any sense, so I removed those too.\n\nI applied 3 more filters.\n\nThe first is a super easy metric: the ZLIB compression ratio. Text that is too short or too diverse gets a big value, and text that is very repetitive gets a really small value.\n\n```python\nimport zlib\n\n# ZLIB compression ratio\n# A good window is 0.5...0.7;\ndef compression_ratio(text) -> float:\n    raw = text.encode(\"utf-8\")\n    compressed = zlib.compress(raw)\n    return len(compressed) / len(raw)\n\ncompression_ratio(\"Lorem ipsum dolor sit amet\")\n# 1.3\ncompression_ratio(\"other, and other and other\" * 100)\n# 0.01 -- very repeated\ncompression_ratio(\"The President has nominated Thomas Johnson, William Cranch, and Charles\\nSimms, Judges of the district of Columbia.\\n\\nOn Saturday last, Thomas Jefferson, at\\npresent Vice President of the United States,\\nand President of the Senate, took leave of\\nthat body on which occasion he delivered\\nthe following address:\\n\\nGentlemen of the Senate,\\n\\nTo give the usual opportunity\")\n# 0.64 -- Regular text\n```\n\nAlso the Shannon entropy:\n\n```python\nfrom collections import Counter\nfrom math import log2\n\n# Shannon Entropy (per character)\n# A good window for English text is approximately 4.2...5.5;\ndef char_entropy(text) -> float:\n    counts = Counter(text)\n    total = len(text)\n    entropy = 0.0\n    for count in counts.values():\n        p = count / total\n        entropy -= p * log2(p)\n    return entropy\n\nchar_entropy((\"a \" 
* 10 + \"!\"))\n# 1.22 -- Too low\nchar_entropy(\"Lorem ipsum dolor sit amet\")\n# 3.6 -- A bit low\nchar_entropy(\"IN the High court of Chancery for the Rich\\nmond District,\\nBetween\\nHenry Banks plaintiff,\\nAnd\\nNathaniel Anderson, Robert Pollard.\")\n# 4.5 -- Regular English\nchar_entropy(''.join(chr(i) for i in range(200)))\n# 7.6 -- Super high entropy\n```\n\nAnd my own quality detection filter. This helped me identify very bad OCR texts, with lots of weird symbols:\n\n```python\nimport re\n\n_LETTER_RE = re.compile(r'[a-zα-ωàâäçèéêëîïôöùûüÿæœß]$', re.I)\n_DIGIT_SPACE_RE = re.compile(r'[0-9 \\n]$')\n_PUNCT_RE = re.compile(r'[.,;!?\\'\"_\\-]$')\n\n# Cro's custom quality score\n# Regular letters get a score of 2;\n# Digits and spaces get a score of 1;\n# Punctuation gets 0.5;\n# Any other characters get -0.5.\ndef quality_score(text: str) -> float:\n    score = 0.0\n    for c in text:\n        if _LETTER_RE.match(c):\n            score += 2\n        elif _DIGIT_SPACE_RE.match(c):\n            score += 1\n        elif _PUNCT_RE.match(c):\n            score += 0.5\n        else:\n            score -= 0.5\n    return ((score / len(text)) - 0.75) * 100\n\nquality_score(\"-.•-, , _ ._. , •\\n1.\\n- ,\\n.... - • 4\\n-------._\\n..77/:',. .., ..., : •\\n':'#)).,-'' 1 -\\n__LI -......-_-.A..-...',... ) •\\nSilli° , A AilitA4‘\\n' •:.:-',73 ..-.-..-_-_,N>\\n4. .41t.\")\n# -8.72\nquality_score(\"Lorem ipsum dolor sit amet\")\n# 109.61\n```\n\nAfter calculating these metrics, I simply removed the documents with low quality.\n\nThe percentage of removed documents was less than 1%.\n\nAll of these experiments took forever, and the limiting factor was always the compute. 
There are a few more datasets that I could process (eg: Institutional Books 1.0, massive 947GB dataset, 983K books, 386M pages, etc) but I don't want to kill my PC, and I already have plenty of data for my tiny LLM.\n\nYou can find my dataset on HuggingFace: [https://huggingface.co/datasets/croqaz/vintage-v1](https://huggingface.co/datasets/croqaz/vintage-v1) ;\n\n### Tokenization\n\nI had to make my own tokenizer, because my model will never need a vocabulary like \"async function\", \"import sys\", \"public class Main\", or any other programming language terms.\n\nAnd this is an English-only LLM, so I don't want it to pick up expressions from other languages.\n\nIf I had used an existing tokenizer, all of those useless words would have wasted space in the vocabulary.\n\nI trained my tokenizer on English Gutenberg and Oxford books, because they are very clean.\n\n### Base-training stage-1\n\nOnce I had a basic dataset, I quickly created a super basic [Pythia-14M params](https://github.com/EleutherAI/gpt-neox/blob/main/configs/pythia/14M.yml) model (that's 0.01B params), using the [litGPT](https://github.com/Lightning-AI/litgpt) framework. It was pretty straightforward and painless initially.\n\nI had to do some tricks to make it work on my PC, because I have a Radeon GPU and most of the AI training scripts just assume you have an NVIDIA GPU...\n\nIn just one hour, I had a tiny LLM that was generating semi-random English-sounding gibberish, eg:\n\n```\nHe presented them with one of the soul, which was made in the same manner ;\nI have also been at this moment's, and that it was not that this month should be allowed to\nbe without a short time. 
It was not yet a word for the sake of the man who was known that he\nwas in the case of the law that he was going herein, which was made to it, and be a thing\nto be understood, that the word must be ...\n\nIt was a wild and strange desire, and has, on his head, and the only farm, or the black sky,\nhe was in the midst of the face of the night.\nBut his first illness was a great, and the heart he was a man who had his hands,\nfor he was a very much gotten, and he was very weak.\nHis mouth were, but was in his hand.\nHe was a man of great honour, and there was one of ...\n```\n\nAfter playing with it a bit more, I tried to train a bigger Pythia-70M and litGPT started to crash randomly with errors like:\n\n```\nERROR: Unexpected segmentation fault encountered in worker.\nraise RuntimeError(\n  f\"DataLoader worker (pid(s) {pids_str}) exited unexpectedly\"\n) from e\n```\n\nI used the exact same tokenizer and dataset as for Pythia-14M, and I also tried a Pythia-31M and a Pythia-160M, but the 70M just didn't work. I tried debugging this for a while without success, I wasted a lot of time and in the end I decided to make my own training script, loosely inspired by litGPT and nanoChat.\n\nI discovered that the HuggingFace transformers library has a [Trainer class](https://huggingface.co/docs/transformers/main_classes/trainer) and I did some vibe-coding to create something basic.\n\nI downloaded the source code for [nano-GPT](https://github.com/karpathy/nanoGPT), [nano-Chat](https://github.com/karpathy/nanochat), [NanoWhale-100m](https://github.com/huggingface/nanowhale), and a few more repos with code \"from scratch\", and I got different LLMs to extract the technical details of how they do the base-training, then I asked the LLMs to apply the best features to my own training script. 
This way I managed to get something pretty solid and flexible in a few iterations.\n\nI tweaked the code over and over again by hand, until I got it to my liking, and I understood what it was doing.\n\nI intentionally didn't want to use nanoChat, because I wanted 100% compatibility with the HuggingFace libraries and with GGUF in the future.\n\nAfter training a few debug versions on my own PC, it was time to go bigger and train in the cloud.\n\nI tried a 200M params model (with a 2048-token context) and a 340M params model (with a 4096-token context), both based on the Llama3 architecture.\n\nRunPod was OK and featureful, but pricey. I trained the 0.2B params LLM, but I didn't upload that anywhere, because it had a very small context, and I decided to discard it, even if I spent some money... I really wanted a 4k+ context size.\n\nI started training the 0.3B params Llama on ThunderCompute. It was very slow and the production price of the GPUs was even higher than RunPod, a big disappointment, but I had put in $25 of credit and I wanted to spend all of it, so I let the training run until it consumed all the credits.\n\n[Vast.ai](http://Vast.ai) had all sorts of issues at first with their base Docker image, not being able to start any instances, but after some trying and re-trying, I found a good instance in Czechia with an NVIDIA RTX PRO 5000 and I let it run for 10,000 steps (a bit less than 3B tokens). 
I will definitely try [Vast.ai](http://Vast.ai) again, because the price is very competitive and once you find a good instance, it's very fast and stable.\n\nThis stage cost me $25 + $15 on ThunderCompute + [Vast.ai](http://Vast.ai).\n\nTraining metrics for this stage:\n\n### Base-training stage-2\n\nI used the long texts, from ~20k up to 10MB in size.\n\nIn this stage I realized that the long texts have very good scores for all previous metrics: custom quality, entropy and compression, but they contain large chunks of bad OCR artefacts, eg:\n\n```\n--- Match #5/12310 (0fc6a6f36d5e1f02f9e5b000c480efe5005b6be5de6f819b9309b60b577fb242) ---\n{\n  \"source\": \"LOC\",\n  \"len\": 385331,\n  \"uniqChar\": 64,\n  \"tokens\": 73087,\n  \"sentences\": 4555,\n  \"entropy\": 103.85,\n  \"quality\": 95.95,\n  \"compress\": 91.92,\n  \"text\": \"M; \\n\\n\\n\\n1 ? 70 \\n\\n\\n\\nary. \\n\\n\\n\\nwtodrawn. timm \\n\\nM COLLEGE LIBRARY \\n\\n\\n\\nMARK \\n\\n\\n\\nY \\n\\n\\n\\n\\\\ \\n\\n\\n\\nJ \\n\\n\\n\\nREV. JAMES DAVIE S \\n\\n\\n\\nCOFI ANT \\n\\n\\n\\nY DIWEDDAR \\n\\n\\n\\nB ARC II. JAMES DAYIES, \\n\\n\\n\\nGan Y PARCH. G. GRIFFITHS, \\n\\nCINCINNATI, 0. \\n\\n\\n\\nUTICA, N. Y. \\n\\nT. J. GRIFFITHS, ARGRAFFYDD, EXCHANGE BUILDINGS. \\n1875. \\n\\n\\n\\nI \\n\\nw \\n\\n\\n\\nC YN W YS I AD. \\n\\n\\n\\nAt y Darllenydd 4 \\n\\nRhaglith 5 — 11 \\n\\ncofiant 12 — 4q \\n\\nEnglynion Coffadwriaethol 49 \\n\\nPregethau 50—96 \\n\\nAmlinelliad o'i Brif Weddnodau 97 — 106 \\n\\nAdgofion Personol am Dano 107 114 \\n\\nAdgofion Brodyr Eraill am Dano 114— 133 \\n\\nYsgubau o Loffion Difyrus am Dano i34_ I4 g \\n\\nPenillion Coffadwriaethol 149—156 \\n\\nEnglish Department 157 ^3 \\n\\nCawellaid o Fan Saethau ei Fwa 174—187 \\n\\nPynciau Ysgol < . 188—222 \\n\\n\\n\\nAT Y DARLLENYDD. \\n\\n\\n\\nEr ys llawer o flynyddoedd cyn ei farwolaeth ceisiasai y Parch. \\nJames Davies gan ei hen gyfaill, Iorthryn Gwynedd, ysgrifenu ei \\ngofiant, os goroesai ef. 
Addawodd yntau wneyd; a chydnabyddid yn \\ngyffredinol ddoethineb y dewisiad. Ond o herwydd anghydwelediad \\ncydrhyngddo a'r perthynasau, parthed i ddull ei ddygiad allan, efe a \\nymwrthododd yn bendant a'r cyfrifoldeb. Teimlwn fod yn ddyled- \\nswydd arnom ddweyd cymaint a hyna mewn cyflawnder a'r brawd \\nThomas, er mwyn symud ymaith y dybiaeth mewn rhai manau, mai \\nam na roddasid iddo fwy o arian at yr anturiaeth yr ymryddhaodd \\noddiwrth ei ymrwymiad. Wedi iddo ef nacau, disgynodd y gwaith \\narnom ni, trwy ddewisiad unol, a thaer ddymuniad y weddw a'i \\nphlant. Yr oedd yn anhawdd genym gydsynio, o herwydd amledd a \\nphwysigrwydd ein dyledswyddau i bobl ein gofal. Ond (i'w credyd \\ncofnoder y ffaith), pan ddeallasant fod hyny yn peri i ni betruso, pas- \\niasant yn unfrydol benderfyniad i ysgafnhau ein dyledswyddau idd- \\nynt hwy tra yn parotoi y bywgraffiad, ac wedi y gorphenem, eu bod \\nyn rhoddi i ni fis o orphwysiant.   ...   nnheilwng. Ymdrechwn fod yn gyfarwydd a nodau gwir \\nGristion. Nac ymorphwyswn ar farn dda a ddwg dyn- \\nion am danom. Gymerwn y gair yn rheol ; ymarferwn \\nffydd ar aberth Iesu, a llefwn am oleuni a chymorth yr \\nYsbryd Glan at y gorchwyl. Gochelwn gymeryd ein \\ntymer na'n teimladau yn rheol i holi ein hunain ; ond \\nBeibl Duw fyddo ein prif faen prawf. Dyma reol y \\nfarn. \\n\\n\\n\\nY DIWEDD. \\n\\n\\n\\n\"\n}\n--- Match #31/87899 (70fa969d373b988d5537a08ed47ccd6842c17bdcecb2cb55181e3cf37f3e13a1) ---\n{\n  \"source\": \"LOC\",\n  \"len\": 330083,\n  \"uniqChar\": 64,\n  \"tokens\": 62805,\n  \"sentences\": 5311,\n  \"entropy\": 104.92,\n  \"quality\": 94.84,\n  \"compress\": 92.95,\n  \"text\": \"\\nQass_„L X724 ( \\nBook _ \\n\\n\\n\\nCOFIANT, \\n\\n\\n\\nPREGETHAU, A BARDDONIAETH \\n\\n\\n\\nY DIWEDDAE \\n\\n\\n\\nBARCH. T. PIERCE, LIVERPOOL; \\n\\n\\n\\nPAKCH. H. E. THOMAS, \\n\\nBIEKENHEAD. \\n\\n\\n\\nLIVEKPOOLr AEGEAEEWYD GAN THOMAS LLOYD, \\n\\nTITHEEAKN STEEET. 
\\n\\n\\n\\nIBXtxgq \\n\\n\\n\\nIn Exchange \\nThe Nat. Libr'y of Wales \\nAug. 1928 \\n\\n\\n\\nAT Y DARLLENYDD, \\n\\n\\n\\nAmcenAIS wneyd y Cofiant nior fyr a cbryno ag y galhvn, er mwyn \\nxhoi y Pregetliau a'r Farddoniaetb i mewn. Diau y buasai yn dda \\ngan lawer ei gael yn gynt, ond nid oedd modd i mi \\\"vraeyd, a \\nebredwyf fod peth mantais Avedi ei gyraedd trwy yr oediad. Dylai \\nesgeulustra Mr. Pieece yn cadw cofnodion o'i fywyd, fod yn \\nrhybydd i'r rhai y byddai eu Cofiant o ddyddordeb ac addysg i'r \\nbyd. Dymunaf gydnabod yn y modd mnvyaf diolcbgar y lliaws \\nCyfeillion a roisant bob cymbortb i mi gasglu y defnyddiau, ac \\nyr -wyf yn gadael ar y rbai oeddynt yn adnabod fy amvyl dad \\ny'nghyfraith oreu i fainu am gywirdeb y Cofiant. Bu llawer, ac \\nrnae cryn nifer eto o'i Bregethau goreu ar wasgar, y rbai pe eu \\ncesglid gyda'r gweddill Barddoniaetb, a alien t -vraeyd Ail Gyfiol \\nwertbfawr. \\n\\nGwelir oddiwrth restr y Tanysgrifwyr am hvyddiant yr antur- \\naetb, ac y mae genyf addewid y gwertbir Uawer mewn ardaloedd \\nna ddaetb dim emvau oddiyno. \\n\\nBydd yn rbwydd iawn i'r cyfarwydd ddiwygio yr ycbydig \\nwallau argraffyddol sydd yn y Gyfrol. \\n\\nGan ddiolcb am bob cymbortb, a dymuno am bob llwydd a \\n\\nbenditb, \\n\\nY gorpbwysaf, &c, \\n\\nH. E. THOMAS, \\n\\n50, Exmoeth Street, Birkenhead, \\n\\nMehefin 28ain, 1804. \\n\\n\\n\\nCYNHWYSIAD. \\n\\n\\n\\nCOFIANT, \\n\\n\\n\\nTUDAL. \\n\\nPen. I. Prif helyntion ei \\nFywyd Boreuol 5 \\n\\nPen. II. Ei Fywyd a'i lafur \\nGweinidogaethol . . . . 17 V. Esgeuluso cyd- \\ngynnulliad \\n\\nPeeg. VI. Anfeidrol rinwedd \\naberth y Groes \\n\\n\\n\\n62 \\n\\n\\n\\ni)0 \\n\\n69 \\n\\n71 \\n71 \\n\\n74 \\n\\n\\n\\nPeeg. VII. Galwad ar Sion i \\nDdeffroi \\n\\nPeeg. VIII. Y gwr ar y march \\ngwyn \\n\\nPeeg. IX. Gogoniant a di- \\ngonolrwydd trefn gras yn \\nngwyneb trueni dyn fel \\npechadur \\n\\nPeeg. X. Yr henbobl.. .. \\n\\nPeeg. XI. 
Gogoniant ac ar- \\ndderchawgrwydd eglwys Dduw \\nyn y dyddiau diweddaf . . \\n\\nPeeg. XII. Cysylltiad bywyd \\nduwiol a manv dedwydd . . \\n\\n\\n\\n78 \\n\\n\\n\\n79 \\n\\n\\n\\n97 \\n\\n\\n\\nBAEDDON1AETH. \\n\\n\\n\\nMawrnad Dewi Wyn o Eifion 101 \\nHaman a Mordecai . . . . 121 \\nGalareb am y diweddar Barch. \\n\\nJ. Breeze 137 \\n\\nMarwnadyParch.J. P.Davies 148 \\n\\nCywydd i'r Goleudy . . . . \\n\\nA WORLD LEADER IN PAPER PRESERVATION \\n1 1 1 Thomson Park Drive \\nCranberry Township, PA 16066 \\n(724)779-2111 ' \\n\\n\\n\\n\"\n}\n```\n\nThese samples have very good scores overall, but they are useless. I am guessing it's not English text... I counted a few hundred examples, mostly from LOC-PD, and another few hundred in the OTA datasets. Imagine if I fed that crap to my LLM, what would it learn?\n\nI searched around and I decided to calculate a few more metrics like: dictionary word search, vowels in words, noise count. With these new values, I was able to keep only the best quality documents, after I combined all the values together in a single score. 
I'm sure it's not perfect, but when I explore the dataset I'm pretty happy with the data I'm seeing.\n\nI calculated that the low quality documents were less than 1% of the total, which was consistent with the short texts dataset.\n\nBy the way, all of this code to import, query and count the dataset entries is open source: [https://github.com/croqaz/vintage-LLM/tree/main/dataset](https://github.com/croqaz/vintage-LLM/tree/main/dataset) ;\n\nI have a ton more data that I could use if I had more compute, or if I trained a larger LLM, but I decided to limit myself because this was taking forever already.\n\nI ran this training on [Vast.ai](http://Vast.ai) again, on an RTX PRO 6000 WS in France, for 10,150 steps (~6B tokens).\n\nI did 2 rounds of training as you can see in the metrics below. I had started the instance when my tokenization script crashed due to OOM, just as I was uploading the model checkpoints. So while I fixed it and ran it again, I quickly tokenized a sub-dataset of old poems, quotes, proverbs and Oxford clean books that I wanted to use anyway (all within the time period) just to keep the GPU busy. The second round was just the randomized long documents mentioned above.\n\nThis stage cost me $40.\n\nMetrics for this stage:\n\nAccording to the Chinchilla Scaling Law (a rule of thumb of roughly 20 training tokens per parameter, so about 6.8B tokens for 340M params), 3+6=9B tokens is in the right range for this 340M params model.\n\n### Fine-tuning\n\nThis stage is not finished as I'm writing this blog, and I'll make another post when it's ready.\n\nI created a series of Python scripts to generate a \"base knowledge\" dataset to fine-tune the model on, to make it understand dialogs. Basic question/ answers about foods, fruits, vegetables, some animals and birds, colors, a few numbers. After fine-tuning it a few times with different tweaks, I got it to spit out stable dialogs. I was happy to see that it's learning the structure of Question:... followed by Answer:... 
with just a few hundred examples.\n\nA few samples at this point, with the final model + a light fine-tuning with my [CommonSense dataset](https://huggingface.co/datasets/croqaz/commonsense-v1).\n\n```\nUser: What is God?\nLLM #1: God is the absolute ruler, and in him only we have hope and strength.\nLLM #2: A Being of a God-like character would be a being who knew himself to be the supreme Being, and by whom alone we can act.\nLLM #3: The sun shines on the earth and the stars reflect the light.\n\nUser: What is love?\nLLM #1: Love is a powerful impulse and it enables us to overcome our inclinations.\nLLM #2: Love is the result of a certain condition of the human being in which love exists; as, love a man because he loves you.\nLLM #3: Love is a desire to please and the object of an action is the gratification of some passion or affection for a particular person or thing.\n\nUser: What is life?\nLLM #1: Life is a warfare against pain, and the longer we live the more we suffer.\nLLM #2: Life is a great struggle for existence, in which each one has to sacrifice his own individual pleasure and to seek the best good of others.\nLLM #3: Life is life, and its issues are to the soul what the heart is the heart.\n\n# Trying a few greetings\n\nUser: How are you?\nLLM #1: Heat, fire and cold are required of them in order to overcome their resistance.\nLLM #2: We use our best books on astronomy.\nLLM #2: The weather is charming.\n\nUser: Greetings, good sir!\nLLM #1: The best of all is that when you have done.\nLLM #2: Gladly! Thanks for the honor. I was thinking of writing one.\nLLM #2: No; I hate their thoughts. They are not of my mind, but my memory.\n\n# Asking for a few names\n\nUser: What's your name?\nLLM #1: Charles Emile de Saint-Valerie.\nLLM #2: Ninette. 
I belong to the ancient class.\nLLM #3: Charles Rickards.\nLLM #4: Charles Bamford, of the \"Nineteenth Century\".\nLLM #5: My mother was a slave, my father a slave.\n```\n\nAll samples were generated with my \"[generate.py](https://github.com/croqaz/vintage-LLM/blob/main/training/generate.py)\" script, *temperature=1.0, top-p=0.9, top-k=25, min-p=0.01, repetition-penalty=1.1*.\n\nThe model hallucinates a lot and doesn't really follow dialogs. You can't chat with it yet, it needs more baking.\n\nAll dialogs are like a gambling machine: add a coin, pull the lever and hope for the best. You can sometimes get something nice.\n\nIt was a nice surprise to see the model answer the questions about God, love and life so well, especially since I don't have any of that in my \"base knowledge\" dataset, so the model picked it up from the long base-training and learned to apply the chat template on top.\n\n### Can the model memorize math?\n\nIn my fine-tuning knowledge file I had some simple math operations that I casually added without thinking too much. But when I was vibe-checking the model with different prompts across stages of training, I was asking \"How much is 1 + 1?\" to see how it goes, and the answer was usually a random number, or complete nonsense.\n\nSo I wanted to find out: can the model memorize basic math operations for numbers, say, -2 to 12?\n\nI generated a dataset of 5300 questions about math, all basic operations (+ - / *) for all numbers from -2 up to 12, with different ways of asking, eg:\n\n- \"What is 10 - 3?\"\n- \"How much is 7 * 10?\"\n- \"Multiply 9 by 2.\"\n- \"Calculate -2 * 6.\"\n\nI made a math benchmark where I was asking \"Calculate X and Y.\" for all possible operations and number pairs (0 to 12) (I skipped the negative numbers).\n\nFine-tuning is super fast compared to full training, so I was able to experiment a lot and see the result within minutes.\n\nI fine-tuned a few models from the base model, played with different hyper params 
and I got the best results with LR=5e-5, max_seq_length=2048, neftune_noise_alpha=0.0 (very important for this case) and obviously temperature=0.\n\nThe best score:\n\n```\n[ + ]  addition: 75/169 (44.4%)\n   y│   0   1   2   3   4   5   6   7   8   9  10  11  12\n  ──┼────────────────────────────────────────────────────\n  0 │   ✗   ✗   ✓   ✗   ✗   ✓   ✗   ✗   ✗   ✗   ✓   ✗   ✓\n  1 │   ✓   ✗   ✓   ✓   ✓   ✗   ✗   ✗   ✓   ✗   ✗   ✓   ✓\n  2 │   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓   ✗   ✓   ✗\n  3 │   ✗   ✗   ✓   ✓   ✗   ✗   ✓   ✓   ✗   ✓   ✗   ✓   ✓\n  4 │   ✗   ✗   ✗   ✗   ✓   ✓   ✓   ✓   ✗   ✗   ✗   ✓   ✓\n  5 │   ✓   ✗   ✗   ✗   ✗   ✓   ✗   ✓   ✗   ✗   ✗   ✓   ✓\n  6 │   ✓   ✗   ✗   ✓   ✓   ✗   ✓   ✓   ✓   ✗   ✗   ✓   ✓\n  7 │   ✓   ✗   ✗   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✓\n  8 │   ✗   ✗   ✗   ✗   ✗   ✗   ✗   ✓   ✗   ✗   ✗   ✓   ✗\n  9 │   ✗   ✗   ✗   ✓   ✗   ✓   ✗   ✓   ✗   ✓   ✗   ✗   ✓\n 10 │   ✗   ✗   ✗   ✗   ✗   ✓   ✗   ✗   ✗   ✓   ✓   ✗   ✓\n 11 │   ✗   ✗   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓\n 12 │   ✗   ✗   ✓   ✓   ✗   ✓   ✗   ✓   ✗   ✗   ✗   ✓   ✗\n\n[ - ]  subtraction: 114/169 (67.5%)\n   y│   0   1   2   3   4   5   6   7   8   9  10  11  12\n  ──┼────────────────────────────────────────────────────\n  0 │   ✗   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓\n  1 │   ✓   ✗   ✗   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✗   ✓   ✓\n  2 │   ✗   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✓   ✓\n  3 │   ✗   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✓   ✗\n  4 │   ✗   ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✗   ✓   ✗   ✓   ✗\n  5 │   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✗   ✗   ✓\n  6 │   ✗   ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✗   ✓   ✓\n  7 │   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✗\n  8 │   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✗   ✗   ✓\n  9 │   ✗   ✓   ✗   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✗   ✗   ✓\n 10 │   ✗   ✓   ✗   ✓   ✗   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓\n 11 │   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✗   ✗\n 12 │   ✗   ✓   ✗   ✓   ✗   ✗   ✗   ✓   ✓   
✗   ✗   ✓   ✗\n\n[ * ]  multiplication: 118/169 (69.8%)\n   y│   0   1   2   3   4   5   6   7   8   9  10  11  12\n  ──┼────────────────────────────────────────────────────\n  0 │   ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✗   ✓   ✗   ✓   ✓\n  1 │   ✗   ✓   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✗\n  2 │   ✓   ✓   ✓   ✗   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓\n  3 │   ✓   ✗   ✓   ✓   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✗   ✗\n  4 │   ✓   ✓   ✓   ✗   ✗   ✗   ✗   ✗   ✓   ✗   ✓   ✓   ✓\n  5 │   ✓   ✓   ✓   ✗   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓\n  6 │   ✓   ✓   ✓   ✗   ✓   ✗   ✓   ✓   ✓   ✓   ✗   ✓   ✗\n  7 │   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✗\n  8 │   ✓   ✓   ✗   ✓   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✗\n  9 │   ✗   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✗   ✗   ✗\n 10 │   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✗   ✗   ✓   ✗   ✗   ✓\n 11 │   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓\n 12 │   ✗   ✗   ✓   ✓   ✗   ✗   ✗   ✓   ✓   ✓   ✗   ✓   ✗\n\n[ / ]  division: 85/156 (54.5%)\n   y│   0   1   2   3   4   5   6   7   8   9  10  11  12\n  ──┼────────────────────────────────────────────────────\n  0 │   ·   ✓   ✓   ✓   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✓   ✓\n  1 │   ·   ✗   ✗   ✓   ✓   ✗   ✗   ✓   ✓   ✓   ✗   ✓   ✗\n  2 │   ·   ✓   ✓   ✗   ✓   ✗   ✗   ✓   ✗   ✓   ✓   ✓   ✓\n  3 │   ·   ✗   ✗   ✗   ✗   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✓\n  4 │   ·   ✓   ✗   ✗   ✓   ✓   ✓   ✗   ✗   ✗   ✓   ✗   ✗\n  5 │   ·   ✓   ✗   ✗   ✗   ✗   ✓   ✓   ✗   ✗   ✓   ✓   ✓\n  6 │   ·   ✗   ✓   ✗   ✓   ✓   ✗   ✓   ✗   ✗   ✓   ✗   ✗\n  7 │   ·   ✗   ✗   ✓   ✓   ✗   ✓   ✗   ✓   ✓   ✓   ✓   ✗\n  8 │   ·   ✗   ✓   ✓   ✓   ✗   ✗   ✗   ✗   ✓   ✓   ✗   ✗\n  9 │   ·   ✓   ✓   ✓   ✗   ✗   ✗   ✓   ✗   ✗   ✗   ✗   ✓\n 10 │   ·   ✓   ✗   ✓   ✓   ✓   ✗   ✓   ✓   ✓   ✗   ✓   ✗\n 11 │   ·   ✓   ✓   ✓   ✗   ✗   ✓   ✓   ✗   ✓   ✓   ✗   ✓\n 12 │   ·   ✗   ✗   ✗   ✗   ✗   ✗   ✓   ✓   ✓   ✓   ✓   ✗\n\n  SUMMARY\n  +  addition           75/169  ( 44.4%)  ████████\n  -  subtraction       114/169  ( 67.5%)  █████████████\n  *  
multiplication    118/169  ( 69.8%)  █████████████\n  /  division           85/156  ( 54.5%)  ██████████\n  --------------------------------------------------\n  TOTAL             392/663  ( 59.1%)\n```\n\nI am surprised that the model consistently performs better at subtraction than addition, because the way subtraction is formulated is a bit more complicated, e.g.: \"Subtract 2 from 3.\" and \"Subtract 3 - 2.\", the order of numbers is reversed. I would have guessed that the model would perform best at addition (like humans do), but addition had the worst score...\n\nAnyway, I wasn't able to get a score better than 60%, so I think this model can't memorize basic math.\n\nCorrect me if I'm wrong! Leave a comment and tell me what I'm missing!\n\nObviously I won't upload this model on HuggingFace just so you download a 650MB brain that answers 1 + 1 correctly half the time... Use a calculator OK?\n\n### Vibe checking during training\n\nI played with the model at different training steps to see what it generates.\n\nIt's interesting that, at the beginning of training, the model is learning the order of words, what word should follow what. Then, it starts to learn the words in a larger sequence like a sentence. Then, it learns to stitch 2-3 sentences together. If you continue to train it, it will learn how to build a long paragraph, but it takes a while...\n\n### Step 200\n\nThe model is still mostly random noise.\n\n```\n--- Prompt 1 ---\nUser: Hello!\n\nModel:\nHello!ltO,til,le\ncansia\nthe tst. .\nm'\no. the i6n\na 1\nM. B. C. C. PI\n.h.,\nThe\nFn.l\ni.\nn.\nL\n\n--- Prompt 2 ---\nUser: Who are you?\n\nModel:\nWho are you? a man\nof the country, who was born for the\npurpose of having been laid. We're to-day\nthe present occasion. The number of votes\nwas given up to the Bank from the South\nCarolina in the county of War. In which they\nhad taken the whole. The proceedings were\nallowed to produce from this great mass. 
This\nwere one of the most unnecessarily, and the\nquestion ensued on the part of the company.\n\n--- Prompt 3 ---\nUser: Greetings, my friend\n\nModel:\nGreetings, my friend of her Ma t'estl. In all the other nations of his life and honour,\nhe would make a more than an instant, to any such thing as that has ever been done in the world.\nIt is impossible to expect that this country is in a condition in it...\n\n--- Prompt 4 ---\nUser: What's this place?\n\nModel:\nWhat's this place?\nthe first and most important to which the State, and in its opinion, has been the great\nand valuable to us. The Legislature of Russia was as follows :—\" That the House of\nMr. H. M'Neard (Mr. S. H.) in Ireland was referred to an early period of money in\n```\n\n### Step 1000\n\nThe model learns to connect some words together.\n\n```\n--- Prompt 1 ---\nUser: Hello!\n\nModel:\nHello! CIR. and HI, RAJUy\nDESTROYED BUSINESS.\nBy Mr. j. r. FULLER, and Messrs. A. & J.\nMclean and J. E. Hubbard, MIR G. P. Roulton,\nMr. G. w. Harris of the firm Of E. +. Dexter,\nand J. A. Brown OF the firm of\n\n--- Prompt 2 ---\nUser: Who are you?\n\nModel:\nWho are you? — but God has a child, who wants to be the same, and he is a most excellent man. It has been an ornament of manners for a lady and an officer. I have always observed that the author does not write to me. — I am no stranger, my good friend ; but I shall look over his life with pleasure, if I do not be a stranger. As a stranger I believe, it will be my duty to speak to the young lady, the stranger, if not\n\n--- Prompt 3 ---\nUser: Greetings, my friend\n\nModel:\nGreetings, my friend, but in the course of an old acquaintance. — It is rather remarkable how much your lordship seems to be the manu facture of this most noble family, that you have so many a favouritewith me. 
— But this is not all I can conceive of; for it seems to me, that even this great object of that kind is to procure the benefit of our religious institutions to which I have referred. — The only difference be tween your own personal merit and his attachment. It is\n\n--- Prompt 4 ---\nUser: What's this place?\n\nModel:\nWhat's this place?\nHe said there was no danger of such occurrence.\nThe plaintiff, the defendant, is a man of very respectable\ncharacter, and who has been in the service of the firm\nof Messrs. Barrett and Co. for a period of three years, at\nthe rate of 50 per cent. on the gross value of his estate in\nwhich he resided in the year 1830. He had for many\nyears previously kept an account of the business of the bank,\nand it appeared\n```\n\n### Step 10,000 (end of stage-1)\n\nThis model generates clean text because I did a very light fine-tuning with my [CommonSense dataset](https://huggingface.co/datasets/croqaz/commonsense-v1) at this point.\n\n```\n--- Prompt 1 ---\nUser: Hello!\n\nModel:\nHello! What a fine idea ! How many legs is this ? It is a sort of animal ; that's the beast. Oh, what is it? What is it? The horse has no tail. But it's a brute, and I am afraid I cannot find a tail in it. A horse has no tail, and he eats it, and eats it. And how is it? Oh, how much does it take to eat it? Oh, how much do they eat? And the horse is\n\n--- Prompt 2 ---\nUser: Who are you?\n\nModel:\nWho are you?\nThen, what is the difference between a man and his wife?\n\n--- Prompt 3 ---\nUser: Greetings, my friend\n\nModel:\nGreetings, my friend has been in the habit of walking about the streets for the last two or three days. He has walked to and fro in all directions in search of a place where he could purchase some articles at a cheap rate. This is not true, as you know; but it seems that you have been thinking about a place where you could buy a few articles at wholesale prices. 
This is a strange way of doing business, and you are often told that a man may make his goods cheap by buying them cheap\n\n--- Prompt 4 ---\nUser: What's this place?\n\nModel:\nWhat's this place? Do you see it ? Do you see it ? I have seen it. Do you see it ? Is the tree that bears its fruit, and the fruit which is in the tree? Do you see it ? Are there trees that stand upon the ground? Is there any other trees of the same sort? Do you see them ? DO you see them ? Do you see them? Do you see them? Are they more common? Do you see them ? What are those ? Do you see them?\n```\n\n### Step 20,000 (end of stage-2)\n\nAt this stage the text is semi-random, but it's consistent. The whole chunk of text seems to be from the same book, which is great to see.\n\nNo fine-tuning.\n\n```\n--- Prompt 1 ---\nUser: Hello!\n\nModel:\nHello!” cried Mr. Dyer, “ the man is a\nfoolish fellow ; he must be made of some use.”\n“ But I do not think so, sir,” said the lawyer, “ I\nwill tell you a story in one of my books. The case\nof the old gentleman is this : — He had been living in\na strange country. His wife died, and his son took\ncharge of him. He did not\n\n--- Prompt 2 ---\nUser: Who are you?\n\nModel:\nWho are you?”\nThe detective was standing by the table, with his\nback toward the door.\n“ You are going to get up that row?” he said.\n‘‘Yes,” he replied.\n“And go to get it done?”\n“No.”\n“Then we will see about it.”\n\n--- Prompt 3 ---\nUser: Greetings, my friend\n\nModel:\nGreetings, my friend, and I will give you the\nadvice of an honest man.\n\" The above is a literal translation : I do not doubt it ; but\nI do not know whether the reader has read it himself.\n\" P. S. — I have just returned from the country. I am very\nsorry to say I had rather be in the house myself than in any one's\nhouse. 
If you please to remember me, sir, it shall be your\n\n--- Prompt 4 ---\nUser: What's this place?\n\nModel:\nWhat's this place?\nThe devil himself is the devil !\nTHE VOICE OF THE HEART. 53\nVOICE OF THE HEART.\nI wish that I had never been born !\nI'll never tell the tale of my sin !\nI'll never tell it !\nBut, ah ! the words are full of pain ;\nThey sound as if they had grown more plain :\n```\n\n### Next steps\n\nI don't have a good instruct model yet and I won't share something half-baked, but I have plenty of datasets that I'm working on:\n\n- my [CommonSense dataset](https://huggingface.co/datasets/croqaz/commonsense-v1) is expanding, I have ~7k rows prepared and more ideas to add in there;\n- I have a list of quotes from philosophers like: Aristotle, Freud, Plato, etc and I'll get them in shape for Q&A pairs;\n- I extracted clean questions and answers from several pre 1900 books like:\n  - [The new common school question book 1888](https://archive.org/details/newcommonschoolq01crai)\n  - [The advanced question book 1883](https://archive.org/details/advancedquestion00sout)\n  - [A complete dictionary of synonyms and antonyms 1835](https://archive.org/details/completedictiona00falluoft), etc ;\n  - I didn't process all of them yet, but I'll publish them as a separate dataset.\n- I have old dictionaries that I plan to extract word definitions from, and I'll convert them to more Q&A ;\n- I want to extract different topics from King James Bible to create more Q&A pairs;\n- I also have [Talkie-13B-Q8-GGUF](https://huggingface.co/zakarth/talkie-1930-13b-it-vulkan-fixed-GGUF) and [Monad-GPT-Q8-GGUF](https://huggingface.co/TheBloke/MonadGPT-GGUF)! I'm sure I can generate a few synth datasets.\n\nThese datasets, or at least part of them, should help the model understand the dialog structure. I'll experiment.\n\nI will write another post once I have a good enough model to share.\n\n### Finis\n\nWhen I started this project, I intentionally didn't define \"success\". 
I didn't start with a clear goal. It was a learning project first of all.\n\nI was only hoping to get some kind of usable Victorian dialog out of it.\n\nBut I was surprised how bad it is without any fine-tuning, after spending so much time and energy on it... therefore I want to use all my creative energies to create something decent enough to share.\n\nWhat do you think?\n\nAre you interested in this kind of work?\n\nWhat would you use this kind of LLM for?\n\nWhat would you like me to try, in this context of Vintage LLMs?\n\nIf you want to reach out, you have a few ways:\n\n- leave a comment here ;\n- leave a comment on HF discussions: [https://huggingface.co/croqaz/vintage-LLM-340m-v1-base/discussions](https://huggingface.co/croqaz/vintage-LLM-340m-v1-base/discussions) ;\n- raise an issue on Vintage-LLM repo: [https://github.com/croqaz/vintage-LLM/issues](https://github.com/croqaz/vintage-LLM/issues) ;\n- join our [Vintage-LLM Discord Server](https://discord.gg/MVzzuzGXhk) and send a message.\n\nIf you want to reach out in private, my e-mail is *my short name* at *this domain*.\n\n**If you're a cloud GPU provider, you have some GPUs collecting dust and want to sponsor the training of my next model, do reach out**! I will use all the compute for a bigger 1B model, with a better dataset; or I can base-train this 340M model up to 50,000 steps, I have plenty of data, your choice!\n\nIf you want to read more about these kinds of LLMs, check my previous post: [Vintage LLM models](/log/entries/260428-1/), lots of links.\n\nI encourage anyone who is curious or hyped about AI and LLMs to do what I did, and play with it along the way between training stages, to get a peek behind the curtain. You'll forget about AGI in no time, but it will be fun, I promise!\n\nMaybe find a twist in your process, something that you didn't see anyone else do, perhaps some unique dataset, or unique training method, or LLM architecture? But you don't have to, you'll learn new things anyway. 
I know I did!\n\nNote: This blog post is 100% written by me. No AI has been used whatsoever.", "url": "https://wpnews.pro/news/making-a-vintage-llm-from-scratch", "canonical_source": "https://crlf.link/log/entries/260525-1/", "published_at": "2026-06-11 08:38:00+00:00", "updated_at": "2026-06-12 00:43:36.726485+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "natural-language-processing", "ai-research"], "entities": ["Hayk Grigorian", "HuggingFace", "croqaz", "vintage-LLM-340m-v1-base", "vintage-LLM", "Reddit", "LocalLLaMA", "London"], "alternates": {"html": "https://wpnews.pro/news/making-a-vintage-llm-from-scratch", "markdown": "https://wpnews.pro/news/making-a-vintage-llm-from-scratch.md", "text": "https://wpnews.pro/news/making-a-vintage-llm-from-scratch.txt", "jsonld": "https://wpnews.pro/news/making-a-vintage-llm-from-scratch.jsonld"}}