{"slug": "playing-chess-with-large-language-models", "title": "Playing chess with large language models", "summary": "According to Nicholas Carlini's 2023 article, while computers have surpassed humans in chess for decades using specialized game-playing models, OpenAI's GPT-3.5-turbo-instruct, a language model designed only to generate English text, was unexpectedly discovered to play chess at the level of skilled human players. The model achieves this by processing the game's PGN notation and predicting the next move, effectively maintaining a \"world model\" of the chessboard in its activations without being explicitly programmed with the rules. Carlini built a Python wrapper to connect the model to a Lichess bot, noting that the model plays in a human-like manner, even making strategic but flawed attacks.", "body_md": "by Nicholas Carlini 2023-09-22\n\nComputers have been better than humans at chess for at least the last\n25 years. And for the past five years, deep learning models have been\nbetter than the best humans. But until this week, in order to\nbe good at chess, a machine learning model had to be\n**explicitly designed** to play games: it had to be told explicitly\nthat there was an 8x8 board, that there were different pieces,\nhow each of them moved, and what the goal of the game was.\nThen it had to be trained with reinforcement learning against itself.\nAnd then it would win.\n\nThis all changed on Monday, when OpenAI released GPT-3.5-turbo-instruct,\nan instruction-tuned language model that was designed to just write English text,\nbut that people on the internet quickly\ndiscovered can play chess at, roughly, the level of skilled human\nplayers. (How skilled? I don't know yet. But when I do I'll update this!)\n\n(An [instruction-tuned model](https://arxiv.org/abs/2203.02155)\nis one that's been *aligned* to follow human\ninstructions. Basically: “do the right thing”.)\n\nYou should be very surprised by this. Language models ... 
*model language*.\nThey're not designed to play chess.\nThey don't even know the rules of chess!\nGPT-4, for example, can't even play a full game against me without\nmaking a few illegal moves. GPT-3.5-turbo-instruct can beat me.\n\nI (very casually) play chess and wanted to test how well the model\ndoes. But instead of prompting the model and copying the moves over one\nby one, I built a small Python wrapper around the model that connects\nit to any UCI-compatible chess engine. Then I hooked this into a\n[Lichess](https://lichess.org) bot. If you're reading this\nnot too far in the future, and my bot is still running, you can\n\n(As long as you have a Lichess account. (Which if you play chess you should. Lichess is fully open source and amazing.) And as long as no one else is currently playing it; otherwise you'll have to wait a bit.)\n\nBelow is a game where one of my coworkers beat the model; it shows what I find most interesting about how it plays: like it was a human! It gets into a strong position, and then goes for what it thinks is a mating attack by sacrificing its rook with 21. Rxg5+. The attack looks threatening, but it doesn't work.\n\nI've also released full source code for my bot at [this github repo](https://github.com/carlini/chess-llm).\n\nLet's get started with how you make a language model play chess.\nAs you might know, language models just predict the next word in a\nsentence given what's come before.\nSo to get it to play chess, all you do is\npass it\nthe [PGN](https://en.wikipedia.org/wiki/Portable_Game_Notation)\nnotation of every move that's been played so far, and ask it to predict the\nnext move that will happen.\n\nFor example, supposing I wanted the model to play as black and respond to a\nstandard King's Pawn opening, I would feed the model the following PGN game.\n\n(I've told the model the game is between two of the best chess players of all\ntime, and told it their ratings are 2900 and 2800---very high. 
Does this matter?\nActually not that much. But this is a blog post not a research paper so I'm\njust not going to do an ablation study here.)\n\nand then ask it to generate a predicted next word. The model replies,\nin this case, with `e5`, which is the standard response to\nthe King's Pawn opening. Then, if I want the model to play the next move\nafter that, I again feed it the new history of moves `1. e4 e5 2.`\nand ask it for its next prediction.\n\nNow you might think “well that's cheating it's probably seen this position a million times of course it will get that right”. And you would be right in this case. But I've played a few dozen games against it now, in positions that have never occurred online before, and it still plays remarkably well.\n\nLet's take a moment to be truly amazed at what's happening. Somehow, the\nmodel is maintaining a “world model” of the chess board\nin its activations. And every time it has to generate a new move, it has to\nreplay the entire sequence of moves that have happened so far, and then\npredict the next move. And it's doing this all well enough to not\njust make *valid* moves, but to make *good* moves.\n\nAnd even making valid moves is hard! It has to know that you can't move a piece when doing that would put you in check, which means it has to know what check means, but also has to think at least a move ahead to know if after making this move another piece could capture the king. It has to know about en passant, when castling is allowed and when it's not (e.g., you can't castle your king through check but your rook can be attacked). And after having the model play out at least a few thousand moves it's so far never produced an invalid move.\n\nBut anyway: why does the model learn to play good chess? I honestly have no idea. 
All I can offer is the simplistic speculation you've probably already thought of: Because most chess games that people post on the internet are high quality, “predict the next word” happens to align pretty well with “play a good move”. And so the model just plays good moves because that's what's been seen most often.\n\nIt's important to remember, though, that this model *is not playing to win*.\nIt's playing to maximize the likelihood of the PGN text that's been provided.\nUsually that means *“play high quality moves”* because most\nPGNs online are high quality games.\n\nBut this is not always the case. Let's consider the below example:\n\nHere, I've prompted the model with the first three moves of the\n[Bongcloud Attack](https://en.wikipedia.org/wiki/Bongcloud_Attack),\na joke opening where white moves\ntheir king out on the second move. This is a terrible move. You should\nnever do this if you want to win.\n\nA “good” response to the Bongcloud is to just develop\nyour pieces and play good chess. What does the model do here?\nIT PLAYS THE DOUBLE BONGCLOUD! One of the worst replies possible.\nIt does this, I guess, because when people\nplay the Bongcloud, it's not usually a serious game and their\nopponent will then play the Bongcloud back. (For example,\n[as Magnus Carlsen\nand Hikaru Nakamura did in a truly amazing game](https://www.youtube.com/watch?v=zVCst6vyV80).)\n\nThis distinction is important, and it's why this model is so interesting to me. It “feels” much more human than any other chess engine I've played against. Because it's just playing what humans play.\n\nThere is some research that suggests language models do actually\nlearn to represent the game *in memory*. For example here's\n[one of my favorite papers recently](https://arxiv.org/abs/2210.13382)\nthat shows that a language model trained on Othello moves can\nlearn to represent the board internally. 
But Othello is a much\nsimpler game than chess, because pieces don't move in crazy ways.\n\nAnother consequence of the model being a text-model is that sometimes it's hard to turn the board into a string of text that can be fed into the model.\n\nFor example, the closest I got to the model producing\nan invalid move was in a game where one of my coworkers had the option to\ncastle either kingside (denoted in PGN by `O-O`) or queenside (denoted by `O-O-O`).\nHe castled kingside, and the model's predicted next move given the\ncontext `30. O-O` was the continuation `-O`! That is, the model\nwas just saying “I think you should have castled queenside”.\n\nThe reason why it did this is because there's no space separating the end of the previous move from the beginning of the generation, and it hadn't occurred to me that there was any valid move that was a prefix of another valid move. And getting a bit technical: the reason you don't want to insert a space after the last move is that, the way the language model tokenizer works, spaces are inserted before words, not after. So in general you shouldn't have trailing spaces. (I fixed this by, in this one specific case, adding a trailing space.)\n\nSo I want to test the model somewhat objectively. So I decided to have it\ntry to solve some tactics puzzles. In some sense this is what I expect should\nbe hardest for the model. (Because, remember, *it's not doing any lookahead;\nit's just predicting the next word*.)\n\nTo do this I'll use the\n[Lichess puzzle database](https://database.lichess.org/#puzzles),\na collection of 3.5 million puzzles from real games in the following format:\n\nYou may notice\nthere's one problem. The puzzles only have the current board state (encoded as [FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)),\nnot the full PGN history. 
And the language model is only good when operating\non the full game text.\n\nFortunately though, it *does* have the Lichess game that the puzzle was taken\nfrom. And also, fortunately, there is\n[a database of all games](https://database.lichess.org/#standard_games) played on Lichess.\nSo all I have to do is associate each puzzle with the game it came from,\nextract the PGN from the game, and then query the model on the PGN.\n\nNow in practice there are a lot of games. And there are a lot of puzzles. So instead of downloading every single game, I just extract a very small subset of the games (roughly 0.1%), build an index of which games I have, and intersect that index with the puzzles. This gives me several thousand puzzles which is more than enough to get the statistical power I need.\n\nSo let's query the model. I built a small driver to feed the model the initial state and get a move, and then repeatedly play out the opposing move and ask the model for its next move. The model passes the puzzle if, just as a human, it gets it perfectly correct. Invalid moves or incorrect moves fail the puzzle.\n\nHere's a plot of how well it does for puzzles as I increase the *puzzle rating* from\n400 to 2400. (Puzzle rating is calculated by Lichess based on, roughly, how often people\nget the puzzle correct.)\n\nLet's start off with an example where the model actually gets it right. In this 2600 rated puzzle, after 31. Ne3 with the triple fork, the model finds Qxe5 which looks completely losing because it's defended, but actually wins because the pawn on d6 is basically pinned to prevent the backrank mate.\n\nBut equally interesting, there are some truly trivial puzzles the model gets wrong.\nIn the following position, instead of doing the immediately obvious “PUSH THE PAWN”\nthe model decides to play `39... h5`. wat.\n\nSo I said above that the only thing the model was doing was modeling the language,\nand not trying to win. What does this **really** mean? 
Well, let's try something\nfun. The same game board can be reached through many different move sequences.\nFor example, the following two final board positions are identical, but if you step\nthrough you'll see that the move sequences that generated them are (very!!) different.\n\nThis first game is the actual game that was played.\n\nAnd this second game is another completely legal, but not very likely, game that could have generated this same final board. (You mean you don't normally hang your queen to a pawn while trying to promote your pawn to a knight only to give it away?)\n\nHow do I generate these move sequences for alternate paths to get to the same game?\nIn general this is a very hard problem. Fortunately someone's already solved it for me.\nSo I just call [proofgame](https://github.com/peterosterlund2/texel/blob/0cd3e3dc43489238ec5a9502a1743f086f161f53/doc/proofgame.md)\nand it gives me the answer. How does it work? Honestly no idea. Probably black magic.\nBut verifying that it does work is easy and I did that.\n\nNow let's ask the following question: how well does the model solve chess positions when given completely implausible move sequences compared to plausible ones?\n\nAs we can see at right **it's only half as good!**\nThis is very interesting. To the best of my knowledge there aren't any\nother chess programs that have this same kind of stateful behavior,\nwhere *how* you got to this position matters.\n\nThis suggests something interesting, too: the model might actually be adapting on-the-fly to the skill of the opponent. If the opponent plays weird moves that don't make sense, it might be more likely to “believe” that this PGN game is between two lower rated players and therefore it should produce opponent moves that are more likely to be played by lower rated players.\n\nThere is an alternate explanation, however. Maybe what we're doing here by producing a confusing sequence of moves is actually confusing the model. 
As in---maybe we've broken its internal world model and now it doesn't “know” what the board looks like as well and so can't play as well.\n\nI was definitely one of those “language models can't world model” people for a while.\nAfter reading [the Othello paper](https://arxiv.org/abs/2210.13382) mentioned earlier\nI was sort of convinced that maybe they could. But actually playing chess (and losing) against what I know\nto be a language model was very surreal. I don't know how to feel about this.", "url": "https://wpnews.pro/news/playing-chess-with-large-language-models", "canonical_source": "https://nicholas.carlini.com/writing/2023/chess-llm.html", "published_at": "2023-09-22 00:00:00+00:00", "updated_at": "2026-05-19 22:12:50.223677+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "research"], "entities": ["OpenAI", "GPT-3.5-turbo-instruct", "GPT-4", "Nicholas Carlini"], "alternates": {"html": "https://wpnews.pro/news/playing-chess-with-large-language-models", "markdown": "https://wpnews.pro/news/playing-chess-with-large-language-models.md", "text": "https://wpnews.pro/news/playing-chess-with-large-language-models.txt", "jsonld": "https://wpnews.pro/news/playing-chess-with-large-language-models.jsonld"}}