{"slug": "reverse-engineering-codemasters-bigf-archive-format-in-ruby", "title": "Reverse-engineering Codemasters' BIGF archive format in Ruby", "summary": "A developer reverse-engineered Codemasters' BIGF archive format using pure Ruby, demonstrating that Ruby's binary string handling and String#unpack method can efficiently parse game data from TOCA Race Driver and three other titles. The project relied on AI-assisted coding but required manual verification of all byte-level claims, highlighting Ruby's suitability for low-level binary parsing despite its typical association with web development.", "body_md": "The Engineer's Notebook\n\n# Reading a Binary Game Format in Ruby\n\nWhen you say “I’m going to reverse-engineer a binary file format,” people picture C,\nor Python with `struct`, or Kaitai. Nobody pictures Ruby. Ruby is for web apps and\nDSLs and being pleasant; it is not, in the popular imagination, for byte-banging\nfloats out of a 2003 racing game.\n\nThat popular imagination is wrong. The reader for Codemasters’ BIGF archive format —\nthe container that holds the AI data in TOCA Race Driver — is pure, dependency-free\nRuby, and it reads four different games’ archives. I should be upfront about how it\ncame to be: this was reverse engineering done *with* an AI the whole way — me\nsteering, deciding what to trust and verifying every claim against the bytes; the\nmodel drafting code, recalling the corners of the standard library, and proposing\nhypotheses I then tested. What follows is the part of Ruby that made that\ncollaboration genuinely pleasant: **Ruby strings are byte buffers, and\nString#unpack is a tiny, fast binary parser hiding in plain sight.**\n\n## Strings are bytes\n\nThe first thing to internalise is that a Ruby `String` is not “text.” It’s a\nsequence of bytes with an encoding label attached. 
Read a file in binary mode and\nyou get the raw bytes, indexable and sliceable like any string:\n\n```\ndata = File.binread(\"aib.big\")   # the whole file as an ASCII-8BIT String\ndata[0, 4]                        # => \"BIGF\"  — the first four bytes\ndata.bytesize                     # => 3448832\n```\n\n`File.binread` is the key: it reads the file as binary (`ASCII-8BIT` / `BINARY`\nencoding), so no UTF-8 interpretation mangles your `0x80`+ bytes. From there,\n`data[offset, length]` carves out byte ranges, and `data.index(needle, from)` finds\na magic number or a marker anywhere in the file. That’s most of a parser already.\n\n## `unpack`: the binary decoder you already have\n\nThe workhorse is `String#unpack` (and its single-value sibling `unpack1`). You hand\nit a *format string* of directives and it decodes the bytes. The two directives that\ndid 90% of the work here:\n\n- `V`: an unsigned 32-bit integer, **little-endian**. Every count, block index, offset and size in BIGF is a `V`.\n- `e`: a **little-endian single-precision float** (32-bit). The AI data is arrays of these: the racing-line coordinates, the control values, the padding.\n\n```ruby\ndata[4, 4].unpack1(\"V\")      # => 39        — the entry count, as a u32 LE\ndata[12, 16].unpack(\"e4\")    # => [0.0, 0.0, 137.0, 0.0]   — four float32s\n```\n\nEndianness lives in the directive, which is the whole game: `V` is little-endian\nu32, `N` is big-endian; `e` is little-endian float, `g` is big-endian. Codemasters’\nPC games are little-endian, so it’s `V` and `e` throughout. (When we later looked at\nan Xbox 360 file, big-endian PowerPC, it would have been `N` and `g` — the format\nstring is the only thing that changes.)\n\n`unpack` is implemented in C inside the interpreter, so decoding a few hundred\nthousand floats is not slow. 
You are not paying a “scripting language” tax here.\n\n## Walking the container\n\nBIGF is a header, a directory, and a data section. The header check is a one-liner:\n\n```\nMAGIC = \"BIGF\".b\nraise \"not a BIGF archive\" unless data[0, 4] == MAGIC\n```\n\nThat `.b` is worth a footnote: it returns a binary copy of the string literal, so\nthe comparison is byte-for-byte regardless of source-file encoding. I use it for\nevery binary constant.\n\nBIGF has two directory layouts. One is a flat table of **fixed 24-byte records** —\n`char name[16]; u32 size; u32 offset` — which is a textbook `unpack` loop:\n\n```\ncount = data[4, 4].unpack1(\"V\")\nbase  = data[8, 4].unpack1(\"V\")    # data-section base, read from the header (not assumed!)\noff   = 0x24                        # records start after the 0x20 header + a 4-byte pad\n\ncount.times do\n  rec = data[off, 24]\n  name          = rec[0, 16].split(\"\\x00\").first.to_s   # NUL-terminated name field\n  size, offset  = rec[16, 8].unpack(\"V2\")               # two u32s in one go\n  members << Entry.new(name:, offset: base + offset, size:)\n  off += 24\nend\n```\n\nThree small Ruby niceties are doing real work there. `rec[0, 16].split(\"\\x00\").first`\nturns a fixed-width, NUL-padded C string into a Ruby string. `unpack(\"V2\")` pulls\n*two* integers at once (the count suffix). And — a hard-won detail — `base` is read\nfrom the header field at `0x08` rather than hard-coded, because measuring 1,371 real\nfiles showed it isn’t always the `0x800` everyone assumes.\n\nThe other layout is variable-length: names interspersed with a `0x44 00 00 00`\nmarker. That’s where `String#index` shines — you scan for the extension, walk back to\nthe preceding NUL to find the name’s start, then look just past it for the marker:\n\n```\nwhile (idx = data.index(\".aib\", pos)) && idx < limit\n  s = idx\n  s -= 1 while s.positive? 
&& data.getbyte(s - 1) != 0   # walk back to the NUL\n  name = data[s...(idx + 4)]\n  # ...marker + block index follow the name...\n  pos = idx + 4\nend\n```\n\n`getbyte` reads a single byte as an integer without allocating a substring — exactly\nwhat you want in a tight backwards scan.\n\n## Decoding the records inside\n\nCarving a member is just a slice — `data[entry.offset, entry.size]` — and Ruby\nslicing is *safe*: ask for bytes past the end of the file and you get a short string\nor `nil`, never a crash. Inside an AI profile, every 16 bytes is four float32s, and\nthe parser classifies each record by its bit pattern:\n\n```ruby\nSENTINEL   = \"\\x3f\\x3f\\x3f\\x3f\".b.unpack1(\"e\")        # => 0.7470588…  (the padding value)\nKTAG_MAGIC = \"\\x0c\\x00\\x00\\x00\\x08\\x00\\x00\\x00\".b\n\ndef classify(bytes)\n  return [:ktag, bytes[8, 4].unpack1(\"e\")] if bytes[0, 8] == KTAG_MAGIC\n\n  a, b, c, d = bytes.unpack(\"e4\")\n  if [a, b, c, d].all? { |x| (x - SENTINEL).abs < 1e-5 } then :pad\n  elsif a.zero? && b.zero? && c.zero? && d.zero?         then :zero\n  elsif b.zero? && d.zero?                               then :scalar   # (v,0,v,0)\n  elsif [a,b,c,d].all? { |x| coordish?(x) }              then :path     # (x,y,x,y)\n  else :other\n  end\nend\n```\n\nThat `SENTINEL` line is a small joy: nobody had to look up “what float is\n`0x3f3f3f3f`?” in a calculator — we let Ruby tell us by unpacking the four bytes\n(`0.7470588…`). The classifier then reads almost like prose, which matters when the\nprose *is* the format specification you’re trying to pin down.\n\nOne genuine gotcha lives in `coordish?`: some 16-byte records, read as floats, are\ndenormals or `NaN`. Ruby’s `Float#nan?` and a magnitude check handle it cleanly —\nbut you have to remember that `x == x` is `false` for `NaN`, so the guard is\n`!x.nan? && x.abs < 1e30 && …` rather than a naive comparison. 
(RuboCop will even nag\nyou toward `nan?` if you write the `x == x` trick.)\n\n## Why Ruby, specifically\n\nHaving done this, the case for Ruby on a binary-RE task is concrete:\n\n- **Strings-as-buffers + slicing** make navigation ergonomic — no cursor object, no read/seek ceremony, just `data[off, len]`.\n- **`unpack` is a complete, fast, C-backed binary decoder** with a one-character vocabulary for every integer and float width and endianness.\n- **Zero dependencies.** The whole reader is standard library. A research tool that has to run on a stranger’s machine in five years should not depend on a gem whose API has since drifted.\n- **It reads like the spec.** When the code that classifies a record is short enough to hold in your head, the code *becomes* your documentation of the format — which is the entire point of reverse-engineering.\n- **The REPL closes the loop.** During the actual work, `irb` with `File.binread` and a one-line `unpack` is the fastest way to ask “what is at offset `0x5c00`?” and get an answer before the thought has finished.\n\nThe gotchas are few and all about staying in binary-land: read with `binread`, write\nbinary constants with `.b`, get the endianness directive right (`V`/`e`, not `N`/`g`),\nuse `unpack1` when you want one value instead of an array, and treat `NaN` with\nrespect. None of them are Ruby’s fault; they’re just what binary is.\n\nA 2003 racing game’s AI, a four-byte magic, a table of offsets, and a few hundred\nthousand little-endian floats — all read by twenty lines of standard-library Ruby. 
The\nlanguage people use for `has_many :comments` turns out to be a perfectly good\ndisassembler’s notebook — and, paired with an AI that never tires of unpacking the\nnext sixteen bytes, a fast one.\n\n### Where to look\n\nThe full reader is open source — `String#unpack` in anger across two table layouts and\nfour games:\n\n- **Repository:** `github.com/davidslv/bigf` (MIT)\n- The container parser: `lib/bigf/archive.rb` · the record decoder: `lib/bigf/toca/profile.rb`", "url": "https://wpnews.pro/news/reverse-engineering-codemasters-bigf-archive-format-in-ruby", "canonical_source": "https://davidslv.uk/2026/06/30/reading-binary-in-ruby.html", "published_at": "2026-06-30 17:10:26+00:00", "updated_at": "2026-06-30 17:20:48.048234+00:00", "lang": "en", "topics": ["developer-tools"], "entities": ["Codemasters", "Ruby", "TOCA Race Driver", "BIGF", "String#unpack", "File.binread"], "alternates": {"html": "https://wpnews.pro/news/reverse-engineering-codemasters-bigf-archive-format-in-ruby", "markdown": "https://wpnews.pro/news/reverse-engineering-codemasters-bigf-archive-format-in-ruby.md", "text": "https://wpnews.pro/news/reverse-engineering-codemasters-bigf-archive-format-in-ruby.txt", "jsonld": "https://wpnews.pro/news/reverse-engineering-codemasters-bigf-archive-format-in-ruby.jsonld"}}