The Engineer's Notebook
When you say “I’m going to reverse-engineer a binary file format,” people picture C, or Python with struct, or Kaitai. Nobody pictures Ruby. Ruby is for web apps and DSLs and being pleasant; it is not, in the popular imagination, for byte-banging floats out of a 2003 racing game.
That popular imagination is wrong. The reader for Codemasters’ BIGF archive format — the container that holds the AI data in TOCA Race Driver — is pure, dependency-free Ruby, and it reads four different games’ archives. I should be upfront about how it came to be: this was reverse engineering done with an AI the whole way — me steering, deciding what to trust and verifying every claim against the bytes; the model drafting code, recalling the corners of the standard library, and proposing hypotheses I then tested. What follows is the part of Ruby that made that collaboration genuinely pleasant: Ruby strings are byte buffers, and String#unpack is a tiny, fast binary parser hiding in plain sight.
Strings are bytes
The first thing to internalise is that a Ruby String is not “text.” It’s a sequence of bytes with an encoding label attached. Read a file in binary mode and you get the raw bytes, indexable and sliceable like any string:
data = File.binread("aib.big") # the whole file as an ASCII-8BIT String
data[0, 4] # => "BIGF" — the first four bytes
data.bytesize # => 3448832
File.binread is the key: it reads the file as binary (ASCII-8BIT / BINARY encoding), so bytes of 0x80 and above are never interpreted as UTF-8, and offsets and lengths count bytes rather than characters. From there, data[offset, length] carves out byte ranges, and data.index(needle, from) finds a magic number or a marker anywhere in the file. That’s most of a parser already.
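A minimal sketch of those three operations, run against a small made-up buffer written to a temporary file rather than a real archive. The payload bytes and the FACE marker are invented for illustration; only the BIGF magic comes from the format.

```ruby
require "tempfile"

# Made-up bytes: the magic, a few high-bit bytes, then an invented "FACE" marker.
payload = "BIGF".b + "\x80\xFF\x00\x01".b + "FACE".b

file = Tempfile.new("demo")
file.binmode
file.write(payload)
file.close

data = File.binread(file.path)

encoding  = data.encoding            # Encoding::ASCII_8BIT (alias Encoding::BINARY)
magic     = data[0, 4]               # "BIGF"
high_byte = data.getbyte(4)          # 128: the 0x80 byte came through untouched
marker_at = data.index("FACE".b, 4)  # 8: byte offset of the marker
past_end  = data[100, 4]             # nil: slicing beyond the end does not raise

file.unlink
```

The point of the last line is that a bad offset surfaces as nil, which the caller still has to check before calling unpack on it.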
unpack: the binary decoder you already have
The workhorse is String#unpack (and its single-value sibling unpack1). You hand it a format string of directives and it decodes the bytes. The two directives that did 90% of the work here:
- V: an unsigned 32-bit integer, little-endian. Every count, block index, offset and size in BIGF is a V.
- e: a little-endian single-precision (32-bit) float. The AI data is arrays of these: the racing-line coordinates, the control values, the padding.
data[4, 4].unpack1("V") # => 39 — the entry count, as a u32 LE
data[12, 16].unpack("e4") # => [0.0, 0.0, 137.0, 0.0] — four float32s
Endianness lives in the directive, which is the whole game: V is little-endian u32, N is big-endian; e is little-endian float, g is big-endian. Codemasters’ PC games are little-endian, so it’s V and e throughout. (When we later looked at an Xbox 360 file, big-endian PowerPC, it would have been N and g; the format string is the only thing that changes.)
unpack is implemented in C inside the interpreter, so decoding a few hundred thousand floats is not slow. You are not paying a “scripting language” tax here.
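To see the directives side by side, here is a small round trip that builds the bytes with Array#pack and decodes them both ways. The values 39 and 137.0 are just sample numbers borrowed from the snippets above, not a claim about any particular file.

```ruby
# Build eight bytes by hand: a u32 and a float32, both little-endian.
bytes = [39, 137.0].pack("Ve")

le_int   = bytes[0, 4].unpack1("V")  # 39
be_int   = bytes[0, 4].unpack1("N")  # 654311424 (0x27000000): same bytes, wrong byte order
le_float = bytes[4, 4].unpack1("e")  # 137.0
both     = bytes.unpack("Ve")        # [39, 137.0]: one format string, mixed types

# The big-endian twins produce the same bytes in reverse order.
be_bytes = [39, 137.0].pack("Ng")
swapped  = be_bytes[0, 4] == bytes[0, 4].reverse  # true
```

Decoding with the wrong directive never raises; it just hands back a plausible-looking wrong number, which is why the endianness choice deserves a deliberate check against known values.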
Walking the container
BIGF is a header, a directory, and a data section. The header check is a one-liner:
MAGIC = "BIGF".b
raise "not a BIGF archive" unless data[0, 4] == MAGIC
That .b is worth a footnote: it returns a binary copy of the string literal, so the comparison is byte-for-byte regardless of source-file encoding. I use it for every binary constant.
BIGF has two directory layouts. One is a flat table of fixed 24-byte records (char name[16]; u32 size; u32 offset), which is a textbook unpack loop:
Entry = Struct.new(:name, :offset, :size, keyword_init: true) # assumed shape of Entry
members = []
count = data[4, 4].unpack1("V")
base = data[8, 4].unpack1("V") # data-section base, read from the header (not assumed!)
off = 0x24 # records start after the 0x20 header + a 4-byte pad
count.times do
  rec = data[off, 24]
  name = rec[0, 16].split("\x00").first.to_s # NUL-terminated name field
  size, offset = rec[16, 8].unpack("V2") # two u32s in one go
  members << Entry.new(name:, offset: base + offset, size:)
  off += 24
end
Three small Ruby niceties are doing real work there. rec[0, 16].split("\x00").first turns a fixed-width, NUL-padded C string into a Ruby string. unpack("V2") pulls two integers at once (the count suffix). And, a hard-won detail, base is read from the header field at 0x08 rather than hard-coded, because measuring 1,371 real files showed it isn’t always the 0x800 everyone assumes.
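The loop can be exercised end to end without a game file. The sketch below builds a tiny synthetic archive with Array#pack and parses it back, assuming only the layout described above (magic at 0, count at 4, base at 8, 24-byte records from 0x24). The member names, the 0x60 base, the read_flat_directory helper and the Entry struct are all invented for the demo, and it uses the Z16 directive as an alternative to the split idiom.

```ruby
Entry = Struct.new(:name, :offset, :size, keyword_init: true)

RECORDS_START = 0x24  # per the text: 0x20 header plus a 4-byte pad
RECORD_SIZE   = 24    # char name[16]; u32 size; u32 offset

# Parse the flat directory of an in-memory archive into Entry structs.
def read_flat_directory(data)
  raise ArgumentError, "not a BIGF archive" unless data[0, 4] == "BIGF".b

  count, base = data[4, 8].unpack("V2")
  Array.new(count) do |i|
    rec = data[RECORDS_START + i * RECORD_SIZE, RECORD_SIZE]
    raise ArgumentError, "directory truncated at record #{i}" if rec.nil? || rec.bytesize < RECORD_SIZE

    name, size, offset = rec.unpack("Z16VV")  # Z16: 16-byte field, cut at the first NUL
    Entry.new(name: name, offset: base + offset, size: size)
  end
end

# Synthetic archive: two invented members, data section based at 0x60.
payloads = { "one.aib" => "AAAA".b, "two.aib" => "BBBBBB".b }
header   = "BIGF".b + [payloads.size, 0x60].pack("V2")
header << "\x00".b * (RECORDS_START - header.bytesize)

offset = 0
directory = payloads.map do |name, bytes|
  rec = [name, bytes.bytesize, offset].pack("a16VV")  # a16 pads the name with NULs
  offset += bytes.bytesize
  rec
end.join

archive = header + directory
archive << "\x00".b * (0x60 - archive.bytesize)  # pad up to the data-section base
archive << payloads.values.join

members = read_flat_directory(archive)
first   = members.first
carved  = archive[first.offset, first.size]  # "AAAA"
```

Building the fixture with pack and reading it with unpack keeps the two format strings next to each other, which makes a field-order mistake easy to spot.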
The other layout is variable-length: names interspersed with a 0x44 00 00 00 marker. That’s where String#index shines: you scan for the extension, walk back to the preceding NUL to find the name’s start, then look just past it for the marker:
while (idx = data.index(".aib", pos)) && idx < limit
  s = idx
  s -= 1 while s.positive? && data.getbyte(s - 1) != 0 # walk back to the NUL
  name = data[s...(idx + 4)]
  pos = idx + 4
end
getbyte reads a single byte as an integer without allocating a substring, which is exactly what you want in a tight backwards scan.
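Here is the same scan as a self-contained sketch over an invented buffer, including the marker lookup the fragment above leaves out. How far past the name the marker sits is an assumption: the window parameter, the scan_names helper and the sample layout are hypothetical, not measured properties of BIGF.

```ruby
MARKER = "\x44\x00\x00\x00".b  # the 0x44 00 00 00 marker from the text
EXT    = ".aib".b

# Return [name, marker_offset] pairs found in data[0...limit].
# marker_offset is nil when no marker sits within `window` bytes after the
# name; that window is a guess for this sketch.
def scan_names(data, limit: data.bytesize, window: 8)
  found = []
  pos = 0
  while (idx = data.index(EXT, pos)) && idx < limit
    start = idx
    start -= 1 while start.positive? && data.getbyte(start - 1) != 0  # walk back to the NUL
    name_end = idx + EXT.bytesize
    marker = data.index(MARKER, name_end)
    marker = nil if marker && marker - name_end > window
    found << [data[start...name_end], marker]
    pos = name_end
  end
  found
end

# Synthetic directory: NUL, name, NUL, marker, repeated (layout invented here).
blob  = "\x00".b + "driver01.aib\x00".b + MARKER + "\x00\x00".b + "rival.aib\x00".b + MARKER
names = scan_names(blob)  # [["driver01.aib", 14], ["rival.aib", 30]]
```

Returning the marker offset alongside the name keeps the hypothesis testable: a nil in the result is a name that did not fit the assumed layout.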
Decoding the records inside
Carving a member is just a slice, data[entry.offset, entry.size], and Ruby slicing is safe: ask for bytes past the end of the file and you get a short string or nil, never a crash. Inside an AI profile, every 16 bytes is four float32s, and the parser classifies each record by its bit pattern:
SENTINEL = "\x3f\x3f\x3f\x3f".b.unpack1("e") # => 0.7470588… (the padding value)
KTAG_MAGIC = "\x0c\x00\x00\x00\x08\x00\x00\x00".b
def classify(bytes)
  return [:ktag, bytes[8, 4].unpack1("e")] if bytes[0, 8] == KTAG_MAGIC
  a, b, c, d = bytes.unpack("e4")
  if [a, b, c, d].all? { |x| (x - SENTINEL).abs < 1e-5 } then :pad
  elsif a.zero? && b.zero? && c.zero? && d.zero? then :zero
  elsif b.zero? && d.zero? then :scalar # (v,0,v,0)
  elsif [a, b, c, d].all? { |x| coordish?(x) } then :path # (x,y,x,y)
  else :other
  end
end
That SENTINEL line is a small joy: nobody had to look up “what float is 0x3f3f3f3f?” in a calculator; we let Ruby tell us by unpacking the four bytes (0.7470588…). The classifier then reads almost like prose, which matters when the prose is the format specification you’re trying to pin down.
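A classifier like this needs something to feed it 16 bytes at a time. One way to do that, sketched here with a hypothetical each_record helper and a hand-built member (a padding record, a made-up scalar record, and one stray trailing byte):

```ruby
SENTINEL     = "\x3f\x3f\x3f\x3f".b.unpack1("e")  # same value as in the text
RECORD_BYTES = 16

# Yield the offset and four floats of each complete 16-byte record.
def each_record(member)
  return enum_for(:each_record, member) unless block_given?

  (0...member.bytesize).step(RECORD_BYTES) do |off|
    rec = member[off, RECORD_BYTES]
    break if rec.bytesize < RECORD_BYTES  # ignore a truncated tail
    yield off, rec.unpack("e4")
  end
end

# Hand-built member: four sentinel floats, one (v,0,v,0) record, one stray byte.
member  = ([SENTINEL] * 4 + [1.0, 0.0, 1.0, 0.0]).pack("e8") + "\x01".b
records = each_record(member).to_a
```

Because 0x3f is the ASCII question mark, the padding record is also recognisable by eye as a run of sixteen "?" characters in a raw dump.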
One genuine gotcha lives in coordish?: some 16-byte records, read as floats, are denormals or NaN. Ruby’s Float#nan? and a magnitude check handle it cleanly, but you have to remember that NaN compares false against everything, itself included (x == x is false for NaN), so the guard is written as an explicit !x.nan? && x.abs < 1e30 && … rather than a naive comparison. (RuboCop will even nag you toward nan? if you write the x == x trick.)
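The NaN behaviour is easy to confirm by decoding the bit patterns directly. In the sketch below, plausible_coord? is a hypothetical stand-in for coordish?, and its bounds are invented for illustration; they are not the thresholds the real reader uses.

```ruby
NAN_BITS      = "\x00\x00\xc0\x7f".b  # 0x7fc00000, a quiet NaN in little-endian order
DENORMAL_BITS = "\x01\x00\x00\x00".b  # smallest positive subnormal float32

# Hypothetical stand-in for coordish?; max_abs and min_abs are made-up bounds.
def plausible_coord?(x, max_abs: 1e6, min_abs: 1e-6)
  return false if x.nan? || x.infinite?
  return true if x.zero?

  x.abs.between?(min_abs, max_abs)
end

nan      = NAN_BITS.unpack1("e")
denormal = DENORMAL_BITS.unpack1("e")

nan_equals_itself = (nan == nan)    # false: NaN never compares equal, even to itself
nan_in_range      = nan.abs < 1e30  # false: every ordered comparison with NaN is false
```

Rejecting NaN and infinity up front means the range test below it only ever sees ordinary numbers, so its result can be read at face value.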
Why Ruby, specifically
Having done this, the case for Ruby on a binary-RE task is concrete:
- Strings-as-buffers + slicing make navigation ergonomic: no cursor object, no read/seek ceremony, just data[off, len].
- unpack is a complete, fast, C-backed binary decoder with a one-character vocabulary for every integer and float width and endianness.
- Zero dependencies. The whole reader is standard library. A research tool that has to run on a stranger’s machine in five years should not depend on a gem whose API has since drifted.
- It reads like the spec. When the code that classifies a record is short enough to hold in your head, the code becomes your documentation of the format, which is the entire point of reverse-engineering.
- The REPL closes the loop. During the actual work, irb with File.binread and a one-line unpack is the fastest way to ask “what is at offset 0x5c00?” and get an answer before the thought has finished.
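For the REPL point, a helper along these lines is the kind of thing worth pasting into irb. The name peek and its output shape are my own invention here, not part of the published reader; it shows the same bytes as hex, as little-endian u32s and as little-endian floats.

```ruby
# Hypothetical irb helper: show the bytes at `offset` three ways.
def peek(data, offset, length = 16)
  chunk = data[offset, length].to_s  # nil (offset past the end) becomes ""
  {
    hex:    chunk.unpack1("H*").to_s.scan(/../).join(" "),
    u32:    chunk.unpack("V*"),
    floats: chunk.unpack("e*")
  }
end

# Sample buffer built by hand: magic, a count of 39, then two floats.
sample = "BIGF".b + [39].pack("V") + [137.0, 0.0].pack("e2")
view   = peek(sample, 0)
# view[:hex] => "42 49 47 46 27 00 00 00 00 00 09 43 00 00 00 00"
```

Seeing all three readings at once is usually enough to tell whether an unknown field is text, a count or a coordinate.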
The gotchas are few and all about staying in binary-land: read with binread, write binary constants with .b, get the endianness directive right (V/e, not N/g), use unpack1 when you want one value instead of an array, and treat NaN with respect. None of them are Ruby’s fault; they’re just what binary is.
A 2003 racing game’s AI, a four-byte magic, a table of offsets, and a few hundred thousand little-endian floats: all read by twenty lines of standard-library Ruby. The language people use for has_many :comments turns out to be a perfectly good disassembler’s notebook and, paired with an AI that never tires of unpacking the next sixteen bytes, a fast one.
Where to look
The full reader is open source, with String#unpack used in anger across two table layouts and four games:
- Repository: github.com/davidslv/bigf (MIT)
- The container parser: lib/bigf/archive.rb · the record decoder: lib/bigf/toca/profile.rb