The Git Filesystem - Recreating the Content-Addressable Database

The article explains that Git is fundamentally a content-addressable filesystem, not just a version control system, and that its core is the object store located in the `.git/objects` directory. It describes Git's four object types (blob, tree, commit, tag) and details the exact binary format of loose objects, which are stored as `[type] SP [size] NUL [content bytes]` before being zlib-compressed. The author aims to teach readers how to write a valid Git blob object from scratch in Go, demonstrating a deep understanding of Git's underlying mechanics rather than just its commands.

You've used Git every single day for years. You've resolved merge conflicts at midnight, force-pushed to the wrong branch, lost an afternoon to a detached HEAD. You know `git rebase -i` well enough to teach it. You've explained pull requests to interns.

And yet, if someone put a gun to your head and asked "what actually happens when you run `git commit`?", you'd say something like "it... saves a snapshot?" and hope they move on.

Don't feel bad. Most developers interact with Git the same way they interact with their car engine: turn the key, go. The abstraction holds until the day it doesn't: until `git reflog` saves your job, or a corrupted object wrecks a deploy and you're staring at `.git/objects` with no idea what you're looking at.

This series is about closing that gap. Not "here are some advanced Git commands"; there are plenty of cheat sheets for that. This is about understanding the machine underneath. The object store. The pack format. The wire protocol. The actual bytes on disk.

By the end of this post, you'll be able to write a valid Git blob object in Go, one that `git cat-file` accepts without complaint, without calling a single Git command. Not as a party trick. As proof that you now understand the primitive Git is built on.

Stop thinking of Git as a history tracker. It's a content-addressable filesystem with a thin VCS wrapper bolted on top.

When Linus Torvalds wrote the first version of Git in 2005, he didn't start by designing branches or merges. He started by designing an object store. The VCS layer came after. That design decision explains everything that's confusing about Git, and everything that's elegant about it.

This post tears apart `.git/objects`, explains exactly what's on disk, and walks through a Go implementation that writes a real, valid Git object from scratch. No magic. No waving hands at "Git uses SHA-1."

The .git Directory Is the Repository

Clone a repo, delete every file except `.git`, and you've lost nothing.
The working tree is just a checked-out view. The actual data (every version of every file ever committed) lives in `.git/objects`.

```
.git/
├── HEAD            # pointer to current branch
├── config          # repo-local config
├── objects/        # THE database
│   ├── 4b/
│   │   └── 825dc642cb6eb9a060e54bf8d69288fbee4904
│   ├── info/
│   └── pack/       # packed objects (covered in Post 2)
└── refs/
    ├── heads/      # branch tips
    └── tags/
```

That file at `4b/825dc6...` is the empty tree object. Its hash is the same in every Git repository, and Git recognizes that hash even when no file for it has been written to disk. The name IS its content: hash the empty tree structure and you'll always get `4b825dc642cb6eb9a060e54bf8d69288fbee4904`. That's the contract.

Four Object Types, One Storage Format

Git has exactly four object types. Everything in your repository decomposes into these:

| Type | What it stores |
|---|---|
| blob | File contents (no filename, no permissions) |
| tree | Directory listing (filenames + modes + child object hashes) |
| commit | Snapshot pointer (tree hash + parent + author + message) |
| tag | Named pointer (usually to a commit) |

A critical insight: blobs don't know their filename. The tree that references them knows the filename. This is why renaming a file in Git is "free": the blob object is reused, only the tree changes. And that's why two files with identical contents, anywhere in the repo and across any branch, share a single blob.

The Wire Format of a Loose Object

Every loose object follows the same binary format before being zlib-compressed:

```
[type] SP [size] NUL [content bytes]
```

For a file containing `hello\n`:

```
blob 6\0hello\n
```

That's it. The `blob` type string, a space, the byte count of the content (not the header), a null byte, then the raw content. Hash this entire sequence with SHA-1 and you get the object's identity. Write it zlib-compressed to `.git/objects/<first-2-hex-chars>/<remaining-38-hex-chars>`.
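The hashing step described above can be checked in memory with a few lines of Go. This is a minimal sketch rather than the full implementation: `blobID` is a hypothetical helper name introduced only for illustration, and nothing is compressed or written to disk. It builds the header, hashes header plus content, and prints the ID together with the two-level path the object would occupy.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// blobID (a hypothetical helper, not part of Git) returns the hex SHA-1
// that identifies content when it is stored as a blob object.
func blobID(content []byte) string {
	// Header: type, space, content length in bytes, NUL.
	header := fmt.Sprintf("blob %d\x00", len(content))

	// The hash covers the header followed by the raw content.
	h := sha1.New()
	h.Write([]byte(header))
	h.Write(content)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	id := blobID([]byte("hello\n"))
	fmt.Println(id) // ce013625030ba8dba906f756967f9e9ca394464a

	// Loose objects live under objects/<first 2 hex chars>/<remaining 38>.
	fmt.Printf(".git/objects/%s/%s\n", id[:2], id[2:])
}
```

The printed ID should match what `git hash-object --stdin` reports for the same bytes, which is a quick way to confirm the header format is right before any compression or file I/O enters the picture.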
Let's verify this with raw shell commands before implementing it in Go:

```bash
# Write "hello\n" as a blob
$ echo "hello" | git hash-object -w --stdin
ce013625030ba8dba906f756967f9e9ca394464a

# Confirm the file exists (the bytes are compressed, so not readable yet)
$ xxd .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a | head -3
# output omitted: a short zlib stream whose first byte is 0x78;
# the remaining bytes depend on the compression level Git used

# Decompress and inspect
$ python3 -c "
import zlib
with open('.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a', 'rb') as f:
    print(repr(zlib.decompress(f.read())))
"
b'blob 6\x00hello\n'
```

There's the header: `blob 6\x00`. Exactly as described.

Why Content-Addressability Matters

The SHA-1 hash of an object is its address. This has radical consequences:

- Deduplication is automatic. If 10,000 commits all include an unchanged `README.md`, there's exactly one blob object for it. The trees in those commits all point to the same hash. No dedup logic required: identical content produces identical hashes.
- Corruption is detectable. Flip one bit in any object file and `git fsck` will catch it: either the zlib stream no longer decompresses, or the stored filename no longer matches the hash of the decompressed content.
- Caching is trivially correct. If you have object `abc123`, you have it forever, unchanged. There's no cache invalidation problem. Object identity is content identity. This is why `git clone` is trustworthy across mirrors: you don't have to trust the mirror, only the hash.

⚠️ The SHA-1 → SHA-256 Transition

In 2017, the SHAttered attack (from CWI Amsterdam and Google) produced two PDFs with the same SHA-1 hash. For Git, the risk isn't casual: the SHAttered authors noted that it is essentially possible to create two Git repositories with the same head commit hash but different contents, say a benign source tree and a backdoored one.

Git has been transitioning to SHA-256. Git 2.29 (2020) introduced experimental support for SHA-256 via the `sha256` object format. Repositories can now be initialized with `git init --object-format=sha256`.
Later releases, including Git 2.51, continued this transition. Backward compatibility remains the hard constraint, since millions of developers depend on existing SHA-1 repositories.

One mitigating factor: Torvalds noted that "Git doesn't actually just hash the data, it prepends a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header."

Building It: A Go Implementation

Enough theory. Here's a self-contained Go program that writes a valid Git blob object, one that `git cat-file` can read back:

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// GitObject represents a raw Git object before storage.
type GitObject struct {
	Type    string
	Content []byte
}

// Header returns the "<type> SP <size> NUL" prefix Git prepends to all objects.
func (o *GitObject) Header() []byte {
	return []byte(fmt.Sprintf("%s %d\x00", o.Type, len(o.Content)))
}

// Store computes the SHA-1 hash of header+content, zlib-compresses the full
// payload, and writes it to the correct nested path under .git/objects/.
func (o *GitObject) Store(gitDir string) (string, error) {
	// Step 1: build the full store payload = header + raw content
	header := o.Header()
	payload := append(header, o.Content...)

	// Step 2: SHA-1 hash the payload: this IS the object's identity
	sum := sha1.Sum(payload)
	hash := fmt.Sprintf("%x", sum)

	// Step 3: derive the on-disk path: objects/<2-char>/<38-char>
	objDir := filepath.Join(gitDir, "objects", hash[:2])
	objPath := filepath.Join(objDir, hash[2:])

	// Step 4: skip if object already exists (content-addressable = idempotent)
	if _, err := os.Stat(objPath); err == nil {
		return hash, nil
	}
	if err := os.MkdirAll(objDir, 0755); err != nil {
		return "", fmt.Errorf("mkdir: %w", err)
	}

	// Step 5: write zlib-compressed payload to disk (read-only, like Git's objects)
	f, err := os.OpenFile(objPath, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0444)
	if err != nil {
		return "", fmt.Errorf("create object file: %w", err)
	}
	defer f.Close()

	w := zlib.NewWriter(f)
	if _, err := w.Write(payload); err != nil {
		return "", fmt.Errorf("compress write: %w", err)
	}
	if err := w.Close(); err != nil {
		return "", fmt.Errorf("compress close: %w", err)
	}
	return hash, nil
}

// ReadObject reads and decompresses a Git object, returning its type and
// raw content. The inverse of Store.
func ReadObject(gitDir, hash string) (objType string, content []byte, err error) {
	path := filepath.Join(gitDir, "objects", hash[:2], hash[2:])
	f, err := os.Open(path)
	if err != nil {
		return "", nil, fmt.Errorf("open: %w", err)
	}
	defer f.Close()

	r, err := zlib.NewReader(f)
	if err != nil {
		return "", nil, fmt.Errorf("zlib: %w", err)
	}
	defer r.Close()

	// Decompress the whole object; loose objects are read in one piece here.
	raw, err := io.ReadAll(r)
	if err != nil {
		return "", nil, fmt.Errorf("decompress: %w", err)
	}

	// Parse header: find the null byte separating header from content
	nullIdx := bytes.IndexByte(raw, 0x00)
	if nullIdx == -1 {
		return "", nil, fmt.Errorf("malformed object: no null byte in header")
	}

	// Header format: "<type> <size>"
	var size int
	if _, err := fmt.Sscanf(string(raw[:nullIdx]), "%s %d", &objType, &size); err != nil {
		return "", nil, fmt.Errorf("malformed header: %w", err)
	}
	content = raw[nullIdx+1:]
	if size != len(content) {
		return "", nil, fmt.Errorf("size mismatch: header says %d, got %d", size, len(content))
	}
	return objType, content, nil
}

func main() {
	content := []byte("hello, git internals\n")
	obj := &GitObject{
		Type:    "blob",
		Content: content,
	}

	hash, err := obj.Store(".git")
	if err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("wrote blob: %s\n", hash)
	fmt.Printf("path: .git/objects/%s/%s\n", hash[:2], hash[2:])

	// Round-trip: read it back
	objType, readContent, err := ReadObject(".git", hash)
	if err != nil {
		fmt.Fprintf(os.Stderr, "read error: %v\n", err)
		os.Exit(1)
	}
	fmt.Printf("read back: type=%s content=%q\n", objType, readContent)

	// Cross-verify with git cat-file
	fmt.Printf("\nVerify with: git cat-file -p %s\n", hash)
	fmt.Printf("Type check: git cat-file -t %s\n", hash)
}
```

Run this in any Git repo directory and then verify:

```bash
$ go run main.go
wrote blob: a97d3ee72059abfabb3cd99748d61e36bcc2b2c5
path: .git/objects/a9/7d3ee72059abfabb3cd99748d61e36bcc2b2c5
read back: type=blob content="hello, git internals\n"

Verify with: git cat-file -p a97d3ee72059abfabb3cd99748d61e36bcc2b2c5
Type check: git cat-file -t a97d3ee72059abfabb3cd99748d61e36bcc2b2c5

$ git cat-file -p a97d3ee72059abfabb3cd99748d61e36bcc2b2c5
hello, git internals
```

Git accepts the object as legitimate, because it is legitimate. Same format, same hash algorithm. The `Store` function does essentially what `git hash-object -w` does for a blob. The main difference is durability: Git writes to a temporary file and renames it into place, so a reader never sees a half-written object.

The Two-Character Directory Split

Why `objects/a9/7d3ee7...` instead of `objects/a97d3ee7...`?

Filesystems degrade when directories contain too many entries. Old ext2 used linear lookups; even modern filesystems can show measurable slowdown with hundreds of thousands of entries in a single directory.
By splitting on the first two hex characters, Git spreads loose objects across at most 256 subdirectories (one per two-hex-char prefix). A repository with a million loose objects would distribute them across those 256 directories at roughly 3,900 objects each. It's a pragmatic workaround for filesystem limitations, not an algorithmic requirement. The SHA-256 transition keeps the same split (`objects/<2>/<62>`), just with longer filenames.

What a Tree Object Actually Looks Like

Trees deserve their own dissection. The body is binary, unlike the plain-text bodies of commits and tags:

```
[mode] SP [filename] NUL [20-byte-raw-SHA-1]
[mode] SP [filename] NUL [20-byte-raw-SHA-1]
...
```

Notice: raw bytes, not hex. Each entry is a mode string (like `100644` for a regular file, `100755` for an executable, `40000` for a subdirectory), the filename, a null byte, and then 20 raw bytes of the referenced object's SHA-1. Entries follow each other back to back with no separator; the line breaks above are only for readability. The subdirectory mode is stored as `40000`, and `git cat-file -p` pads it to `040000` for display.

You can inspect this manually:

```bash
# Get the tree hash of HEAD
$ git cat-file -p HEAD
tree 9ab456...
...

# Inspect the tree (git pretty-prints it)
$ git cat-file -p 9ab456
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    README.md
040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src
```

But the raw bytes are not ASCII. Decode them yourself:

```bash
$ python3 -c "
import zlib
with open('.git/objects/9a/b456...', 'rb') as f:
    raw = zlib.decompress(f.read())

# skip header (find null byte)
null = raw.index(b'\x00')
entries = raw[null+1:]

# each entry: mode SP name NUL 20-bytes
i = 0
while i < len(entries):
    space = entries.index(b' ', i)
    mode = entries[i:space].decode()
    null2 = entries.index(b'\x00', space)
    name = entries[space+1:null2].decode()
    sha = entries[null2+1:null2+21].hex()
    print(f'{mode} {name} - {sha}')
    i = null2 + 21
"
```

Plumbing Commands You Should Know

Git ships with "porcelain" (user-facing) and "plumbing" (low-level) commands.
When you're working at this level, you want plumbing:

```bash
# Hash content without writing (dry run)
$ echo "test" | git hash-object --stdin

# Hash AND write to object store
$ echo "test" | git hash-object -w --stdin

# Inspect object type
$ git cat-file -t <hash>

# Print object content (pretty-printed for trees)
$ git cat-file -p <hash>

# Print raw content size
$ git cat-file -s <hash>

# Verify object database integrity and list unreachable objects
$ git fsck --unreachable

# List all objects reachable from any ref
$ git rev-list --objects --all

# Manually trigger loose → packed object consolidation
$ git gc
```

What `git cat-file -p` shows is the same data that higher-level operations consume. Everything Git does at the porcelain level eventually bottoms out in reads and writes of these four object types.

The Immutability Guarantee

One last thing worth internalizing: Git objects are immutable by construction. Once written, they never change. `git commit --amend` doesn't modify the old commit. It writes a new commit object and updates the branch reference to point to it. The old commit still exists in `.git/objects` until garbage-collected.

This is why `git reflog` can recover "lost" commits. They're not lost: the objects are still on disk. Only the references have been updated. `git gc` eventually removes unreachable objects, but with default settings only after the reflog entries that still mention them have expired (30 days for unreachable commits) and the objects are older than the two-week prune grace period.

The content-addressable store is the engine. Branches, tags, the index, the staging area: all of it is built on top of this one primitive, a compressed, hashed, immutable object store.

Next Up

In Post 2: Pack Files and Delta Compression, we dissect how Git takes thousands of loose objects and compresses them into a single binary packfile using delta encoding, storing only the differences between similar objects. For a file that's 10 MB with changes generating 100 KB deltas each time, storing 100 versions as full loose objects costs ~1 GB before zlib compression. A pack file that stores one full copy plus 99 deltas (10 MB + 99 × 100 KB) brings that down to roughly 20 MB.
We'll implement a pack file reader and inspect the binary format byte by byte.

Further Reading

- [Git Internals – Git Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects): the canonical reference from Pro Git
- [Git's Database Internals I: Packed Object Store](https://github.blog/open-source/git/gits-database-internals-i-packed-object-store/): GitHub Engineering's deep dive
- [Hash Function Transition](https://git-scm.com/docs/hash-function-transition): the official SHA-256 migration spec
- [Git source: `object.c`](https://github.com/git/git/blob/master/object.c): the actual C implementation
- [SHAttered](https://shattered.io/): the SHA-1 collision that accelerated the transition