{"slug": "the-git-filesystem-recreating-the-content-addressable-database", "title": "The Git Filesystem - Recreating the Content-Addressable Database", "summary": "The article explains that Git is fundamentally a content-addressable filesystem, not just a version control system, and that its core is the object store located in the `.git/objects` directory. It describes Git's four object types (blob, tree, commit, tag) and details the exact binary format of loose objects, which are stored as `[type] SP [size] NUL [content bytes]` before being zlib-compressed. The author aims to teach readers how to write a valid Git blob object from scratch in Go, demonstrating a deep understanding of Git's underlying mechanics rather than just its commands.", "body_md": "You've used Git every single day for years. You've resolved merge conflicts at midnight, force-pushed to the wrong branch, lost an afternoon to a detached HEAD. You know `git rebase -i` well enough to teach it. You've explained pull requests to interns.\n\nAnd yet — if someone put a gun to your head and asked *\"what actually happens when you run git commit?\"* — you'd say something like \"it... saves a snapshot?\" and hope they move on.\n\nDon't feel bad. Most developers interact with Git the same way they interact with their car engine: turn the key, go. The abstraction holds until the day it doesn't — until `git reflog` saves your job or a corrupted object wrecks a deploy and you're staring at `.git/objects` with no idea what you're looking at.\n\nThis series is about closing that gap.\n\nNot \"here are some advanced Git commands\" — there are plenty of cheat sheets for that. This is about understanding the machine underneath. The object store. The pack format. The wire protocol. The actual bytes on disk.\n\nBy the end of this post, you'll be able to write a valid Git blob object in Go — one that `git cat-file` accepts without complaint — without calling a single Git command. Not as a party trick. 
As proof that you now understand the primitive Git is built on.\n\nStop thinking of Git as a history tracker. It's a content-addressable filesystem with a thin VCS wrapper bolted on top.\n\nWhen Linus Torvalds wrote the first version of Git in 2005, he didn't start by designing branches or merges. He started by designing an object store. The VCS layer came *after*. That design decision explains everything that's confusing about Git — and everything that's elegant about it.\n\nThis post tears apart `.git/objects`, explains exactly what's on disk, and walks through a Go implementation that writes a real, valid Git object from scratch. No magic. No waving hands at \"Git uses SHA-1.\"\n\n## The `.git` Directory Is the Repository\n\nClone a repo, delete every file except `.git`, and you've lost nothing. The working tree is just a checked-out view. The *actual* data — every version of every file ever committed — lives in `.git/objects`.\n\n```\n.git/\n├── HEAD                 # pointer to current branch\n├── config               # repo-local config\n├── objects/             # THE database\n│   ├── 4b/\n│   │   └── 825dc642cb6eb9a060e54bf8d69288fbee4904\n│   ├── info/\n│   └── pack/            # packed objects (covered in Post 2)\n└── refs/\n    ├── heads/           # branch tips\n    └── tags/\n```\n\nThat file at `4b/825dc6...` is the empty tree object. Git hardcodes its hash, so the empty tree is recognized in every repository even when no such loose file has been written to disk. The name is derived from the content: hash the empty tree and you'll always get `4b825dc642cb6eb9a060e54bf8d69288fbee4904`. That's the contract.\n\n## Four Object Types, One Storage Format\n\nGit has exactly four object types. 
Everything in your repository decomposes into these:\n\n| Type | What it stores |\n|---|---|\n| `blob` | File contents (no filename, no permissions) |\n| `tree` | Directory listing (filenames + modes + child object hashes) |\n| `commit` | Snapshot pointer (tree hash + parent + author + message) |\n| `tag` | Named pointer (usually to a commit) |\n\nA critical insight: **blobs don't know their filename**. The *tree* that references them knows the filename. This is why renaming a file in Git is \"free\" — the blob object is reused, only the tree changes. And that's why two files with identical contents — anywhere in the repo, across any branch — share a single blob.\n\n## The Wire Format of a Loose Object\n\nEvery loose object follows the same binary format before being zlib-compressed:\n\n```\n[type] SP [size] NUL [content bytes]\n```\n\nFor a file containing `hello\\n`:\n\n```\nblob 6\\0hello\\n\n```\n\nThat's it. The `blob` type string, a space, the byte count of the *content* (not the header) as a decimal ASCII number, a null byte, then the raw content. Hash this entire sequence with SHA-1 and you get the object's identity. Write it zlib-compressed to `.git/objects/[first-2-hex-chars]/[remaining-38-hex-chars]`.\n\nLet's verify this with raw shell commands before implementing it in Go:\n\n``` bash\n# Write \"hello\\n\" as a blob\n$ echo \"hello\" | git hash-object -w --stdin\nce013625030ba8dba906f756967f9e9ca394464a\n\n# Confirm the file exists (exact bytes can vary with zlib version and compression level)\n$ xxd .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a | head -3\n00000000: 7801 4bca c94f 5230 63c8 48cd c9c9 e702  x.K..OR0c.H.....\n00000010: 001d c504 14                             .....\n\n# Decompress and inspect\n$ python3 -c \"\nimport zlib\nwith open('.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a', 'rb') as f:\n    print(repr(zlib.decompress(f.read())))\n\"\nb'blob 6\\x00hello\\n'\n```\n\nThere's the header: `blob 6\\x00`. Exactly as described.\n\n## Why Content-Addressability Matters\n\nThe SHA-1 hash of an object *is* its address. 
This has radical consequences:\n\n**Deduplication is automatic.** If 10,000 commits all include an unchanged `README.md`, there's exactly one blob object for it. The trees in those commits all point to the same hash. No dedup logic required — identical content produces identical hashes.\n\n**Corruption is detectable.** Flip one bit in any object file and `git fsck` will catch it: either the zlib stream no longer decompresses, or the stored filename no longer matches the hash of the decompressed content.\n\n**Caching is trivially correct.** If you have object `abc123`, you have it forever, unchanged. There's no cache invalidation problem. Object identity is content identity.\n\nThis is why `git clone` is trustworthy across mirrors — you don't have to trust the mirror, only the hash, provided you learned the expected commit hash from a source you do trust.\n\n⚠️ **The SHA-1 → SHA-256 Transition**\n\nIn 2017, researchers at Google and CWI Amsterdam published the [SHAttered attack](https://shattered.io/), which produced two PDFs with the same SHA-1 hash. For Git, the risk isn't academic: the attack showed that it is in principle possible to create two Git repositories with the same head commit hash but different contents — say, a benign source and a backdoored one.\n\nGit has been transitioning to SHA-256. Git 2.29 (2020) introduced experimental support for SHA-256 via the `sha256` object format. Repositories can now be initialized with `git init --object-format=sha256`. Git 2.51 further progressed this transition, and backward compatibility remains critical since Git is used by millions of developers. One mitigating factor: Torvalds noted that \"Git doesn't actually just hash the data — it prepends a type/length field to it. That usually tends to make collision attacks much harder, because you either have to make the resulting size the same too, or you have to be able to also edit the size field in the header.\"\n\n## Building It: A Go Implementation\n\nEnough theory. 
Here's a self-contained Go program that writes a valid Git blob object — one that `git cat-file` can read back:\n\n```\npackage main\n\nimport (\n    \"compress/zlib\"\n    \"crypto/sha1\"\n    \"fmt\"\n    \"os\"\n    \"path/filepath\"\n)\n\n// GitObject represents a raw Git object before storage.\ntype GitObject struct {\n    Type    string\n    Content []byte\n}\n\n// Header returns the [type] SP [size] NUL prefix Git prepends to all objects.\nfunc (o *GitObject) Header() []byte {\n    return []byte(fmt.Sprintf(\"%s %d\\x00\", o.Type, len(o.Content)))\n}\n\n// Store computes the SHA-1 hash of header+content, zlib-compresses the full\n// payload, and writes it to the correct nested path under .git/objects/.\nfunc (o *GitObject) Store(gitDir string) (string, error) {\n    // Step 1: build the full store payload = header + raw content\n    header := o.Header()\n    payload := append(header, o.Content...)\n\n    // Step 2: SHA-1 hash the payload — this IS the object's identity\n    sum := sha1.Sum(payload)\n    hash := fmt.Sprintf(\"%x\", sum)\n\n    // Step 3: derive the on-disk path: objects/[2-char]/[38-char]\n    objDir := filepath.Join(gitDir, \"objects\", hash[:2])\n    objPath := filepath.Join(objDir, hash[2:])\n\n    // Step 4: skip if object already exists (content-addressable = idempotent)\n    if _, err := os.Stat(objPath); err == nil {\n        return hash, nil\n    }\n\n    if err := os.MkdirAll(objDir, 0755); err != nil {\n        return \"\", fmt.Errorf(\"mkdir: %w\", err)\n    }\n\n    // Step 5: write zlib-compressed payload to disk\n    f, err := os.OpenFile(objPath, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0444)\n    if err != nil {\n        return \"\", fmt.Errorf(\"create object file: %w\", err)\n    }\n    defer f.Close()\n\n    w := zlib.NewWriter(f)\n    if _, err := w.Write(payload); err != nil {\n        return \"\", fmt.Errorf(\"compress write: %w\", err)\n    }\n    if err := w.Close(); err != nil {\n        return \"\", 
fmt.Errorf(\"compress close: %w\", err)\n    }\n\n    return hash, nil\n}\n\n// ReadObject reads and decompresses a Git object, returning its type and\n// raw content after validating the header. The inverse of Store().\nfunc ReadObject(gitDir, hash string) (objType string, content []byte, err error) {\n    // Guard against hashes too short to split into directory and file name\n    if len(hash) < 3 {\n        return \"\", nil, fmt.Errorf(\"invalid object hash %q\", hash)\n    }\n    path := filepath.Join(gitDir, \"objects\", hash[:2], hash[2:])\n\n    f, err := os.Open(path)\n    if err != nil {\n        return \"\", nil, fmt.Errorf(\"open: %w\", err)\n    }\n    defer f.Close()\n\n    r, err := zlib.NewReader(f)\n    if err != nil {\n        return \"\", nil, fmt.Errorf(\"zlib: %w\", err)\n    }\n    defer r.Close()\n\n    var raw []byte\n    buf := make([]byte, 4096)\n    for {\n        n, readErr := r.Read(buf)\n        raw = append(raw, buf[:n]...)\n        // Any read error (including the normal EOF) ends the loop; a\n        // truncated stream is caught by the size check below.\n        if readErr != nil {\n            break\n        }\n    }\n\n    // Parse header: find the null byte separating header from content\n    nullIdx := -1\n    for i, b := range raw {\n        if b == 0x00 {\n            nullIdx = i\n            break\n        }\n    }\n    if nullIdx == -1 {\n        return \"\", nil, fmt.Errorf(\"malformed object: no null byte in header\")\n    }\n\n    // Header format: \"[type] [size]\"\n    var size int\n    header := string(raw[:nullIdx])\n    if _, err := fmt.Sscanf(header, \"%s %d\", &objType, &size); err != nil {\n        return \"\", nil, fmt.Errorf(\"malformed header %q: %w\", header, err)\n    }\n\n    // The declared size must match the bytes that follow the header\n    content = raw[nullIdx+1:]\n    if size != len(content) {\n        return \"\", nil, fmt.Errorf(\"size mismatch: header says %d, got %d bytes\", size, len(content))\n    }\n\n    return objType, content, nil\n}\n\nfunc main() {\n    content := []byte(\"hello, git internals\\n\")\n\n    obj := &GitObject{\n        Type:    \"blob\",\n        Content: content,\n    }\n\n    hash, err := obj.Store(\".git\")\n    if err != nil {\n        fmt.Fprintf(os.Stderr, \"error: %v\\n\", err)\n        os.Exit(1)\n    }\n\n    fmt.Printf(\"wrote blob: %s\\n\", hash)\n    fmt.Printf(\"path: .git/objects/%s/%s\\n\", hash[:2], hash[2:])\n\n    // Round-trip: read it back\n    objType, readContent, err := ReadObject(\".git\", hash)\n    if err != nil {\n        fmt.Fprintf(os.Stderr, \"read error: %v\\n\", err)\n        os.Exit(1)\n    }\n\n    fmt.Printf(\"read 
back: type=%s content=%q\\n\", objType, readContent)\n\n    // Cross-verify with git cat-file\n    fmt.Printf(\"\\nVerify with: git cat-file -p %s\\n\", hash)\n    fmt.Printf(\"Type check:  git cat-file -t %s\\n\", hash)\n}\n```\n\nRun this from the root of any Git repository and then verify:\n\n``` bash\n$ go run main.go\nwrote blob: a97d3ee72059abfabb3cd99748d61e36bcc2b2c5\npath: .git/objects/a9/7d3ee72059abfabb3cd99748d61e36bcc2b2c5\nread back: type=blob content=\"hello, git internals\\n\"\n\nVerify with: git cat-file -p a97d3ee72059abfabb3cd99748d61e36bcc2b2c5\nType check:  git cat-file -t a97d3ee72059abfabb3cd99748d61e36bcc2b2c5\n$ git cat-file -p a97d3ee72059abfabb3cd99748d61e36bcc2b2c5\nhello, git internals\n```\n\nGit accepts the object as legitimate — because it *is* legitimate. Same format, same hash algorithm. The `Store` function mirrors the core of what `git hash-object -w` does internally; real Git additionally writes to a temporary file and renames it into place, so readers never see a half-written object.\n\n## The Two-Character Directory Split\n\nWhy `objects/a9/7d3ee7...` instead of `objects/a97d3ee7...`?\n\nFilesystems degrade when directories contain too many entries. Old `ext2` used linear lookups; even modern filesystems show measurable slowdown with hundreds of thousands of entries in a single directory. By splitting on the first two hex characters, Git limits `objects/` itself to 256 fan-out subdirectories (one per two-hex-char prefix, alongside `info/` and `pack/`) and spreads the loose objects evenly among them. A repository with a million loose objects distributes them across at most 256 directories of ~3,900 objects each.\n\nIt's a pragmatic workaround for filesystem limitations, not an algorithmic requirement. The SHA-256 transition keeps the same split (`objects/[2]/[62]`), just with longer filenames.\n\n## What a Tree Object Actually Looks Like\n\nTrees deserve their own dissection. Unlike commits and tags, whose bodies are plain text, a tree body is binary:\n\n```\n[mode] SP [filename] NUL [20-byte-raw-SHA-1]\n[mode] SP [filename] NUL [20-byte-raw-SHA-1]\n...\n```\n\nNotice: **raw bytes**, not hex. 
Each entry is a mode string (like `100644` for a regular file, `40000` for a subdirectory, `100755` for an executable), the filename, a null byte, and then 20 raw bytes of the referenced object's SHA-1. The directory mode is stored without a leading zero; `git cat-file -p` pads it to `040000` for display. Entries are packed back to back, with no newline between them.\n\nYou can inspect this manually:\n\n``` bash\n# Get the tree hash of HEAD\n$ git cat-file -p HEAD\ntree 9ab456...\n...\n\n# Inspect the tree (git pretty-prints it)\n$ git cat-file -p 9ab456\n100644 blob ce013625030ba8dba906f756967f9e9ca394464a    README.md\n040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904    src\n\n# But the raw bytes are not ASCII — decode them yourself:\n$ python3 -c \"\nimport zlib\nwith open('.git/objects/9a/b456...', 'rb') as f:\n    raw = zlib.decompress(f.read())\n# skip header (find null byte)\nnull = raw.index(b'\\x00')\nentries = raw[null+1:]\n# each entry: mode SP name NUL 20-bytes\ni = 0\nwhile i < len(entries):\n    space = entries.index(b' ', i)\n    mode = entries[i:space].decode()\n    null2 = entries.index(b'\\x00', space)\n    name = entries[space+1:null2].decode()\n    sha = entries[null2+1:null2+21].hex()\n    print(f'{mode} {name} -> {sha}')\n    i = null2 + 21\n\"\n```\n\n## Plumbing Commands You Should Know\n\nGit ships with \"porcelain\" (user-facing) and \"plumbing\" (low-level) commands. When you're working at this level, you want plumbing:\n\n``` bash\n# Hash content without writing (dry run)\n$ echo \"test\" | git hash-object --stdin\n\n# Hash AND write to object store\n$ echo \"test\" | git hash-object -w --stdin\n\n# Inspect object type\n$ git cat-file -t <hash>\n\n# Print object content (pretty-printed for trees)\n$ git cat-file -p <hash>\n\n# Print raw content size\n$ git cat-file -s <hash>\n\n# Verify object database integrity and list unreachable objects\n$ git fsck --unreachable\n\n# List every object reachable from any ref\n$ git rev-list --objects --all\n\n# Manually trigger loose → packed object consolidation\n$ git gc\n```\n\n`git cat-file -p` shows you the same object data that higher-level operations consume. 
Everything Git does at the porcelain level eventually bottoms out in reads and writes of these four object types.\n\n## The Immutability Guarantee\n\nOne last thing worth internalizing: **Git objects are immutable by construction.** Once written, they never change. `git commit --amend` doesn't modify the old commit — it writes a *new* commit object and updates the branch reference to point to it. The old commit still exists in `.git/objects` until garbage-collected.\n\nThis is why `git reflog` can recover \"lost\" commits. They're not lost — the objects are still on disk. Only the references have been updated. `git gc` eventually removes unreachable objects, but not right away: by default, reflog entries that keep an abandoned commit alive expire after 30 days (`gc.reflogExpireUnreachable`), and objects that are unreachable after that are pruned only once they are more than two weeks old (`gc.pruneExpire`).\n\nThe content-addressable store is the engine. Branches, tags, the index (the staging area) — all of it is built on top of this one primitive: a compressed, hashed, immutable object store.\n\n## Next Up\n\nIn **Post 2: Pack Files and Delta Compression**, we dissect how Git takes thousands of loose objects and compresses them into a single binary packfile using delta encoding — storing only the *differences* between similar objects. For a file that's 10 MB with changes generating 100 KB deltas each time, storing 100 versions as loose objects costs ~1 GB before zlib compression. A pack file with delta compression stores one full 10 MB copy plus 99 deltas of about 100 KB each, roughly 20 MB in total. 
We'll implement a pack file reader and inspect the binary format byte by byte.\n\n## Further Reading\n\n- [Git Internals – Git Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) — the canonical reference from Pro Git\n- [Git's Database Internals I: Packed Object Store](https://github.blog/open-source/git/gits-database-internals-i-packed-object-store/) — GitHub Engineering's deep dive\n- [Hash Function Transition](https://git-scm.com/docs/hash-function-transition) — the official SHA-256 migration spec\n- [Git source: `object.c`](https://github.com/git/git/blob/master/object.c) — the actual C implementation\n- [SHAttered](https://shattered.io/) — the SHA-1 collision that accelerated the transition", "url": "https://wpnews.pro/news/the-git-filesystem-recreating-the-content-addressable-database", "canonical_source": "https://dev.to/arnabsantra2004/the-git-filesystem-recreating-the-content-addressable-database-444h", "published_at": "2026-05-23 18:50:27+00:00", "updated_at": "2026-05-23 19:02:57.553864+00:00", "lang": "en", "topics": ["developer-tools", "open-source", "data"], "entities": ["Git", "Linus Torvalds", "Go"], "alternates": {"html": "https://wpnews.pro/news/the-git-filesystem-recreating-the-content-addressable-database", "markdown": "https://wpnews.pro/news/the-git-filesystem-recreating-the-content-addressable-database.md", "text": "https://wpnews.pro/news/the-git-filesystem-recreating-the-content-addressable-database.txt", "jsonld": "https://wpnews.pro/news/the-git-filesystem-recreating-the-content-addressable-database.jsonld"}}