{"slug": "i-taught-a-bucket-to-speak-git", "title": "I taught a bucket to speak Git", "summary": "A developer taught a Tigris object storage bucket to act as a Git server by leveraging the go-git library's filesystem abstraction, billy. The experiment demonstrates that Git's content-addressed objects are a natural fit for append-only object storage, potentially enabling stateless, decentralized Git hosting without single points of failure.", "body_md": "# I taught a bucket to speak git\n\nWhat happens if I just point a git server at an object storage bucket?\n\nBack when I was porting\n[agent sandboxes to Go](https://www.tigrisdata.com/blog/agent-sandbox-go/), I\nbuilt everything on top of\n[billy](https://pkg.go.dev/github.com/go-git/go-billy/v6), a filesystem\nabstraction for Go. The whole trick of the project was teaching a Tigris bucket\nto act enough like a filesystem that a shell interpreter and its tools couldn’t\ntell the difference. Billy was the key layer that made the entire façade fall\ninto place.\n\nAfter I had gotten things working, I learned that I was using billy way outside\nits normal use case. It was originally made for\n[go-git](https://pkg.go.dev/github.com/go-git/go-git/v6), a pure-Go\nimplementation of git’s protocols and data formats. It doesn’t rely on the\n`/usr/bin/git` binary existing at all. Every method on billy’s filesystem\ninterface exists purely because go-git needs it. This gave me a terrible idea: I\nalready have a bucket that can quack like a filesystem and go-git’s native\nlanguage is “filesystem”.\n\nCan this Just Work™? Let's find out.\n\n## Git was always an object store\n\nIf you strip away the porcelain, a git repository is four basic things:\n\n- Objects, or compressed blobs of data. Most of the objects in any individual repository are files.\n- Trees, or objects that map to other objects. 
TL;DR: trees are folders.\n- Commits, or objects that point at one tree and at their parent commit. This lets you pin down which files belong to one logical change set.\n- Refs (branches and tags), which are tiny mutable pointers into the pile of objects.\n\nUntil I started working on this, I was under the impression that git stored only the patches applied to an empty folder and that was how it reconstructed the history of your repository. It does not. It actually stores entire files, which explains why big binary blobs fudge the tooling so much. The diff mental model works fine for using git day to day; it’s just wrong at the storage layer, which is the layer this post lives in.\n\nFor example, let’s say I just made a new git repository and committed a\nREADME.md to it. The tree for the `.git` folder looks something like this:\n\n``` bash\n$ tree .git\n.git\n├── COMMIT_EDITMSG\n├── config\n├── HEAD\n├── index\n├── objects\n│   ├── 5e\n│   │   └── b8151eb669aa4467b6dea2c4bce19183cd0b41\n│   ├── 6a\n│   │   └── 6a8ecfcae2632152486aca3d9150ef83dedd66\n│   ├── f4\n│   │   └── d2487a1c6d742c8037c0296ddf80625190bd80\n│   ├── info\n│   └── pack\n└── refs\n    ├── heads\n    │   └── main\n    └── tags\n```\n\nAs you can see, there are three objects. One of them is the commit\n`5eb8151eb669aa4467b6dea2c4bce19183cd0b41`, the next is the tree, and the last\none is the README file. The `main` branch also points to that commit:\n\n``` bash\n$ cat .git/refs/heads/main\n5eb8151eb669aa4467b6dea2c4bce19183cd0b41\n```\n\nThe cool part is that half of this is content-addressed. The content-addressed\nbits *never change* once they’ve been committed. Git objects are a great fit for\nTigris’ internal model because they are append-only, just like\n[the fundamental model Tigris is built upon](https://www.tigrisdata.com/blog/append-only-storage/).\nThe things that do change often are the refs, which are updated to point to the\nlatest commit. 
These are *tiny* files though, which means that Tigris can handle\nthem with no effort required.\n\nHowever, when we host git repositories on a server, we end up creating single points of failure. Our git repos are hosted on single machines that can and will break. The entire implementation relies on git objects being 1:1 correlated with filesystem objects because everyone (even GitHub) shells out to the git binary to actually store files. Hosting git repos becomes one of the most stateful services in our stateless cloud-native environment.\n\nSure, git is decentralized in theory, but most of us have ended up using that to put our git repositories in one big store that has questionable uptime practices: GitHub. To be fair to hubbers, GitHub operates at a scale that none of us can really think about. They’ve been pushing the limits since their inception, when they had to get Engine Yard to keep building them bigger servers to handle the load. They have to do everything with a big mounted filesystem because git’s tooling gives them no other option.\n\n## A travesty of horrors beyond human comprehension\n\nNow suppose this weirdness bothers you enough to do something about it. To build\na git server **without** storing everything in the local filesystem, you have to\nspeak git somehow, and the conventional options aren’t really all that great:\n\n- If you shell out to the git binary, now your “library” is the argv of the git\nprocess and your error handling is screen-scraping output. Internally, git\nimplements its functionality with a billionty subcommands rather than exposing\nit all as a library. The codebase is held together by load-bearing calls to\n`die()`, which kills the process.\n- If you link into git’s guts with `libgit`, you inherit the “when things go bad, `die()`” behaviour and your app now suddenly starts crashing at random. This is not good for uptime.\n
- If you try to use `libgit2` (the rewrite-that’s-actually-a-library), you have to reckon with the fact that it’s addled by the GPL (with a linking exception, try explaining that to your lawyers), you have to eat the jump to C every time you do anything with git (very often), development has stalled, the Go bindings have been archived, and it still assumes a local filesystem despite assurances it does not.\n\nIt might sound hopeless, right? You may be able to use WebAssembly or something\nto contain the madness (assuming you have a good way to implement\n`fork()`/`exec()` or `posix_spawn()` or something similar), but what if there\nwas a pure Go library that could handle this all for us?\n\nEnter [go-git](https://pkg.go.dev/github.com/go-git/go-git/v6), a pure-Go\nimplementation of the git protocol and internals from scratch. It doesn’t rely\non cgo or `/usr/bin/git` and it does not assume the repositories are stored in\nthe local filesystem. Its storage interface is written against billy, the exact\ninterface I’ve already taught to speak Tigris. I wanted a git server that was\njust in a bucket and the pieces were sitting there and calling to me.\n\n## Oh no, it works\n\nSo I hacked up [objgit](https://github.com/tigrisdata/objgit), a git server\nbacked by object storage. The only filesystem call I had to add to get it\nbooting was `MkdirAll`. I wired up the\n[`transport`](https://pkg.go.dev/github.com/go-git/go-git/v6/plumbing/transport)\npackage to a socket to implement the plaintext `git` protocol, hooked it up to a\nbucket, and pushed the repo I was currently working on.\n\nTo my absolute astonishment, it worked.\n\nGit pushed, pulled, logged, blamed, tagged, the whole kit and caboodle. I didn’t have to implement git myself; I just committed an egregious amount of shoving a square peg into a round hole until the peg went in.\n\nIn hindsight this makes an annoying amount of sense. 
A bare repo is those four\nkinds of things on a filesystem; swap the filesystem for object storage and\neverything else Just Working™ is perfectly logical. Git’s on-disk format *is*\nits database schema and if you fake open/stat/rename convincingly enough the\nentire façade keeps working because APIs are the lies we tell ourselves to make\nus sleep at night.\n\nAfter a lot of hacking, I ended up with a feature list kinda like this:\n\n- Push and pull over three transports: HTTP, classic `git://`, and SSH\n- Repositories upserted on first push\n- Absolutely no effort put into authentication as this is an experiment and authentication is annoying and complicated\n- Prometheus metrics so I could optimize the filesystem layer\n\nEverything comes out of one Go binary with no local state; even the generated SSH keys are stored in the bucket. You can run this in a Kubernetes cluster where the only mutable storage required is temporary files for an optimistic cache when doing smart git clones.\n\nThe rest of this post is what it took to get from “oh no, it works” to something close to usable.\n\nObligatory disclaimer (like the best things in life): this is an experiment. It has not been tested thoroughly or vetted for correctness. If it breaks in half, you get to keep both pieces. Please do not move your company’s monorepo onto this and then email me when it catches fire.\n\n## That one POSIX idiom that survived\n\nGit is paranoid about durability, and its entire strategy is one Unix idiom that\nyou end up seeing in many places: write new data to a temporary file and then\n`rename(2)` it into place after you’ve made sure it’s correct. POSIX guarantees\nthat rename is atomic, so readers either see the old file or the new one, not an\nintermediate state inbetwixt the two. Packfiles (bundles of objects) land as\ntemporary files when uploaded and are then moved to their permanent home. Refs are\nwritten as locked temporary files and then renamed over the ref. 
It’s rename all\nthe way down.\n\nObject storage traditionally does not have rename as one atomic operation. S3’s\nanswer is to create exactly that intermediate state: `CopyObject` to the new\nplace and `DeleteObject` on the old one. This makes the most load-bearing idiom\nin Git’s philosophy fall to pieces.\n\nLuckily, Tigris has an extension for this:\n[`RenameObject`](https://www.tigrisdata.com/docs/objects/object-rename). To use\nit, pass an additional `X-Tigris-Rename: true` header to a `CopyObject` call and\ninstead of copying then deleting on the client, it moves the metadata around on\nthe server. One round trip, no data movement, and the Unix idiom maps onto the\nbucket 1:1. Objgit’s implementation of `Rename` is trivial:\n\n``` go\n// internal/s3fs/basic.go\n\n// RenameObject is a Tigris extension that renames in place (no data copy),\n// so we don't need a separate CopyObject + DeleteObject.\ncopySource := fs3.bucket + \"/\" + src\n_, err := fs3.client.RenameObject(ctx, &s3.CopyObjectInput{\n    Bucket:     &fs3.bucket,\n    CopySource: &copySource,\n    Key:        &dst,\n})\n```\n\nA second, sneakier violation hides in the same codepath. When go-git writes a\ntemporary file, it creates that temporary file and then **immediately** starts\nopening it for reading so it can build the pack index. You cannot do that with a\nsingle live object in any object storage system: you are either reading or\nwriting, never both. I ended up working around this by cheating a bit and\nbuffering the contents of newly written pack files into memory so that this game\nof chicken kept working. I may have to change this to write that pack cache to\nthe filesystem as trying to push `gcc.git` made me run out of RAM. 
At the very\nleast, everything lies consistently enough that git doesn’t care, so win!\n\n## Death by a thousand `stat()` calls\n\nWith this correctness sorted, I tried pushing the\n[golang/go](https://github.com/golang/go) repository to objgit to see how long\nit would take. It did work, but it took *forever*. Using the Prometheus metrics\nI mentioned before, I saw that it was making biblical amounts of `HeadObject`\ncalls. Some blocking profile analysis pointed to the fact that the git library\nwas using the `stat()` call to detect if a file exists. The flow was like:\n\n- Client has object x\n- Check if object x exists\n- Check if any pack has object x\n\nAnd so on ad infinitum. This is fine-ish on a local filesystem because those\nsyscalls resolve in *microseconds*, not the tens of milliseconds it takes to get\nfrom my office to the nearest Tigris region (please expand to Ottawa, I would\nlove that so much).\n\nThis was compounded by the discovery that the transport I was using (SSH;\nclassic `git://` shares the same code path) was exploding every packfile into\n*loose objects* when pushing it. Each loose object write was costing two round\ntrips: `stat()` to check if a file exists and then `open()` / `write()` to\nactually put the data into Tigris. This made a 100,000 object packfile cost\n200,000 object storage calls. Call it 10ms of latency for each one, and that’s\nover half an hour of waiting for responses that mostly say “404 not found”.\n\nCaching can’t really save you here either: read caches would absorb the repeated\nreads, but this is a firehose of *writes* to 100,000 paths that probably have\nnever been read and likely will never be seen again.\n\nThe reason only two transports had this problem is a deadlock story. The git\nlibrary's fast path stores an incoming pack whole through its `PackfileWriter`,\nby copying from the connection until `io.EOF`. 
Over HTTP that's fine: the\nrequest body ends, EOF arrives, everyone goes home. Over `git://` and SSH, the\nconnection is a persistent socket and the client is holding it open, politely\nwaiting for the server's status report. EOF never comes. The copy waits forever,\nthe client waits forever, and you have invented a distributed deadlock with two\nparticipants. The original workaround was to hide the `PackfileWriter`\ncapability on those transports so go-git fell back to its streaming parser that\nwrites every object loose. Hence the stat storm.\n\nSo the solution was to stop depending on EOF at all. Packfiles are\nself-delimiting: the header says how many objects are coming and a trailing\nchecksum marks the end, so a packfile scanner walks the stream and stops at the\ntrailer while a `TeeReader` mirrors exactly those bytes into the\n`PackfileWriter`. This makes the rest of the façade fall into place and the git\nlibrary is happy. It also turned pushes into two uploads (a packfile and its index)\ninstead of a torrent of round trips that mean nothing.\n\n## What about cloning?\n\nOnce I got pushing fixed, I moved on to the read path. In order to emulate\n`ReadAt`, I used ranged `GetObject` requests so that the git library could read\nindividual objects out of packfiles. I was happy with this hack, but there was\none problem: the latency curse struck again. Cloning a simple repo with 318\nobjects and a 200KiB packfile made over 8,500 `GetObject` calls before I killed\nit.\n\nA git client cloning a repository reads repository packfiles thousands of times with random access, walking objects and candidate delta bases over and over. On a local disk you never notice because your page cache eats that for breakfast. When every call is an HTTP request, a 200KiB repo turns into dozens of megabytes of round trips. 
A 20MiB repo was effectively unservable.\n\nIn other words, I had un-cached the one workload that caching was designed to solve.\n\nThe fix leans on a gift from git: pack files are immutable and\ncontent-addressed. `pack-<sha>.pack` will *never* change for as long as it\nexists. This makes them trivially cacheable to a faster local medium, such as\nthe filesystem. No invalidation logic is required. I made objgit download packs\nto a local temporary folder and serve reads from there. To be on the safe side,\nI did add least-recently-used eviction to the mix so that my temp folder wouldn’t\nblow up unexpectedly. This does mean that the first request for a pack file is\nslower, but then everything else is at filesystem speed.\n\nYes, this relies on the local disk again, but only as a cache that can and will be thrown away. I think trading a stateless ideal for clones that terminate in reasonable amounts of time is a worthwhile bargain.\n\n## Why so `ListObjectsV2`, Batman?\n\nOnce the other disasters were out of the way, one more remained: the metrics\nshowed a flood of `ListObjectsV2` calls every time a clone was made, and the\ncalls didn’t stop after the clone was done.\n\nTwo things compounded. First, when git looks up an object that isn't packed, it\nprobes for a loose object at `objects/<xx>/<rest-of-hash>`. objgit keeps packs\nwhole, so there are no loose objects, so every probe misses, and each miss\nacross a distinct two-hex prefix triggered a directory listing to find out.\nThere are 256 possible prefixes. A single clone could issue up to 256\n`ListObjectsV2` calls whose collective answer was a resounding \"there is nothing\nhere.\"\n\nSecond (and more embarrassing), the listing cache already had an optimization\nfor this. It collapsed entire subtree lookups into recursive scans so a single\nlisting of the repository could answer every `stat()` and probe beneath it. It was\ncompletely dead in production. 
The cache matched recursive prefixes against the\nrepo root (`refs/`), but every repo is chrooted to its own directory, so real\nkeys look like `myrepo.git/refs/heads/main`. The prefix check wasn’t aware of\nchroots so it never actually matched anything. Nobody noticed because a cache\nthat degrades to “no caching” still returns the correct answer, just slowly. To\nrub it in, a cache warmer was dutifully re-listing every one of those useless\nprefixes every 30 seconds for 10 minutes after each clone. Thousands of\nbackground list calls were burned in the service of caching nothing of use.\n\nThe fix was insultingly small: when a repo’s filesystem gets chrooted, register\nthat chroot as a recursive subtree root within the cache. This made the cache\nactually useful and resulted in only one `ListObjectsV2` call instead of\nhundreds. Every sufficiently advanced cache is indistinguishable from a no-op\nuntil someone graphs the miss rate.\n\nNone of these disasters were exotic. They’re the things filesystems and kernels\ngive you for free, and every perfectly reasonable disk assumption fell to\npieces once a network round trip sat at the core. Serving Git repositories is an\naccidental filesystem latency benchmark. If your storage abstraction has a weak\npoint, Git *will* find it and the metrics will show you where that problem is.\n\n## Post-receive hooks go in clown jail\n\nOne of the most useful parts of hosting your own git server is setting up\npost-receive hooks. These have been used since time immemorial for things like\n[automatic deployments](https://gist.github.com/nonbeing/f3441c96d8577a734fa240039b7113db)\nwhen you push code to the server. 
This is the heart of how we get systems like\nGitHub Actions: it’s code that runs when you are done pushing.\n\nWhen you push to objgit with `--allow-hooks` enabled, it looks for a\npost-receive hook in `.objgit/hooks/receive-pack` (this corresponds to the git\nplumbing action, the name can and will be changed) in the tree of the commit you\njust pushed. It will then spin up a\n[kefka](https://xeiaso.net/blog/2026/dancing-mad-sandboxing/) sandbox with a\ncheckout of the git repository at the commit you just pushed mounted at `/src`\nand mutable temporary files at `/tmp`. It gets coreutils and nothing else. No\nhost filesystem, no network, no arbitrary binaries. Output streams back to the\npusher as `remote:` lines just like when you `git push heroku main`. Eventually\nI want to make custom commands to allow you to deploy\n[Tekton](https://tekton.dev/) pipeline changes and kick off CI jobs that way,\nbut for now I’m happy with this working at all.\n\nYou can’t implement policy using these hooks yet. I’m working on it.\n\n## Now what?\n\nI taught a bucket to speak git. Here is where this goes next, roughly in order of how much the ideas keep me up at night.\n\nCI is the obvious next step. I would wire up commands for things like “apply kubernetes object” and “create tekton pipeline run” so that CI would run via your friendly neighborhood Kubernetes cluster and then notify you through some reasonable mechanism. That’s the first thing I’ll build when I have the time.\n\nIt would be nice to have a web UI for this, which is complicated for reasons\nthat have *nothing* to do with git trees, object storage, or anything else and\neverything to do with the current state of the internet. Git lookups are\nexpensive in the best cases and with the current torrent of unethical scraping\nransacking git servers for every scrap of RAM they have, it’s probably a bad\nidea to implement this without a lot of clever optimizations. 
Maybe the fact\nthat this doesn’t have load-bearing dependencies on `/usr/bin/git` would make it\nmore resilient against scrapers. The fact that this is based on object storage\ncould also mean that caching would be a bit easier (having basically unlimited\nstorage is kind of a low-key superpower for caching), but then the main issue\nwould be server load. It’s a tough cookie to handle.\n\nPerformance and stability are another area where this needs to improve. I’ve tested\nthis on my developer workstation but that is far different from testing it in\nproduction. There are some other performance issues that are easy to fix, but the\nbig one is latency to Tigris. Maybe I can get the devops team to set me up a\n[k3k cluster](https://rancher.github.io/k3k-product-docs/k3k/latest/en/introduction.html)\nin production.\n\nRight now this is an experiment as I plug along and feel out the shape of what\ngit-on-object-storage can be. A git server with no disk, no git binary, and no\ndatabase. If you want to take a look,\n[check it out on GitHub](https://github.com/tigrisdata/objgit).\n\nobjgit is a single-binary git server that stores repositories directly in Tigris: no disk, no git binary, no database. Point it at a bucket and push.", "url": "https://wpnews.pro/news/i-taught-a-bucket-to-speak-git", "canonical_source": "https://www.tigrisdata.com/blog/objgit/", "published_at": "2026-06-24 16:04:26+00:00", "updated_at": "2026-06-24 16:38:54.570059+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure"], "entities": ["Tigris", "go-git", "billy", "Git"], "alternates": {"html": "https://wpnews.pro/news/i-taught-a-bucket-to-speak-git", "markdown": "https://wpnews.pro/news/i-taught-a-bucket-to-speak-git.md", "text": "https://wpnews.pro/news/i-taught-a-bucket-to-speak-git.txt", "jsonld": "https://wpnews.pro/news/i-taught-a-bucket-to-speak-git.jsonld"}}