Hi all, I’ve been a happy user of restic for system backups for several months now. I really like the conceptual simplicity and very smooth user interface restic presents. I’d like to discuss the idea of a git-remote-restic tool. That is, a new type of git remote that stores the commits in a restic repository.
This post has my initial research on what creating such a remote would require, and I have two specific questions:
- Is anything that fills this niche presently under development? I’d hate to start a new project when a suitable one already exists.
- If not, are there any obvious problems with my design? I’m passingly familiar with the restic and git internals, but I’m by no means an expert. Feedback is very much appreciated.
Why?
Git, on its own, does not provide a feature to use an untrusted remote. Any remote in the git model has full visibility into the content and history of a repository. There isn’t (as far as I can find) any existing software that sits at the intersection of these properties:
- Zero-knowledge. An untrusted third party can be an intermediary between trusted contributors.
- Transparent. The software doesn’t introduce any new concepts/workflows for the users.
There are a few candidates, but they have problems:
- GPG-based. git-gpg, git-remote-gcrypt, and likely others rely on GPG, which greatly increases the installation and usage complexity.
- Keybase. git-remote-keybase theoretically provides these guarantees, but is closed-source and run by a single company.
How?
A custom git remote helper will be created (git-remote-restic) which will allow “restic URLs” to be used as normal git remotes. For example:
git remote add mypeer restic::s3:s3.amazonaws.com/my.bucket/git/repo
git push mypeer master
Git recognizes URLs of this form and can call out to external programs: see gitremote-helpers. git-remote-restic will be required to parse the git remote protocol and operate on an underlying restic repository.
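To make the invocation concrete, here is a rough sketch in Python (chosen only for illustration; `HelperArgs` and `parse_helper_args` are hypothetical names, not part of git or restic). For a remote configured with a URL of the form `restic::<address>`, git runs `git-remote-restic` with the remote name as the first argument and `<address>` as the second.

```python
from typing import NamedTuple


class HelperArgs(NamedTuple):
    remote_name: str  # e.g. "mypeer"
    restic_repo: str  # e.g. "s3:s3.amazonaws.com/my.bucket/git/repo"


def parse_helper_args(argv: list[str]) -> HelperArgs:
    """Interpret the arguments git passes to a remote helper.

    argv[0] is the program name, argv[1] the remote name and argv[2] the
    address. If only one argument is given, it is used for both.
    """
    if len(argv) < 2:
        raise SystemExit("usage: git-remote-restic <remote> [<url>]")
    remote = argv[1]
    url = argv[2] if len(argv) > 2 else argv[1]
    # Defensive assumption: tolerate a leftover "restic::" prefix.
    if url.startswith("restic::"):
        url = url[len("restic::"):]
    return HelperArgs(remote, url)
```

With the example above, `parse_helper_args(["git-remote-restic", "mypeer", "s3:s3.amazonaws.com/my.bucket/git/repo"])` would yield the remote name `mypeer` and the restic repository location unchanged.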
Storing objects. Git internally uses a content-addressable filesystem similar to restic. There are 4 object types in git which are stored in restic as follows:
- Blob. A git blob is the content of a tracked file. It is stored as a file node in a restic tree, with all of the metadata fields set to zero (git does not track file times, owners, etc.).
- Tree. A git tree is a single directory, similar to a restic tree. Git trees are stored as restic trees.
- Commits and tags. A git commit is a text file that describes the commit message, author, and dates. It is stored as a single blob in restic, with the IDs of the tree and parent commits rewritten to restic IDs. Annotated git tags are stored similarly to commits.
Using this structure, we can take advantage of restic’s content deduplication, and trees can be browsed (and even restored) using standard restic tools.
- The documentation says that “Blobs are of 512 KiB to 8 MiB in size”. Is this a hard limitation? If so, this limitation will be imposed on commit and tag metadata files as well. The upper bound is likely not a problem, but forcing commit messages to be padded to 512 KiB will be very inefficient.
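The ID mapping described above can be sketched roughly as follows. This assumes git's default SHA-1 object format (the hash covers a `<type> <size>` header, a NUL byte, and the content) and that restic identifies a blob by the SHA-256 of its plaintext. The function names and the `id_map` lookup are hypothetical, and the real helper would have to build that lookup while walking the object graph.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over the "<type> <size>" header, a NUL byte, and the content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def restic_blob_id(content: bytes) -> str:
    """SHA-256 of the plaintext content (assumed restic blob ID)."""
    return hashlib.sha256(content).hexdigest()


def rewrite_commit(commit: bytes, id_map: dict[str, str]) -> bytes:
    """Replace git IDs on 'tree' and 'parent' header lines with restic IDs.

    id_map is a hypothetical lookup from git SHA-1 to restic ID. The message
    body (everything after the first blank line) is left untouched.
    """
    headers, sep, body = commit.partition(b"\n\n")
    out = []
    for line in headers.split(b"\n"):
        key, _, value = line.partition(b" ")
        if key in (b"tree", b"parent"):
            line = key + b" " + id_map[value.decode()].encode()
        out.append(line)
    return b"\n".join(out) + sep + body
```

One consequence worth noting: because the stored commit no longer contains git IDs, the rewrite would presumably have to be reversed on fetch before the original git SHA can be recomputed.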
Storing refs (branches). The refs in git are stored as a list of names mapping to object IDs. This means that with each push to a branch or tag, this data changes. In restic, a snapshot with a git-remote-restic tag is used to store this data. If multiple snapshots have this tag, the one with the most recent timestamp is preferred. The tree of the snapshot is modeled after the .git/refs folder structure, except that each ref file points directly to the commit/tag blob object (instead of being a text file listing that ID). When new refs are pushed, an exclusive lock is taken on the repository and this snapshot is modified to point to the new tree.
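The "most recent tagged snapshot wins" rule could look something like this. The `Snapshot` class is a simplified, hypothetical stand-in for whatever restic's snapshot listing returns, not restic's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

REFS_TAG = "git-remote-restic"


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified view of a restic snapshot.
    id: str
    time: datetime
    tree: str
    tags: tuple[str, ...] = ()


def current_refs_snapshot(snapshots: list[Snapshot]) -> Optional[Snapshot]:
    """Return the newest snapshot carrying REFS_TAG, or None if there is none."""
    tagged = [s for s in snapshots if REFS_TAG in s.tags]
    return max(tagged, key=lambda s: s.time, default=None)


# Usage: the untagged snapshot is ignored even though it is newer.
snaps = [
    Snapshot("a", datetime(2024, 1, 1, tzinfo=timezone.utc), "tree-a", (REFS_TAG,)),
    Snapshot("b", datetime(2024, 1, 3, tzinfo=timezone.utc), "tree-b", ("other",)),
    Snapshot("c", datetime(2024, 1, 2, tzinfo=timezone.utc), "tree-c", (REFS_TAG,)),
]
print(current_refs_snapshot(snaps).id)
```

Returning `None` for a repository without any tagged snapshot lets the helper treat it as an empty remote rather than an error.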
Usage. The three operations required of git-remote-restic are list, fetch, and push. These operations loosely involve the following:
- List must produce a list of known refs in the remote repository. This operation will scan the restic snapshots, find the proper one, download its tree, and produce the output. The SHAs are not required to be known at this stage, so no actual repository data is downloaded.
- Push accepts a list of local refs and must replicate them to the remote repository. This is a straightforward application of the mapping process described above.
- Fetch accepts a list of remote refs and must replicate them to the local repository. SHAs are calculated for objects at the same time they are downloaded.
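As a minimal sketch of how these three operations plug into the remote helper protocol: git writes commands to the helper's stdin one per line, `fetch` and `push` arrive in batches terminated by a blank line, and each response ends with a blank line. The `serve` function and its three callbacks are hypothetical placeholders for the restic-backed logic. Values returned by `list_refs` are written verbatim, so it could return `?` where an ID is not yet known, though whether git accepts that alongside the `fetch` capability would need checking.

```python
from typing import Callable, TextIO


def serve(stdin: TextIO, stdout: TextIO,
          list_refs: Callable[[], dict[str, str]],
          fetch: Callable[[list[tuple[str, str]]], None],
          push: Callable[[str, str], None]) -> None:
    """Minimal remote-helper command loop over line-based stdin/stdout."""

    def read_batch(first: str, verb: str) -> list[str]:
        # fetch/push commands arrive in a batch terminated by a blank line.
        batch = [first]
        for line in stdin:
            line = line.rstrip("\n")
            if not line:
                break
            batch.append(line)
        return [entry[len(verb) + 1:] for entry in batch]

    for raw in stdin:
        cmd = raw.rstrip("\n")
        if not cmd:
            break  # a blank line is treated here as the end of the session
        if cmd == "capabilities":
            stdout.write("fetch\npush\n\n")
        elif cmd == "list" or cmd.startswith("list "):
            for name, value in list_refs().items():
                stdout.write(f"{value} {name}\n")
            stdout.write("\n")
        elif cmd.startswith("fetch "):
            # Each entry is "<sha> <refname>".
            wants = [tuple(a.split(" ", 1)) for a in read_batch(cmd, "fetch")]
            fetch(wants)
            stdout.write("\n")
        elif cmd.startswith("push "):
            # Each entry is "[+]<src>:<dst>"; the force flag is ignored here.
            for spec in read_batch(cmd, "push"):
                src, _, dst = spec.lstrip("+").partition(":")
                try:
                    push(src, dst)
                    stdout.write(f"ok {dst}\n")
                except Exception as exc:
                    stdout.write(f"error {dst} {exc}\n")
            stdout.write("\n")
        else:
            raise SystemExit(f"unsupported command: {cmd}")
        stdout.flush()
```

In a real helper this would be called as `serve(sys.stdin, sys.stdout, ...)`. Passing the streams in explicitly keeps the loop testable with in-memory buffers.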
Conclusion
Thanks for reading this far. I’m going to take a stab at this myself, but I’d love feedback on the idea or plan for implementation.