RFC: Creating git-remote-restic

CGamesPlay · January 23, 2021, 9:51am

Hi all, I’ve been a happy user of restic for system backups for several months now. I really like the conceptual simplicity and very smooth user interface restic presents. I’d like to discuss the idea of a git-remote-restic tool. That is, a new type of git remote that stores the commits in a restic repository.

This post has my initial research on what creating such a remote would require, and I have two specific questions:

Is anything that fills this niche currently under development? I’d hate to start a new project when a suitable one already exists.
If not, are there any obvious problems with my design? I’m passingly familiar with the restic and git internals, but I’m by no means an expert. Feedback is very much appreciated.

Why?

Git, on its own, does not provide a feature to use an untrusted remote. Any remote in the git model has full visibility into the content and history of a repository. There isn’t (as far as I can find) any existing software that fits at the intersection of these:

Zero-knowledge. An untrusted third party can be an intermediary between trusted contributors.
Transparent. The software doesn’t introduce any new concepts/workflows for the users.

There are a few candidates, but they have problems:

GPG-based. git-gpg, git-remote-gcrypt, and likely others rely on GPG, which greatly increases the installation and usage complexity.
Keybase. git-remote-keybase theoretically provides these guarantees, but is closed-source and run by a single company.

How?

A custom git remote helper will be created (git-remote-restic) which will allow “restic URLs” to be used as normal git remotes. For example:

git remote add mypeer restic::s3:s3.amazonaws.com/my.bucket/git/repo
git push mypeer master

Git recognizes URLs of this form and can call out to external programs: see gitremote-helpers. git-remote-restic will be required to parse the git remote protocol and operate on an underlying restic repository.
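As a rough illustration of the dispatch rule from gitremote-helpers: for a URL of the form `<transport>::<address>`, git runs `git-remote-<transport>` and passes the remote name and `<address>` as arguments. The function name `helper_invocation` is made up for this sketch, and it only mimics the `::` form rather than git's real URL handling.

```python
def helper_invocation(remote_name, url):
    """Return the argv git would use for a `<transport>::<address>` URL.

    Only the `::` form is handled; this is an illustration, not git's parser.
    """
    transport, sep, address = url.partition("::")
    if not sep or not transport or not address:
        raise ValueError(f"not a <transport>::<address> URL: {url!r}")
    return [f"git-remote-{transport}", remote_name, address]


argv = helper_invocation("mypeer", "restic::s3:s3.amazonaws.com/my.bucket/git/repo")
print(argv)
```

The helper therefore never sees the `restic::` prefix; what remains is an ordinary restic repository location.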

Storing objects. Git internally uses a content-addressable filesystem similar to restic. There are 4 object types in git which are stored in restic as follows:

Blob. A git blob is the content of a tracked file. It is stored as a file node in a restic tree, with all of the metadata fields set to zero (git does not track file times, owners, etc.).
Tree. A git tree is a single directory, similar to a restic tree. Git trees are stored as restic trees.
Commits and tags. A git commit is a text file that describes the commit message, author, and dates. It is stored as a single blob in restic, with the IDs of the tree and parent commits rewritten to restic IDs. Annotated git tags are stored similarly to commits.

Using this structure, we can take advantage of restic’s content deduplication, and trees can be browsed (and even restored) using standard restic tools.
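To make the two addressing schemes concrete, here is a minimal sketch of how each system derives an ID for the same bytes. Git hashes a small header plus the body with SHA-1, while restic identifies a blob by the SHA-256 of its plaintext. The function names are invented for this example, and it assumes the content is small enough to be a single restic blob.

```python
import hashlib


def git_object_id(obj_type, body):
    """Git's object ID: SHA-1 over '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def restic_blob_id(body):
    """restic identifies a blob by the SHA-256 of its plaintext content."""
    return hashlib.sha256(body).hexdigest()


content = b"hello\n"
print(git_object_id("blob", content))  # ce013625030ba8dba906f756967f9e9ca394464a
print(restic_blob_id(content))         # a 64-character hex string
```

Because the two IDs are computed differently, any object that refers to another by ID (trees, commits, tags) has to be translated when it crosses from one store to the other.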

The documentation says that “Blobs are of 512 KiB to 8 MiB in size”. Is this a hard limitation? If so, this limitation will be imposed on commit and tag metadata files as well. The upper bound is likely not a problem, but forcing commit messages to be padded to 512 KiB will be very inefficient.

Storing refs (branches). The refs in git are stored as a list of names mapping to object IDs. This means that with each push to a branch or tag, this data changes. In restic, a snapshot with a git-remote-restic tag is used to store this data. If multiple snapshots have this tag, the one with the most recent timestamp is preferred. The tree of the snapshot is modeled after the .git/refs folder structure, except that the content of each ref file points directly to the commit/tag blob object (instead of being a text file containing its ID). When new refs are pushed, an exclusive lock is taken on the repository and this snapshot is updated to point to the new tree.
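The "most recent tagged snapshot wins" rule could be sketched as follows. The snapshot records are hypothetical dictionaries with `id`, `time`, and `tags` keys, loosely shaped like JSON snapshot listings; the IDs are placeholders, and the timestamps are simplified to second precision so that `datetime.fromisoformat` can parse them on any recent Python.

```python
from datetime import datetime

REFS_TAG = "git-remote-restic"  # tag name taken from the proposal


def pick_refs_snapshot(snapshots):
    """Return the newest snapshot carrying REFS_TAG, or None if there is none."""
    tagged = [s for s in snapshots if REFS_TAG in (s.get("tags") or [])]
    if not tagged:
        return None
    return max(tagged, key=lambda s: datetime.fromisoformat(s["time"]))


# Hypothetical snapshot records; the IDs are placeholders, not real hashes.
snapshots = [
    {"id": "aaa111", "time": "2021-01-20T10:00:00+00:00", "tags": ["git-remote-restic"]},
    {"id": "bbb222", "time": "2021-01-23T09:51:00+00:00", "tags": ["git-remote-restic"]},
    {"id": "ccc333", "time": "2021-01-24T08:00:00+00:00", "tags": ["system"]},
    {"id": "ddd444", "time": "2021-01-25T08:00:00+00:00"},
]
print(pick_refs_snapshot(snapshots)["id"])  # bbb222
```

Untagged or differently tagged snapshots are ignored, so ordinary backups could share the repository without being mistaken for the refs snapshot.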

Usage. The three operations required of git-remote-restic are list, fetch, and push. These operations loosely involve the following:

List must produce a list of known refs in the remote repository. This operation will scan the restic snapshots, find the proper one, download its tree, and produce the output. The SHAs are not required to be known at this stage, so no actual repository data is downloaded.
Push accepts a list of local refs and must replicate them to the remote repository. This is a straightforward application of the mapping process described above.
Fetch accepts a list of remote refs and must replicate them to the local repository. SHAs are calculated for objects at the same time they are downloaded.
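The command loop behind these three operations might look roughly like the sketch below. It follows the line-oriented protocol in gitremote-helpers: `capabilities` and `list` answers end with a blank line, `list` may report `?` for a ref whose ID is not known yet, and `fetch`/`push` commands arrive in batches terminated by a blank line. The `run_helper` function, its `refs` dictionary, and the `fetch`/`push` callbacks are assumptions made for this example; the actual object transfer is left to the callbacks.

```python
import io


def run_helper(stdin, stdout, refs, fetch, push):
    """Serve remote-helper commands from stdin until EOF or a blank line.

    refs maps ref names to a hex object ID, or to None when the ID is unknown.
    fetch(sha, name) and push(src, dst, force) are caller-supplied callbacks.
    """
    line = stdin.readline()
    while line:
        cmd = line.rstrip("\n")
        if cmd == "":
            break
        if cmd == "capabilities":
            stdout.write("fetch\npush\n\n")
        elif cmd == "list" or cmd.startswith("list "):
            for name, sha in refs.items():
                stdout.write(f"{sha or '?'} {name}\n")
            stdout.write("\n")
        elif cmd.startswith(("fetch ", "push ")):
            # Collect the whole batch, which ends at the next blank line.
            batch = [cmd]
            nxt = stdin.readline().rstrip("\n")
            while nxt:
                batch.append(nxt)
                nxt = stdin.readline().rstrip("\n")
            for entry in batch:
                verb, _, rest = entry.partition(" ")
                if verb == "fetch":
                    sha, _, name = rest.partition(" ")
                    fetch(sha, name)
                else:
                    force = rest.startswith("+")  # "+src:dst" requests a forced update
                    src, _, dst = rest.lstrip("+").partition(":")
                    try:
                        push(src, dst, force)
                        stdout.write(f"ok {dst}\n")
                    except Exception as err:
                        stdout.write(f"error {dst} {err}\n")
            stdout.write("\n")
        else:
            raise ValueError(f"unsupported command: {cmd!r}")
        line = stdin.readline()


fetched = []
refs = {"refs/heads/master": "0123456789abcdef0123456789abcdef01234567",  # placeholder ID
        "refs/heads/dev": None}
stdin = io.StringIO("capabilities\nlist\nfetch 0123456789abcdef0123456789abcdef01234567 refs/heads/master\n\n")
stdout = io.StringIO()
run_helper(stdin, stdout, refs,
           fetch=lambda sha, name: fetched.append(name),
           push=lambda src, dst, force: None)
print(stdout.getvalue())
```

A real helper would read `sys.stdin`, write `sys.stdout`, and flush after every response, since git waits for each answer before sending the next command.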

Conclusion

Thanks for reading this far. I’m going to take a stab at this myself, but I’d love feedback on the idea or plan for implementation.

MichaelEischer · January 24, 2021, 7:33pm

I’m not aware of any such project, but I haven’t been searching for anything like that either. I’m not entirely sure that the restic and git data storages can be properly mapped onto one another.

You’ll probably need some way to quickly map git tree IDs to restic tree IDs as these will be different.
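One way such a mapping might be used, sketched under assumptions: a dictionary from git IDs to restic IDs is consulted while rewriting the `tree` and `parent` header lines of a commit object. The function name `rewrite_commit_ids` and all IDs below are placeholders invented for this example; a real tool would need a persistent, fast index rather than an in-memory dict.

```python
def rewrite_commit_ids(commit_text, id_map):
    """Replace the IDs on `tree` and `parent` header lines using id_map.

    Headers end at the first blank line; the commit message is left untouched.
    Raises KeyError if a referenced object has not been mapped yet.
    """
    headers, sep, message = commit_text.partition("\n\n")
    rewritten = []
    for line in headers.split("\n"):
        key, _, value = line.partition(" ")
        if key in ("tree", "parent"):
            line = f"{key} {id_map[value]}"
        rewritten.append(line)
    return "\n".join(rewritten) + sep + message


GIT_TREE, GIT_PARENT = "a" * 40, "b" * 40            # placeholder git IDs
id_map = {GIT_TREE: "1" * 64, GIT_PARENT: "2" * 64}  # placeholder restic IDs
commit = (
    f"tree {GIT_TREE}\n"
    f"parent {GIT_PARENT}\n"
    "author A U Thor <author@example.com> 1611395460 +0000\n"
    "committer A U Thor <author@example.com> 1611395460 +0000\n"
    "\n"
    "tree shaking: a message line that must not be rewritten\n"
)
print(rewrite_commit_ids(commit, id_map))
```

The `KeyError` on an unmapped ID reflects an ordering constraint: trees and parent commits have to be stored, and their new IDs recorded, before the commit that refers to them.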

restic also expects to have snapshots which serve as root nodes to access data and for garbage collection. The restic snapshots would probably map to git commits. However, restic is currently not really optimized for managing thousands or tens of thousands of snapshots (and therefore commits). If I’m not mistaken, the hash of a git commit depends on the content of the commit object. So you’ll also need some way to map commit IDs to snapshots, whose ID is the hash of the snapshot object (which by construction and encryption differs from the commit objects).

Files smaller than 512 KiB are just stored as is. There’s no reason to add padding to those.

AFAIK git stores deltas for files which only contain small changes. There’s no equivalent for that in restic. So I’d expect the repository to be quite a bit larger than the native git format.

CGamesPlay · January 25, 2021, 6:24am

Hi Michael, thanks for replying! I spent some time over the weekend building a prototype, and after reading the code of git-remote-keybase I found a much simpler solution that doesn’t require touching the restic “internals”, which is to treat the restic snapshot as a bare git repository. So, here’s a naive implementation of the program, but feel free to skip to the end for actual restic questions:

To push to a repository, create a “bare” git repository somewhere and have git push to that. Create a restic snapshot of this bare repository.
To pull from a repository, restore the latest snapshot to a local directory. Have git pull from this local directory.
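The two steps above can be spelled out as command lists. This sketch only builds the commands and does not run them; the function names and paths are hypothetical, and it assumes `bare_dir` is an absolute path that restic recreates underneath the restore target. Note that a later push would need to restore the previous snapshot first rather than start from a fresh bare repository.

```python
def naive_push_commands(restic_repo, bare_dir, ref):
    """Commands for the toy push: push into a bare repo, then snapshot it."""
    return [
        ["git", "init", "--bare", bare_dir],
        ["git", "push", bare_dir, ref],
        ["restic", "-r", restic_repo, "backup", bare_dir],
    ]


def naive_pull_commands(restic_repo, target_dir, bare_dir, ref):
    """Commands for the toy pull: restore the latest snapshot, then pull from it."""
    # Assumption: the backed-up absolute path reappears below the target directory.
    restored = target_dir.rstrip("/") + "/" + bare_dir.lstrip("/")
    return [
        ["restic", "-r", restic_repo, "restore", "latest", "--target", target_dir],
        ["git", "pull", restored, ref],
    ]


for cmd in naive_push_commands("s3:s3.amazonaws.com/my.bucket/git/repo", "/tmp/bare.git", "master"):
    print(" ".join(cmd))
for cmd in naive_pull_commands("s3:s3.amazonaws.com/my.bucket/git/repo", "/tmp/restore", "/tmp/bare.git", "master"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` to execute the flow for real.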

This toy solution will produce the same result as an integrated solution, but it does not handle pushing from multiple sources (doing this properly requires acquiring an exclusive lock on the repository), and is very unoptimized. I’ve got a more detailed description of the integrated process written down, but it has relatively little to do with restic so I think it would be off-topic here (if anyone is interested, please contact me).

My proof-of-concept is presently able to do the pulling half of the equation, which it accomplishes by mounting a restic snapshot in a VFS that go-git can read, which is then used to populate the local repository. My next step will be to create a VFS that can build a restic snapshot.

I had to fork restic to build this, because everything lives in restic/internal. My fork simply renames this directory to restic/lib (and updates internal references), with no other changes.

What are the plans for exposing an API for restic? Is this something I could help work towards?

As a consumer of this API, I’ve only needed to access restic/internal/restic, except for when I actually instantiate the repo (there, I need to access the backends, limiter, restic/internal/repository, and a few others). It certainly seems like restic/internal/restic is gearing up to be the public API of restic. I’ve found rough edges around instantiating repos, and I’m sure that writing snapshots will expose more, and generally things in this package need to be documented. Is this something you’d be amenable to pull requests on?

MichaelEischer · January 26, 2021, 8:19pm

There are currently no plans to provide a stable interface besides the existing CLI (and its JSON output) of restic. restic/internal/restic contains both data structures (which will only change for a new repository version) and lots of interfaces and helpers. The latter parts require changes every now and then, such that maintaining a stable API there would cost too much effort. So, the answer in that regard is pretty much the same as in restic/restic GitHub issue #1089, “Question: use restic as a library”.

Adding a bit more documentation to the source code sounds reasonable. Could you be a bit more specific about what you think should be documented? I’d like to keep the amount of documentation manageable; after all, the documentation also needs maintenance, and outdated documentation could be even worse than having none at all.

CGamesPlay · February 1, 2021, 4:23pm

OK, so I’ve published a working version of git-remote-restic, which is able to use git push to create a new snapshot in a restic repository, where the snapshot’s tree is a bare git repository. Similarly, it can use git pull to clone directly out of such a restic repository without requiring a separate step.

I understand that you aren’t interested in exposing a programmatic API to access restic repositories at this time. Still, I’m including my notes on my experiences using the restic API programmatically, so feel free to completely ignore the rest of this post. However, the notes include things that might be useful to future restic developers, and it might be worth adding extra documentation or checks about these things:

In general, someone interested in using restic as a library is mostly able to get along using exclusively restic/internal/restic. If a public API were exposed, it should probably be this package.
However, the Go compiler prohibits importing restic/internal/restic from outside the restic module, so a fork is required to rename this package to something else. I chose restic/lib/restic, although I think restic/pkg/restic might be more idiomatic?
The contents of git-remote-restic/cmd/git-remote-restic/restic.go are almost verbatim extracted from restic/cmd/restic/global.go. I think the openResticBackend function is an excellent candidate to be moved into restic/internal/restic.
restic/cmd/restic/lock.go is useful to anyone using restic as a library, therefore this code should be moved into restic/internal/restic as well. The cleanup code would likely need to remain per-application since it’s effectively dealing with the restic command’s global state.
Successfully opening a repository involves calling LoadIndex after SearchKey has succeeded. This tripped me up when I was developing; documentation might be nice. Failure to call LoadIndex results in “id not found” errors, so a debug assertion may also be useful to future developers.
Successfully creating a snapshot requires calling Flush before closing. Failure to do so results in no errors and an invalid snapshot. An assertion in Close to cause an error if there are remaining uncommitted chunks may be useful here.
Even with restic/internal/restic, actually reading or writing from repositories is not possible without knowledge of the internals of restic. Exposing a VFS like spf13/afero should certainly be a part of any public restic API. The restic/internal/fuse package is a good starting point for a read-only version of one. My git-remote-restic/pkg/resticfs package could be a starting point for a writable version; it’s not yet sufficient for general usage but does allow reading from snapshots and creating new ones (it targets go-billy, which is less popular than afero but used by go-git).
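The two guard checks suggested in the notes above (fail loudly when reading before the index is loaded, and when closing with unflushed data) can be illustrated with a toy model. `ToyRepository` and all of its methods are invented for this sketch; it is not restic's API and only shows the shape of the proposed assertions.

```python
class ToyRepository:
    """Stand-in for a repository handle; NOT restic's real API."""

    def __init__(self):
        self.index_loaded = False
        self.pending = []   # (id, data) pairs saved but not yet flushed
        self.stored = {}

    def load_index(self):
        self.index_loaded = True

    def load_blob(self, blob_id):
        # Debug assertion: a pointed message instead of a later "id not found".
        assert self.index_loaded, "load_index() must be called before reading"
        return self.stored[blob_id]

    def save_blob(self, blob_id, data):
        self.pending.append((blob_id, data))

    def flush(self):
        self.stored.update(self.pending)
        self.pending.clear()

    def close(self):
        # Refuse to close silently while data is still uncommitted.
        if self.pending:
            raise RuntimeError(f"{len(self.pending)} unflushed blob(s); call flush() first")


repo = ToyRepository()
repo.save_blob("id1", b"data")
try:
    repo.close()
except RuntimeError as err:
    print(err)  # 1 unflushed blob(s); call flush() first
repo.flush()
repo.close()
```

Turning a silent failure into an immediate error at `close` is the main point; whether that belongs in a debug build only is a separate design choice.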

As always, thanks for reading. If you are interested in any of the changes I’m proposing here, then I’d be interested in helping to write a PR for them.