Git-annex to manage a restic repo?

I’ve been reading up on restic over the last week or so and am thinking of switching. I’m wondering if anyone has tried using restic this way, or if it doesn’t make sense at all…

I would like to find a way to offload most of my repo to S3 Glacier Deep Archive for the cost savings. Reading the forums, it seems like this does not work well at the moment. I also would not mind having multiple copies of the repo.

Would it work to create a local restic repo and commit it into git-annex in order to push it to Glacier (and maybe other places, e.g. a NAS)? I could then even drop some of the data objects locally (assuming I have another copy, e.g. on the NAS) and get the speed of local incremental/deduplicated backups with the ability to offload to Glacier. Or is this a terrible idea?

I am not familiar with git-annex, but if the homepage gave me the right idea, this could be a bit hard with restic repositories. What you’ll see in the repository itself is hashed filenames with encrypted contents (e.g. data/de/def36db6e32633ff7ec9fb42f12258fb58d41f540a4f1b33f468762e9da6bb35), so it looks like your only choice is to make an annex commit after each backup, to know which files belong to which backup. Even then, you won’t be able to differentiate completely, since restic does deduplication:

  • You back up something; restic puts some data files in the repo.
  • You back up something else, which deduplicates against the first backup, so restic writes new data, but the new snapshot also needs some old files under data/, which is not visible to you since those files have not changed (see the sketch after this list).
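A minimal sketch of that invisibility, with hypothetical paths (and RESTIC_PASSWORD assumed to be set):

restic -r /tmp/repo init
restic -r /tmp/repo backup ~/docs             # first backup writes packs under data/
find /tmp/repo/data -type f | sort > before.txt
restic -r /tmp/repo backup ~/docs ~/music     # second backup dedups against the first
find /tmp/repo/data -type f | sort > after.txt
diff before.txt after.txt   # lists only the *new* packs; the new snapshot's
                            # reuse of old packs is invisible at the file level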

(I might not be understanding the plan or git-annex well, so consider my answer theoretical mumbling.)

So I tested this out and can confirm that it ‘works’, at least at the surface level. I also see some references to people using this workflow with borg, but they are sparse on details, e.g. Backing up with borg and git-annex | Blog of Julian Andres Klode and suggest remote backup storage options? · Issue #2177 · borgbackup/borg · GitHub. Oddly, git-annex has official support for the opposite approach (committing files to git-annex and then storing the repo in borg) rather than the other way around, which is what I am discussing.

Here is the workflow, and what happens. Note that the restic repo still ‘looks’ normal:

mkdir my_repo
cd my_repo
restic -r . init

git init
git annex init
git annex add .
git commit -m "initial commit"

Now, the repo looks like a normal restic repo. The only difference is that every file is a symlink to another file under .git/annex/objects. But basic stuff seems to work:

restic stats
restic backup /some/files

git annex add .
git commit -m update
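To make the git-annex indirection concrete, a symlink looks roughly like this (illustrative; the actual key names and sizes will differ):

ls -l config
# config -> .git/annex/objects/<d1>/<d2>/SHA256E-s<size>--<sha256>/SHA256E-s<size>--<sha256>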

Now the fun part: create a second annex repo to act as the backup, following the git-annex walkthrough.
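A sketch of that setup, assuming the second repo lives on a USB drive (the remote name usbdrive and all paths are illustrative):

git clone /path/to/my_repo /media/usbdrive/my_repo       # clone the annex onto the drive
(cd /media/usbdrive/my_repo && git annex init usbdrive)   # initialize/describe the new repo
git remote add usbdrive /media/usbdrive/my_repo           # register it as a remote locally
git annex sync usbdrive                                   # exchange git metadata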

# if you run git annex move . --to usbdrive, things break, since the repo's config file is no longer present locally
git annex move data/ --to usbdrive

restic check # still works!

# Everything still seems to work... confirmed incremental deduping works too
restic stats
restic backup /some/files
git annex add .
git commit -m update

After the git annex move data/ --to usbdrive, the data/ folder still contains symlinks, but they are broken (until the data is restored at a later point). My initial testing indicates, though, that this data is not needed for restic to actually work, do backups, etc.; I assume it would only be needed to restore from the repo or to check its integrity (see the sketch below).
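Presumably, then, a restore or a full integrity check would first need the pack data fetched back from a remote that has it, something like this (untested sketch, run inside the repo as above):

git annex get data/              # re-fetch pack contents; the symlinks resolve again
restic check --read-data         # full integrity check is possible again
restic restore latest --target /tmp/restore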

1 Like

I genuinely have no idea what the goal is behind this approach, or what the perceived benefit will be. Restic’s repository structure is actually very close to git itself, so this seems a bit like using git to version a git repository. What is the point?

2 Likes

The point would be to encrypt/back up once and distribute that single backup to potentially multiple offline destinations, e.g. use a fast local drive or NAS for the backup repo, and then replicate that to Glacier, which restic (as I understand it) does not handle well. Git-annex provides a bit more formality than e.g. rclone/rsync for copying/sharing a single restic repo around (versus backing up to two separate restic repos, where the backups are not identical, since they encrypt or even chunk differently if care is not taken; see the sketch below).
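For the record, that ‘care’ can be taken on restic versions newer than the one discussed here, via copy-chunker-params and the copy command (a sketch with hypothetical paths; even then the two repos are not byte-identical, since they use different keys, which is what makes replicating a single repo attractive):

restic -r /nas/repo2 init --from-repo /local/repo1 --copy-chunker-params   # same chunking parameters
restic -r /nas/repo2 copy --from-repo /local/repo1                         # copy all snapshots across
# repository passwords omitted for brevity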

This very well could be a Rube Goldberg machine, but I saw some buzz around this approach with borg, so I thought it might be worth exploring.

It seems like git-annex is just an intermediary way to shuffle files around, which doesn’t solve the underlying problems with using cold storage tiers with restic. If the indexes become damaged, for example, you need to run rebuild-index, and then you need all pack files accessible, which will incur costly restore fees. You can also pretty much never prune, as the amount of money you save is going to be minuscule in comparison to the restore/transfer charges required to figure out what data is used and repack used objects into new packs.

Rclone directly to S3 with the various Glacier tiers makes a lot more sense than introducing git-annex, unless there is something git-annex is doing that’s going over my head. But the fundamental problem remains: sometimes restic needs to be able to access all pack files on demand, and cold storage is fundamentally incompatible with this.
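For illustration, such direct replication could be as simple as this (bucket name and rclone remote hypothetical):

rclone sync /path/to/my_repo s3:my-bucket/restic-repo --s3-storage-class DEEP_ARCHIVE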

2 Likes

Note that with restic 0.12.0 this is no longer true: rebuild-index only needs to access pack files which are not fully covered by the present index (unless you specify --read-all-packs). So if your index is lost entirely, all pack files will be accessed; but if your index is merely damaged and missing only a few pack files, those are the only ones to be read.
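Concretely:

restic rebuild-index                    # reads only packs not fully covered by the index
restic rebuild-index --read-all-packs   # forces reading every pack (expensive on cold storage)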

About cold storage discussions, I also opened the following issue:

2 Likes

This is a really interesting idea. restic and git-annex are my two primary data management tools. I use git-annex extensively and really love it for its stability, its sparseness (i.e. dropping unneeded files), and for tracking data redundancy and consistency across many different types of backends.

That said, I have a hard time wrapping my head around the possible benefits here. My intuition tells me the restic repo is better left to restic to handle, but I definitely applaud the creativity! :)

Edit: I think the main problem with this approach is keeping track of which files in the repo to drop and which to get. It’s definitely possible, but it feels like a lot of bookkeeping (see the sketch below).
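That bookkeeping would presumably build on the usual git-annex primitives, roughly:

git annex numcopies 2          # refuse to drop content below 2 known copies
git annex drop data/ --auto    # drop local packs that are safely stored elsewhere
git annex whereis data/        # audit where each pack file currently lives
git annex get data/            # pull everything back before a restore or full check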

1 Like