File-system repository overlays with symlinks (for both `rest-server` and `restic`)

ciprian.craciun · February 12, 2024, 6:05pm

What is the problem I’m trying to solve?

I currently have a “long-term” Restic repository (stored on the file-system, accessed via rest-server), whose “long-term” snapshots are to be kept indefinitely (or until I run out of storage);
I synchronize this repository offsite with rsync, over WAN (thus high-latency, low-bandwidth) to a thin-client with a USB-attached disk (thus limited IOPS and bandwidth), allowing only new files, never deleting, never updating (thus, after reindexing or pruning, a manual synchronization is needed to remove obsolete files);
when creating the “long-term” snapshots I make sure to exclude a bunch of folders that are either volatile (temporary or cached files), frequently changing (like virtual machine images), or “regenerable” (either publicly available from other places, or derivable from source material, like thumbnails);
now I want to be able to create “full-short-term” snapshots without excluding anything, snapshots that I can easily prune, but without disturbing the rsync-based synchronization of the long-term repository.

Thus, I need to use a new “short-term” repository, because I don’t want to run restic forget on the “long-term” repository. (Both due to the rsync limitation, and because I really care about those snapshots and don’t want to delete any by mistake. In fact I keep a copy of the actual “snapshots” files in Git, but if the packs they refer to go missing, they are pointless…)

However, given that the “short-term” repository would contain almost all the same data as the “long-term” repository (perhaps >75%), I want to be able to share data from the “long-term” repository, but only one-way, from “long-term” towards “short-term”, and never the other way around.

So, the question I’m posing to the Restic community is how to achieve this? (If it is even possible.)

My current idea is something along the following lines:

initialize the “short-term” repository by copying the parameters from the “long-term” one; (this will make packs compatible between the two;)
symlink packs from the “long-term” repository into the “short-term” one, thus achieving the one-way sharing between the two;
from time-to-time, create a new “short-term” repository, link the current packs from the “long-term” one, and copy (via restic copy) the last few snapshots, then replace the old “short-term” repository with the new one;
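
The first two steps above could be sketched roughly as follows. This is only an illustration, not a supported Restic workflow: the function name `create_overlay` and its parameters are made up for this example, and it assumes the usual on-disk layout (a `config` file, a `keys` folder, and packs under `data/00` to `data/ff`). It links only the packs; index files are not handled here.

```python
import os
import shutil
from pathlib import Path


def create_overlay(long_term: Path, short_term: Path, use_hardlinks: bool = True) -> int:
    """Create `short_term` as an overlay of `long_term`; return the number of packs linked.

    Hypothetical helper: restic itself offers nothing like this.
    """
    short_term.mkdir(parents=True)  # fails if the overlay already exists

    # "Copying the parameters" means copying the config file and the keys folder.
    shutil.copy2(long_term / "config", short_term / "config")
    shutil.copytree(long_term / "keys", short_term / "keys")

    # The remaining top-level folders start out empty.
    for name in ("index", "snapshots", "locks"):
        (short_term / name).mkdir()

    # Packs live in 256 sub-folders named after the first byte of the pack ID.
    for prefix in range(256):
        (short_term / "data" / f"{prefix:02x}").mkdir(parents=True)

    linked = 0
    for pack in sorted((long_term / "data").glob("*/*")):
        if not pack.is_file():
            continue
        target = short_term / "data" / pack.parent.name / pack.name
        target.parent.mkdir(exist_ok=True)
        if use_hardlinks:
            # Hardlinks only work when both repositories are on the same file-system.
            os.link(pack, target)
        else:
            os.symlink(pack.resolve(), target)
        linked += 1
    return linked
```

For example, `create_overlay(Path("/srv/restic/long-term"), Path("/srv/restic/short-term"))` (both paths are placeholders) would hardlink every existing pack. Passing `use_hardlinks=False` creates symlinks instead, which matches the original idea more closely.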

Instead of symlinks to packs, I could use other techniques, each with its own problems:

hardlinks – should be transparent to Restic / rest-server, but not explicit enough for the user;
reflinks / COW on file-systems that support the feature (Btrfs, Bcachefs, XFS, but not Ext4, the one I’m using right now);
FUSE or kernel overlay mounts – perhaps overly complex…

Now, based on what I understand from the way Restic / rest-server works, this setup should work just fine, as long as:

I don’t run restic prune or restic forget in the “long-term” repository; (unless I use hardlinks or reflinks / COW;)
I don’t run restic prune with repacking on the “short-term” repository, because that might start copying data from the “long-term” repository, thus defeating the sharing;

I am assuming all this is safe, because neither Restic nor the rest-server would ever update existing files; they would always create new files and move them into the proper folder. (Is this assumption correct?)

Building upon this idea, I could even envisage something like a multi-tiered approach to pack files storage:

have something like S3 or B2 mounted as a file-system for “really-long-term” storage;
have a repository initialized on that S3 mount-point; (assume that we have a full-repository here;)
have another mount point where a fast file-system is mounted (say something like RAID1 over SSDs;)
copy all the repository folders from S3 to this local filesystem, except the data folder;
create a new data local folder, and symlink files from the S3 data folder;
always use the local repository for current backups;
have a background job that copies files from the local data folder to the S3 one, then replaces them with symlinks; (the same synchronization should be done for the other folders;)
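
The background job in the last step might look something like the sketch below. It is an assumption-laden illustration: `tier_out` is a made-up name, both data folders are assumed to be reachable as ordinary paths, and the rename-based swap is only atomic on file-systems where `rename` is atomic (which may not hold for an S3 mount).

```python
import os
import shutil
from pathlib import Path


def tier_out(local_data: Path, remote_data: Path) -> list[str]:
    """Copy regular files from `local_data` to `remote_data`, then replace each
    local file with a symlink to its remote copy. Returns the names moved."""
    moved = []
    for pack in sorted(local_data.glob("*/*")):
        # Skip files that were already tiered out, and anything that is not a file.
        if pack.is_symlink() or not pack.is_file():
            continue
        dest = remote_data / pack.parent.name / pack.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():
            # Copy under a temporary name so a partial copy never looks complete.
            partial = dest.with_name(dest.name + ".partial")
            shutil.copyfile(pack, partial)
            os.replace(partial, dest)
        if dest.stat().st_size != pack.stat().st_size:
            raise RuntimeError(f"size mismatch for {pack.name}, keeping local copy")
        # Create the symlink under a temporary name, then rename it over the file.
        tmp_link = pack.with_name(pack.name + ".link")
        os.symlink(dest.resolve(), tmp_link)
        os.replace(tmp_link, pack)
        moved.append(pack.name)
    return moved
```

Running it a second time is a no-op, because files that are already symlinks are skipped. The size comparison is a cheap sanity check only; it does not prove the remote copy is intact.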

MichaelEischer · February 17, 2024, 4:59pm

The only way to achieve that is to copy the config file and the keys folder from the long-term repository. However, then both repositories will have the same ID which will lead to unexpected interactions with restic’s cache.

Existing files are never modified. But you have to ensure that deleted files don’t randomly reappear, as otherwise the repository might become corrupted. As long as that is guaranteed this overall approach might work. But it might stop working at some point in the future as it’s not supported.
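
Since files are never modified, one cheap sanity check for linked packs is to confirm that each one still matches its name. The sketch below relies on the repository format naming each file after the SHA-256 hash of its contents; the helper name `find_mismatched` is made up for this example, and it is no substitute for restic check.

```python
import hashlib
from pathlib import Path


def find_mismatched(data_dir: Path) -> list[Path]:
    """Return the files under `data_dir` whose name is not the SHA-256 of their contents."""
    bad = []
    for pack in sorted(data_dir.glob("*/*")):
        if not pack.is_file():  # follows symlinks, so linked packs are checked too
            continue
        digest = hashlib.sha256()
        with pack.open("rb") as handle:
            # Read in 1 MiB chunks so large packs are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != pack.name:
            bad.append(pack)
    return bad
```

An empty result only says the pack contents are unchanged; it says nothing about whether index files and packs are still consistent with each other.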

Restic expects being able to randomly access each file in the repository. But the most important question here is whether that setup will be reliable / simple enough to allow you to restore your data when you need it.

ciprian.craciun · February 18, 2024, 10:35am

Indeed, I had forgotten that “copying parameters” implies actually copying the repository files, and not using restic init. (Thanks for the reminder, though, as I would certainly have forgotten about this when experimenting…)

However, I didn’t know about the repository ID. Is there a way to reset or change the ID? (I could easily patch Restic myself to allow specifying the ID, but perhaps there is a simpler alternative…)

Also, as an additional question, given that two repositories share mostly the same files (and one is a strict subset of the other one), and have the same ID (as the result of copying the config), and thus share the same cache, is this actually something that might break the backup process? Because, as I see it, in the worst case the cache might contain unwanted index files belonging to the other repository.

Given that deleted snapshot files don’t re-appear in the repository, my take is that both pack and index re-appearing shouldn’t irrevocably break the repository: a prune and re-index should remove the re-appeared files, correct?

Is there another way in which the repository can be broken, especially if pack files re-appear?

I wonder how many other non-standard uses are out there in the wild.

BTW, when I chose Restic as my backup solution (previously I used rdiff-backup), after reading the specification (especially the storage parts), the simplicity and flexibility of Restic were among the key deciding factors. Thus, I hope these two properties will remain the same long-term.

MichaelEischer · February 18, 2024, 4:47pm

There’s no official way to do it. I’ve rebased my experimental branch (Commits · MichaelEischer/restic · GitHub). Compile restic using `go build -tags debug ./cmd/restic`; then you can use the command `restic debug changeID i-understand-that-this-could-break-my-repository-and-i-have-created-a-backup-of-the-config-file some-random-id`. The first execution will fail; then replace some-random-id with the expected value.

Restic removes files from the cache that no longer exist in the repository. That can lead to cache thrashing depending on how far the repositories have diverged.

If an index file reappears, then backups can end up referencing pack files that no longer exist but are still part of that index. That can lead to broken snapshots. Pack files are unproblematic in that regard; restic will just ignore them as long as they are not included in an index file.

Probably quite a few. It’s unlikely that the design principles of the repository format will change, but things might still break in subtle ways when the repository format evolves. As long as you stay on a fixed repository format version, things are unlikely to break.

ciprian.craciun · February 21, 2024, 7:18pm

First of all, thanks to @MichaelEischer for providing the changeID debug command that allows changing the repository ID. (I’ve updated the patch to apply correctly over the latest Restic release.)

Now, I’ve tried to implement my idea with symlinks, and restic check (directly on the folder) complains with the following, most likely because it stats the symlink and not the actual file:

pack 3247...: unexpected file size: got 88, expected 135555555

Using hardlinks seems to do the trick. (Although, and I haven’t tested this, perhaps when using rest-server it would just ignore the symlinks and treat them as proper files.)
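
To illustrate the suspected cause: on POSIX systems, an `lstat` on a symlink reports the length of the link’s target path, while `stat` follows the link and reports the size of the real file. The 88 bytes in the error above would be consistent with the length of a target path, though that is not confirmed here. A minimal sketch (the helper name is made up, and nothing here calls Restic):

```python
import os
import tempfile
from pathlib import Path


def link_and_target_size(path: Path) -> tuple[int, int]:
    """Return (size reported without following symlinks, size after following them)."""
    return os.lstat(path).st_size, os.stat(path).st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pack = Path(tmp) / "pack"
        pack.write_bytes(b"x" * 4096)  # stand-in for a real pack file

        symlink = Path(tmp) / "symlink"
        os.symlink(pack, symlink)
        hardlink = Path(tmp) / "hardlink"
        os.link(pack, hardlink)

        # For the symlink the two sizes differ; for the hardlink they are the same.
        print("symlink :", link_and_target_size(symlink))
        print("hardlink:", link_and_target_size(hardlink))
```

A hardlink is just another name for the same file, so both calls report the real size, which fits the observation that hardlinks pass restic check.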

I’ll experiment over the next few days and report back.