Restic slow to snapshot new hard links of large files

sanner · December 6, 2018, 8:14pm

New to restic but liking what I’ve seen. However, while restore testing with hard links works well, creating multiple backups/snapshots that contain new hard links (but unchanged data) is slow. (I’m using an XFS filesystem.)

Use case: I use rsync to back up a full filesystem/disk to a second disk. I then run rsync daily to generate a new snapshot of this filesystem using rsync’s “--link-dest=DIR” option. This generates a new directory tree that is mostly hard links to the previous backup/snapshot.

I was hoping I could then use restic to back up these multiple hard-linked local snapshots (creating a cloud-based snapshot of each) and that, as long as no data was modified, it would be quick. However, it seems this is not a good use case for restic, as restic needs to compare the data between hard-linked files, which slows the process down considerably.

Is there an option/way to more efficiently handle hard links on backup?

Thank you.

cdhowie · December 6, 2018, 10:04pm

Can you give us an example of the scripts that perform these operations, including invoking restic? In particular, when you perform a backup, are the paths you give to restic always the same, or do they change every backup? My suspicion is that the path differs each time.

When performing a backup, restic always looks at absolute paths and never relative. If there is a snapshot with the same hostname and exactly the same set of paths, restic uses this as the “parent” and will look at file metadata (at least mtime and size). If the metadata is identical in the parent snapshot, restic does not scan the file contents at all and assumes the file did not change.

If restic can’t find another snapshot with the same absolute paths then restic has no parent snapshot to use (and the absolute paths wouldn’t line up anyway, so restic wouldn’t know which file corresponds to which) and so it has to process the contents of all files. Obviously it will be able to deduplicate the data, but it doesn’t know it can until it hashes each chunk of each file.

Note that this has nothing to do with hard links, necessarily. It’s just that restic is only able to skip over unchanged files efficiently if the absolute path is the same as it was in the last snapshot, and the file metadata hasn’t changed.
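The parent-snapshot check described above can be pictured with a simplified Python model. This is a sketch, not restic's actual code: the `Snapshot` structure and both function names are made up for illustration, and only mtime and size are compared, which is the minimum the post mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    hostname: str
    paths: frozenset                            # absolute paths given to the backup
    files: dict = field(default_factory=dict)   # absolute path -> (mtime, size)


def find_parent(snapshots, hostname, paths):
    """Return the newest snapshot with the same host and the exact same path set."""
    wanted = frozenset(paths)
    for snap in reversed(snapshots):   # assumes the list is ordered oldest to newest
        if snap.hostname == hostname and snap.paths == wanted:
            return snap
    return None


def must_read_contents(parent, path, mtime, size):
    """Return True if the file has to be chunked and hashed again."""
    if parent is None:
        return True                    # no parent snapshot: every file is read
    return parent.files.get(path) != (mtime, size)
```

In this model, backing up DIRB after DIRA finds no parent, so every file is read in full even though its contents already exist in the repository.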

sanner · December 7, 2018, 8:05am

Yes, the paths differ every time but the data is unchanged (all files are hard linked). I was hoping it would go by inode similar to rsync.

Example:
I have a file in a directory: DIRA/FILE1
restic backup DIRA
mkdir DIRB ; ln DIRA/FILE1 DIRB/FILE1
restic backup DIRB

restic seems to understand these files are hard linked (and restores accordingly) as well as not duplicating the data (any data, regardless of file name) but it does send the data up to the repo each time though it’s already there.
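To make the DIRA/DIRB situation concrete, here is a small Python sketch that recreates it in a temporary directory. The helper name is hypothetical. It shows that the two names share one inode (and therefore the same size and mtime), while their paths still differ, which is the property a path-based comparison cares about.

```python
import os
import tempfile


def hard_link_demo():
    """Create DIRA/FILE1, hard link it as DIRB/FILE1, and return both stat results."""
    root = tempfile.mkdtemp()
    dir_a = os.path.join(root, "DIRA")
    dir_b = os.path.join(root, "DIRB")
    os.mkdir(dir_a)
    os.mkdir(dir_b)

    original = os.path.join(dir_a, "FILE1")
    with open(original, "wb") as fh:
        fh.write(b"unchanged data")

    linked = os.path.join(dir_b, "FILE1")
    os.link(original, linked)   # same effect as: ln DIRA/FILE1 DIRB/FILE1

    return os.stat(original), os.stat(linked)


if __name__ == "__main__":
    a, b = hard_link_demo()
    print("same inode:", (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino))
    print("link count:", a.st_nlink)
```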

I was hoping I could get away by using relative paths but that is not working either. ex:
cd DIRA
restic backup .
cd ../DIRB
restic backup .

But it seems restic is looking at the absolute path only and uploads the file in DIRB that is hard linked to file in DIRA. Will spend more time reading the manual!

I think this workaround might do it (as above, DIRA and DIRB each have one file with the same name, and the two are hard linked):
ln -s DIRA DIRC ; cd DIRC
restic backup .
cd .. ; ln -sfn DIRB DIRC ; cd DIRC
restic backup .
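One detail of the workaround is easy to get wrong: when DIRC already points at a directory, GNU `ln -sf DIRB DIRC` follows the existing link and creates the new link inside DIRA instead of replacing DIRC, which is why `-n` is normally added. The same repointing step can be sketched in Python. The function name is made up for illustration, and the rename trick assumes a POSIX filesystem.

```python
import os
import tempfile


def repoint_symlink(link_path, new_target):
    """Make link_path a symlink to new_target, replacing any existing symlink."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    os.replace(tmp, link_path)   # renames over the old link without following it


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "DIRA"))
    os.mkdir(os.path.join(root, "DIRB"))
    link = os.path.join(root, "DIRC")
    repoint_symlink(link, "DIRA")   # first backup runs from DIRC -> DIRA
    repoint_symlink(link, "DIRB")   # next backup runs from DIRC -> DIRB
    print(os.readlink(link))
```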

Caveat (as one might expect): the example above only works if the CWD is inside the soft-linked dir. Note:
ln -s DIRA DIRC
restic backup DIRC
… processed 0 files, 0 B in 0:00
(doesn’t capture any of the files under DIRC; restic only sees it as a soft link)

I think I can make that work w/ soft linking dir … I guess keeping track of hard links / inodes (like rsync, tar) isn’t how restic does its magic.

Thanks again.

764287 · December 7, 2018, 8:38am

I’ve used rsnapshot in the past, which does all the rsync hardlink thingy automatically. One reason why I switched over to restic is that this method wastes a lot of inodes, and dealing with such a huge number of hard links can be quite slow too.

What’s the benefit of backing up all those hardlinks instead of the original files? restic needs to read and index everything in both cases.

cdhowie · December 7, 2018, 4:53pm

To my knowledge, inode number is not used to locate files in the repository. Consider that multiple filesystems from the same host might be backed up. In fact, multiple hosts might back up to the same repository. Maintaining an index of inode numbers of all files across all snapshots (accounting for collisions where two hosts/filesystems share inode numbers for different files) would be a large amount of new code added to restic to optimize for what is quite honestly an edge case.
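The collision problem can be shown with a few invented entries. In this Python sketch the hostnames, device IDs, inode numbers, and paths are all hypothetical. An index keyed by inode number alone conflates unrelated files, whereas a key of (host, device, inode) keeps them apart.

```python
def index_by_inode(entries):
    """Naive index keyed by inode number alone; later entries overwrite earlier ones."""
    return {ino: path for (_host, _dev, ino, path) in entries}


def index_by_host_dev_inode(entries):
    """Index keyed by (host, device, inode), which identifies one file on one filesystem."""
    return {(host, dev, ino): path for (host, dev, ino, path) in entries}


# Hypothetical entries: (hostname, device id, inode number, path)
ENTRIES = [
    ("hostA", 2049, 131, "/etc/hosts"),
    ("hostB", 2049, 131, "/home/user/report.odt"),
    ("hostA", 2065, 131, "/srv/data/db.sqlite"),
]

if __name__ == "__main__":
    print(len(index_by_inode(ENTRIES)), "entry when keyed by inode only")
    print(len(index_by_host_dev_inode(ENTRIES)), "entries when keyed by (host, dev, inode)")
```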

No data is actually sent “up to the repo.”

Restic splits each file into chunks and hashes each chunk. These chunks are stored separately in the repository as “blobs,” using the hash of the blob contents as its ID. This way, when a file is unchanged, the restic client can simply look in the repository to see if there is a blob with the same ID. If there is, the blob is not re-uploaded and is effectively deduplicated.
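A minimal Python sketch of this content-addressed storage idea follows. It is a toy model under stated assumptions: chunks are a fixed 4 bytes to keep the example small (restic itself uses content-defined chunk boundaries), and the `BlobStore` class is invented for illustration.

```python
import hashlib

CHUNK_SIZE = 4   # tiny fixed-size chunks, for illustration only


class BlobStore:
    def __init__(self):
        self.blobs = {}     # blob ID (hex SHA-256 of the chunk) -> chunk bytes
        self.uploaded = 0   # number of chunks actually stored

    def backup(self, data):
        """Store data and return the list of blob IDs that make up the file."""
        ids = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            blob_id = hashlib.sha256(chunk).hexdigest()
            if blob_id not in self.blobs:   # only unknown chunks are uploaded
                self.blobs[blob_id] = chunk
                self.uploaded += 1
            ids.append(blob_id)
        return ids
```

Backing up the same bytes a second time still hashes every chunk, but `uploaded` does not grow. That matches the behavior described here: CPU time is spent, yet no new data is stored.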

What you perceive as data being sent is probably just the hashing activity on the client, which does take some time. If you measure CPU usage against network usage, you should see that the network usage is nearly nothing, while the CPU is working hard to hash all of the files.

The “parent snapshot” concept simply allows restic to skip the hashing step if the metadata of the file hasn’t changed. Without a file in a previous snapshot to compare metadata against, restic has to process all of the file’s data, but the data only gets sent to the server if it doesn’t already exist in the repository from an earlier snapshot.

As @764287 pointed out, it would be much better to simply back up against the original files. The paths will be consistent between snapshots, which will allow restic to properly locate a parent snapshot. You should then see a dramatic reduction in the amount of time it takes to create a snapshot – if little data has been changed, this is not much more time than it would take to crawl the backed-up paths using find or a similar tool.

sanner · December 7, 2018, 6:33pm

Good to hear the data is not being sent up but the chunking does take time… that makes sense.

Agreed, as I understand a bit more of how restic works (thanks for your descriptions), it would be much better to back up against the original files, but I was hoping to use restic on top of our current system, where we have local access to each snapshot. I believe I can make it work with the soft link workaround (above), keeping the paths the same, even though I would be backing up separate (but hard-linked) snapshots.

Currently I have a few hundred machines snapshotted daily, with a process that goes something like this:
cd machineA
rsync -aH --link-dest=../DateA machineA:/ DateB
(following day: rsync -aH --link-dest=../DateB machineA:/ DateC)
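The daily rotation could be scripted along these lines. This Python sketch only builds the rsync argument list. The function name, the `/snapshots` layout, and the date-named directories are assumptions, not the poster's actual setup. An absolute `--link-dest` path is used because rsync interprets a relative one as relative to the destination directory.

```python
import os


def rsync_snapshot_command(host, snapshot_root, previous, today):
    """Build the rsync argv for one daily snapshot of `host`."""
    link_dest = os.path.join(snapshot_root, host, previous)   # yesterday's snapshot
    destination = os.path.join(snapshot_root, host, today)    # today's snapshot
    return [
        "rsync", "-aH",
        "--link-dest=" + link_dest,   # unchanged files become hard links into this tree
        host + ":/",
        destination,
    ]


if __name__ == "__main__":
    print(" ".join(rsync_snapshot_command("machineA", "/snapshots", "2018-12-06", "2018-12-07")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on the backup server.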

This does generate a lot of local hard links, one for every file in every snapshot, but it allows for browsable snapshots for every day the machine was backed up / snapshotted. We keep a month’s worth of snapshots.

The current regime is that we tar each machine’s directory (which contains a hard-linked snapshot for each day) monthly. Tar understands hard links, and after compressing/encrypting we push each machine’s tar file to tape. (Hard link overhead is just a couple of percent of a host’s disk usage.) Perhaps not the most elegant solution, but it’s been working well for 10 years, and rsync works well on the client side.

Would like to take tape out of the equation while keeping a local copy of each snapshot.

Thanks much… again.

cdhowie · December 7, 2018, 7:22pm

No worries, happy to help.

You mentioned that you wanted to keep “local access.” What we do with our production servers is back them all up with restic to a dedicated backup server in the same data center, so that we can restore quickly. But we also rclone sync this repository daily with S3 so that we have an off-site copy for disaster recovery (e.g. tornado takes out data center).

The backup server runs the REST server in append-only mode, so a public-facing system compromise does not endanger the backups.

Restic’s model of “a new backup introduces new files but does not change any” means that syncing with S3 is very fast; only the new files are uploaded.

Note that we back up all servers to the same repository. This means that all servers deduplicate with each other. At last check (several weeks ago) we had 5TB of logical snapshots (how much disk space it would take to restore every snapshot in the repository; the total backup size prior to deduplication) in a 120GB repository. With that kind of efficiency, there is no reason not to do full system backups of / (excluding transient things like /tmp and /var/cache). This has the advantage of being able to rebuild a server by booting from live media, partitioning, restoring from restic into the partition, installing the bootloader, and rebooting. (Yes, I tested that. It does work!)
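For a sense of scale, the figures quoted above work out to a deduplication ratio of roughly 42 to 1. The arithmetic is shown below in Python, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treating both numbers as the rounded values given in the post.

```python
TB = 10**12
GB = 10**9

logical_bytes = 5 * TB         # size of all snapshots if each were restored in full
repository_bytes = 120 * GB    # actual size of the repository on disk

ratio = logical_bytes / repository_bytes
print(f"deduplication ratio: about {ratio:.1f}x")
```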

sanner · December 10, 2018, 8:43pm

Sounds like a great setup, especially with the dedup and offsite S3 disaster recovery. Disaster recovery is why we’re looking at restic and cloud backups.

Does the REST server in append-only mode block a host/client from reading another host’s data without creating unique accounts per host and using --private-repos? And when using --private-repos, can dedup still occur? (Since many of the machines we back up are desktops/laptops, we don’t trust them at all and they need strict isolation.) If unique accounts per host are needed to provide isolation (assuming dedup still works), that would add more complexity but would still be manageable.

With our current rsync based backups all the control and file access remains on the backup server and the only setup required on a host/client is ssh keys. The backup server decides when to initiate a snapshot and since it’s “pulling” the snapshot the clients don’t have access to data on the server (assuming rsync protocol between client/server instance is secure).

Is there any such model/option with restic to allow a backup server to initiate “pulling” data from a client?

Thanks for all your help/info.

cdhowie · December 10, 2018, 8:49pm

No. Append-only mode does not restrict reads: if you have access to a repository, you can necessarily read everything in the repository.

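Not between repositories. With --private-repos each account is confined to its own repository path, and restic only deduplicates within a single repository. Currently, if deduplication happens between X and Y, then X and Y can read each other’s data.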

Now this is where things get interesting. If the machines don’t have direct read or write access to the repository, then deduplication indeed might be possible.

If, on the backup server, you can mount each machine to be backed up as a filesystem somehow, then restic could read that filesystem. A huge downside of this approach is that the backup server will potentially have to pull data that is already in the repository, because it won’t know if the data can be excluded due to deduplication until it has read the data from the machine.

A possible solution would be to support two restic instances talking directly to each other: the machine being backed up can run a restic in “client mode” that sends all file metadata to the other restic in “server mode,” and the server can then request chunk hashes and chunk contents from the client as it determines what could be changed. This would preserve the efficiency of restic while allowing the server side to impose access controls.
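To make the idea concrete, here is a purely hypothetical Python sketch of such an exchange. None of it corresponds to an existing restic mode. The `Client` and `Server` classes, their method names, and the 4-byte fixed-size chunks are all invented for illustration.

```python
import hashlib


def chunk_ids(data, size=4):
    """Split data into fixed-size chunks and return (id, chunk) pairs."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


class Client:
    """Runs on the machine being backed up; never sees the repository."""

    def __init__(self, name, files):
        self.name = name
        self.files = files   # path -> (mtime, bytes)

    def metadata(self):
        return {p: (mtime, len(data)) for p, (mtime, data) in self.files.items()}

    def hashes(self, path):
        return [cid for cid, _ in chunk_ids(self.files[path][1])]

    def chunk(self, path, wanted_id):
        return dict(chunk_ids(self.files[path][1]))[wanted_id]


class Server:
    """Holds the repository and decides what to request from each client."""

    def __init__(self):
        self.blobs = {}        # chunk ID -> bytes
        self.known = {}        # (client name, path) -> (mtime, size) from the last run
        self.transferred = 0   # chunks actually pulled from clients

    def pull(self, client):
        for path, meta in client.metadata().items():
            key = (client.name, path)
            if self.known.get(key) == meta:
                continue                      # metadata unchanged: skip the file
            for cid in client.hashes(path):
                if cid not in self.blobs:     # request only chunks not yet stored
                    self.blobs[cid] = client.chunk(path, cid)
                    self.transferred += 1
            self.known[key] = meta
```

In this model a second client holding identical data sends only metadata and hashes, so deduplication across hosts happens without any client being able to read the repository.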

However, nothing like this currently exists. I suspect @fd0 may be interested in the idea, but it would be a pretty large undertaking to implement this mode, and largely flows against the restic model of “the storage backend is dumb.”

Prevent hosts from reading each others' data

opened 02:28PM - 28 Aug 18 UTC

closed 12:39PM - 17 Feb 20 UTC

colans

type: feature suggestion

Restic version: `restic 0.9.1 compiled with go1.10.1 on linux/amd64`

What should restic do differently? Which functionality do you think we should add?

From https://github.com/restic/restic/issues/784:

> Better host separation, which allows a host to read only its own data (or more generically, only read data it has rights to). This does allow a host to recover its own files, without accessing other hosts' data.

What are you trying to do?

Prevent hosts from reading each others' data. I think this would make the most sense as an `rclone serve restic` option so that each host can't override the Restic setting, but I'm not sure that rclone understands the concept of hosts. If it does, we should move this issue over there. Otherwise, we'd have to figure out a way to do it on the Restic side that prevents the option from being overridden (if that's even possible).

Did restic help you or made you happy in any way?

It's the best component of a backup system I've ever seen!

sanner · December 11, 2018, 1:42am

It makes sense that deduplication only works if X and Y can see each other’s data.

Perhaps restic could still operate in a “the storage backend is dumb” mode if it could split the logic and run part of it on a trusted local machine with access to the repo, while also operating on a remote host via ssh (a model similar to rsync over ssh). Or access remote files by integrating something like sftp or sshfs?

In one sense this is possible now if the remote host has file sharing enabled. Just mount the remote disk on the local backup server that is running restic and back up from the mount point. (This is actually how I back up Windows boxes: by running rsync on a local CIFS mount point.) I haven’t tried this over sshfs yet.

cdhowie · December 11, 2018, 2:06am

Yep, that’s the concept I was trying to get at here:

I have not personally tried using sshfs with restic, but I have no reason to believe it wouldn’t work – though I’d make sure the backup server is on the same LAN to achieve a reasonable backup speed.