Strategies for one-off back-filling with large amounts of data

Hi,

I’m just starting to use restic and I want to import a lot of data from my existing rsnapshot backups.

I’ve already imported the most recent rsnapshot iteration, its daily.0 directory tree, so I’m satisfied that I could set backups going again from this point.

Judging by the smaller on-disk size of the daily.0 that is now in restic, though, it looks like I would get a healthy amount of disk space back thanks to restic’s compression and better deduplication, so I am thinking of importing the rest of the rsnapshot directory tree as well.

That is several hundred million files though, mostly hard links, coming off HDD-based storage, so it won’t be fast. I am concerned about locking.

My repo is accessed via rest-server. I know from my research that concurrent access is possible, but will a long-running backup job like this prevent me from doing routine maintenance such as adjusting the tags on snapshots?

I think it will prevent me from doing any forget or prune until it completes, but it should not interfere with all the client machines doing their regular backups, right? If so, I am okay with that.

The top level of my old rsnapshot backups is a big list of directories like daily.0 through daily.6 and then more for weekly and monthly. Is there any benefit to be had in splitting the import up into one snapshot per top-level directory?

It would give me a window for doing any of the things that require an exclusive repository lock, but since I expect to script the whole import I wouldn’t necessarily know when those free windows would occur. Maybe I could script it to run only between certain hours…
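Something like this hour gate is what I have in mind for the import script (an untested sketch; the window and paths are made up):

```
#!/bin/bash
# Only run the import between 01:00 and 06:59, outside the hours when I might
# want to do maintenance that needs an exclusive lock.
hour=$(date +%H)
if [ "$hour" -lt 1 ] || [ "$hour" -gt 6 ]; then
    echo "Outside the import window, stopping." >&2
    exit 0
fi
restic backup /srv/rsnapshot/daily.1 --tag rsnapshot-import
```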

I think that’s as specific as I can get with the questions, just asking about general strategy here really.

Thanks,
Andy

Yes, this is correct. Any operation that requires an exclusive lock on the repository won’t be able to acquire one while the long backup is ongoing. You might find this list of operations mapped to lock types helpful: List of lock rules in documentation - #2 by cdhowie
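For example, while the long backup holds its (non-exclusive) lock, something like this will fail rather than run, because tag needs an exclusive lock (the repository URL here is made up):

```
# Fails while the backup's lock is held; newer restic versions can be told
# to wait for the lock instead with --retry-lock.
restic -r rest:http://backuphost:8000/myrepo tag --add imported latest

# Inspect the locks currently held on the repository:
restic -r rest:http://backuphost:8000/myrepo list locks
```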

Doing this would split the data ingest into smaller chunks, but how useful that is depends on how long the full ingest takes on your system.
The biggest benefit I can think of would be using restic rewrite to give each new snapshot the correct “created time” (Working with repositories — restic 0.18.0 documentation). restic forget could then expire the older rsnapshot snapshots on your regular schedule. However, this would be a fair amount of manual work, so I’m not sure it is worth the hassle.
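A minimal sketch of that rewrite step (the snapshot ID, date, and retention policy here are made up; in current restic versions the flag is --new-time):

```
# Stamp an imported snapshot with the time of the rsnapshot interval it came
# from; --forget drops the original, un-stamped snapshot afterwards.
restic rewrite --forget --new-time "2024-06-01 03:00:00" 1a2b3c4d

# Once all imported snapshots carry their real times, the usual policy applies:
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12
```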

I think this is possibly overcomplicating things, and personally I’d simply not run forget/prune until the ingest from rsnapshot is finished. Anecdotally, I run forget+prune about once a month for my “on-site” repo, so it wouldn’t be a great hardship to suspend the scheduled forget job while a long-running backup was going. You also probably don’t have much to forget yet, as it sounds like you have just started taking restic backups 🙂

@grifferz My advice: scanning can take a long time and hard links are a bit tricky for restic (details below). Do some trials and then decide on your strategy.

BACKGROUND

I have some experience with restic and hard links: I had a script that created daily hard-link snapshots of a synchronised data set, which were then backed up weekly with restic. The dataset was about 100k files and 200 GB in size.

My experience is that scanning takes increasingly long as restic investigates each new hard link. So even though restic deduplicates at a fantastic rate if your data is mostly static, the scanning took very long, and IIRC check operations did too.

What I ended up doing was abandoning the daily hard links and just running a daily restic backup of the synchronised data. I also created a new repo to start fresh and fast. So far the speed has kept up, and I am several years into the daily backups.

ALTERNATIVE

Instead of backing up your rsnapshot data in place (is that what you mean by “import”?), you could script restoring each of your rsnapshot ‘snapshots’ into the same reference root folder, and then back that folder up into the restic repo with some useful tags. That way I think you can avoid the burden of hard links.
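A rough sketch of what I mean, using rsync to materialise each interval (the paths, tags, and interval list are just examples):

```
#!/bin/bash
# Copy each rsnapshot interval (oldest first) into a fixed staging folder and
# back that folder up. restic always sees the same path, and because rsync is
# run without -H, hard links arrive as ordinary files.
src=/srv/rsnapshot
ref=/srv/restic-staging
for interval in monthly.2 monthly.1 monthly.0 weekly.1 weekly.0 \
                daily.6 daily.5 daily.4 daily.3 daily.2 daily.1 daily.0; do
    rsync -a --delete "$src/$interval/" "$ref/"
    restic backup "$ref" --tag imported --tag "$interval"
done
```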

Hope this makes some sense.

Thanks. I have just over 411 million files/hard links in my rsnapshot file tree, and I intend to stop using rsnapshot because it’s now extremely painful to traverse all of that to manage it in any significant way.

So when I say “import”, it’s because my intention is to get as much as possible from rsnapshot into restic and then use only restic going forward. The initial backups would come from the rsnapshot host, but subsequent backups would come from each individual host.

As rsnapshot has many “interval” directories (e.g. daily.0, daily.1, etc.) with subdirectories for the individual hosts below those, I don’t feel the need to run restic backup on the whole lot at once. I can run restic backup on each host subdirectory in turn and then set its time with restic rewrite --new-time, based on the time of the directory it is in.

I have found it manageable to do this, and in fact have written a script that handles an entire interval of the rsnapshot store. It has got through all seven daily.* and all four weekly.* intervals so far, though each interval takes many hours.
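The core of it looks roughly like this (simplified, with hypothetical paths; the real script does more checking):

```
#!/bin/bash
# Back up each host's subdirectory of one rsnapshot interval, then stamp the
# resulting snapshot with the interval directory's mtime.
interval=/srv/rsnapshot/daily.3
itime=$(date -d "@$(stat -c %Y "$interval")" '+%Y-%m-%d %H:%M:%S')
for hostdir in "$interval"/*/; do
    host=$(basename "$hostdir")
    restic backup "$hostdir" --host "$host" --tag rsnapshot
    # --forget drops the original snapshot once the re-timed copy exists
    restic rewrite --forget --new-time "$itime" latest
done
```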

I really don’t like restic’s lack of support for relative backup paths, though. I’m evaluating rustic at the moment for the actual backup jobs, because it does support relative paths and also lets me set the time on the backup command itself rather than having to rewrite it afterwards. I don’t know if I will end up using it, but if I do then I will have to keep using it.
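For example, from inside an interval the whole thing becomes a single step (a sketch based on my reading of the rustic docs so far; the path and time are made up, and flags may differ between versions):

```
# Relative path plus the backup time in one command, no rewrite needed:
cd /srv/rsnapshot/daily.3/somehost
rustic backup . --time "2024-06-01 03:00:00" --tag rsnapshot --host somehost
```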

Still evaluating!