Archival of rsync/hardlink based backup repositories (NTFS)

(Not sure if this is off-topic here. I’m sorry in case you think it is. Feel free to change the category.)

My situation:
rsync/hardlink-based backup repositories on several NTFS-formatted external HDDs.
Planning to rely completely on restic backup (plus some static archive) from now on :slight_smile:

I'd like to archive a "superset" of all the rsync-backed-up files before dissolving the old repositories (losing the snapshot hierarchy is acceptable, maybe even preferable).
(Just in case something important had been deleted from the backup source that shouldn't have been deleted for some reason.)

Has anyone tried that before? I'd be interested in any howtos, best practices, hints, or other thoughts on this.

Thanks.

Hi @Sina
If I understand your question correctly, I think I recently went through a similar process.

I moved to restic from a backup system based on rsnapshot (rsync + hardlinks). ext4-formatted rather than your NTFS, but I don’t think that matters.

I wanted to move the original repository from a USB HDD into restic as cleanly as possible. The objectives were to:

  1. Maintain the same “snapshot” structure
  2. Tag snapshots as being from the old system
  3. Store the original snapshot creation times
  4. Maintain the same paths between snapshots.

While (4) was unnecessary (restic deduping would work, irrespective of path), it made browsing old snapshots a bit more elegant.

Once mounted, my old repository was at the path /media/veracrypt2/ext4_4tb/backup/. This directory contained multiple snapshots, each in its own directory (daily.0, daily.1, etc.).

To keep it simple, here's a tidied-up version of the Bash script, which achieves only (1)-(3) above. Hopefully I didn't break anything during the tidying, and apologies for my mediocre Bash abilities:

export RESTIC_REPOSITORY=<path_to_new_repo>
export RESTIC_PASSWORD=<password>
export RESTIC=./restic_0.9.5_linux_arm

RESTIC_TAG=rsnapshot
SOURCE_ROOT=/media/veracrypt2/ext4_4tb/backup

for RESTIC_SOURCE in "$SOURCE_ROOT"/*/; do
    # use the snapshot directory's modification time as the restic snapshot time
    RESTIC_TIME=$(date "+%Y-%m-%d %H:%M:%S" -r "$RESTIC_SOURCE")
    echo "$RESTIC_SOURCE" "$RESTIC_TIME"
    time $RESTIC backup --tag "$RESTIC_TAG" --time "$RESTIC_TIME" "$RESTIC_SOURCE"
done

In order to achieve (4), I made a symbolic link to each snapshot directory before each restic run. I'm guessing there is a cleaner solution, but I'm a newbie and it got the job done. Since the symlink was in a root-owned directory, I also had to sudo a few of the commands (an unnecessary complication here). Here's the additional variable needed for my case, along with the updated for loop:

RESTIC_SYMLINK_PATH=/mnt/rsnapshot

for RESTIC_SOURCE in "$SOURCE_ROOT"/*/; do
    RESTIC_TIME=$(date "+%Y-%m-%d %H:%M:%S" -r "$RESTIC_SOURCE")
    echo "$RESTIC_SOURCE" "$RESTIC_TIME"
    # recreate the symbolic link and point the restic command at it
    sudo rm -f "$RESTIC_SYMLINK_PATH"
    sudo ln -s "$RESTIC_SOURCE" "$RESTIC_SYMLINK_PATH"
    time $RESTIC backup --tag "$RESTIC_TAG" --time "$RESTIC_TIME" "$RESTIC_SYMLINK_PATH"
done

This process obviously took quite a while to run, and I had to stop it halfway through for other reasons. A hackish way to resume was simply to enclose the restic command in an if statement that skipped a manually maintained list of already-completed snapshots.
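
For anyone wanting to do the same, here is a rough sketch of that resume hack (untested in this exact form; the list of already-imported directory names is made up):

# snapshot directories that had already been imported before the interruption
COMPLETED="daily.0 daily.1 daily.2"

for RESTIC_SOURCE in "$SOURCE_ROOT"/*/; do
    SNAPSHOT_NAME=$(basename "$RESTIC_SOURCE")
    # only run restic for snapshots that are not in the completed list
    if ! printf '%s\n' $COMPLETED | grep -qxF "$SNAPSHOT_NAME"; then
        RESTIC_TIME=$(date "+%Y-%m-%d %H:%M:%S" -r "$RESTIC_SOURCE")
        time $RESTIC backup --tag "$RESTIC_TAG" --time "$RESTIC_TIME" "$RESTIC_SOURCE"
    fi
done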

Hope that’s of some help.

First-time-posting happy restic newbie: nice work guys - your tool is great!

@Nev, thanks for sharing :+1:

I tried your script, but it does not work: restic just makes a backup of the symlink itself, not of the linked folders/files. Can you help here?

Also, can you please explain step (4) - how does it help, or is it just cosmetic?

Thanks for sharing all those details about your approach!

To make sure, I created another topic.

To make sure I understand correctly: you mean filesystem paths?

Did you identify the main reason it took so long? Do you have any guess about the "overhead" caused by the hard links, compared to a single-file scenario (without hard links)?

Please note the difference between hard links (meant here) and symlinks.

Your observation is intended behaviour. The docs say: "Symlinks are archived as symlinks, restic does not follow them. When you restore, you get the same symlink again, with the same link target and the same timestamps."

Yes, but @Nev's script uses ln -s to create the source.
As a workaround I found mount --bind, which acts like a "hard link" for a folder.
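
Roughly like this, as a sketch (reusing the variables from @Nev's script; the bind-mount path is just an example):

RESTIC_BIND_PATH=/mnt/rsnapshot

for RESTIC_SOURCE in "$SOURCE_ROOT"/*/; do
    RESTIC_TIME=$(date "+%Y-%m-%d %H:%M:%S" -r "$RESTIC_SOURCE")
    # bind-mount the snapshot directory onto a fixed path instead of symlinking it
    sudo mkdir -p "$RESTIC_BIND_PATH"
    sudo mount --bind "$RESTIC_SOURCE" "$RESTIC_BIND_PATH"
    time $RESTIC backup --tag "$RESTIC_TAG" --time "$RESTIC_TIME" "$RESTIC_BIND_PATH"
    sudo umount "$RESTIC_BIND_PATH"
done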

@uok @Sina you guys are fixing my errors faster than I can type :sweat_smile:. Yes, the symlink problem was my mistake when simplifying the script for posting. In reality, I pointed restic one level through each symlink, so I avoided (and never even considered) the issue you encountered. And yes, (4) is purely cosmetic.

@Sina, the slow runtime I encountered was simply a function of the amount of data: I had about 20 snapshots on the drive, each around 1 TB in size (although the total archive size was <2 TB thanks to the numerous hard links). I don't think there is any hard-link runtime overhead; it's just that restic had to analyze each snapshot in turn, scanning the full "20 TB" of data.

No worries, I figured it out :wink:
I'm also facing a long migration: converting more than 600 daily backups of about 200 GB each, which means restic needs to scan a whopping 120 TB of data - that will take a while :blush:

I hope restic will recognize multiple hard links as such (on a per-backup-run basis), so that it has to read not much more than the <2 TB in your case when backing them up in a single run. Something like: "Oh, this is a hard link to a file I have already scanned, so let's just store its path/name, date information, etc., and not scan the whole contents again."

Did your hard links to unchanged files differ in any way? Otherwise I would have expected restic to treat them as unchanged and thus save scanning time.

True, I did not think of that - the scanning should go a lot faster as restic finds pointers to the same files.
I’m currently running migration 1 of 600 and will report back.

Looks like restic really does its job: after backing up the first dirvish folder, the following backups are processed much faster. There is hope for my 600+ backups :crossed_fingers:

How much faster? I would be interested in some example durations.

@Nev Do you mean something like

"$RESTIC_SYMLINK_PATH"/subdir

instead of

"$RESTIC_SYMLINK_PATH"

in the restic backup command?

Is "subdir" static, or does it change across all your calls?

I still wonder whether restic took advantage of the high similarity among your backups only in terms of deduplication, or whether it also decreased the backup duration. Non-static absolute paths would probably reduce the chances of benefiting from the high similarity and the partial hard linking…

I'm sorry, I don't have data on that. I don't know how much had changed in each backup I'm importing, so there is no equal basis for comparison. The first backup (after restic init) took several hours; now it takes less than 15 minutes if not many files have changed since the previous day.

@Nev's idea with the fixed path also helps, as restic only needs to scan for file changes.
The paths are already known to restic, so no extra scanning is needed.

So instead of scanning

/source/dirvish/2019-10-09/my/data/folder
/source/dirvish/2019-10-10/my/data/folder

it only needs to scan

/mnt/dirvish/my/data/folder
/mnt/dirvish/my/data/folder
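
As far as I understand, restic picks the most recent snapshot with the same host and paths as the parent and only re-reads files whose metadata (size, mtime) changed. If it ever picks the wrong one, you can set the parent explicitly - the tag, time and snapshot ID below are just placeholders:

# assumes the repository environment variables are already exported
# normal run: restic auto-selects the parent snapshot for /mnt/dirvish
restic backup --tag dirvish --time "2019-10-10 03:00:00" /mnt/dirvish

# force a specific parent snapshot if needed (ID is a placeholder)
restic backup --tag dirvish --time "2019-10-10 03:00:00" --parent 1a2b3c4d /mnt/dirvish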

The numbers you mention (several hours vs. 15 minutes) are in fact what I was interested in :slight_smile: (assuming largely unchanged data).
Thanks for sharing!

…which doesn't differ from your idea/implementation, does it? (Except for using a mount mechanism instead of symlinking.)
Or did I miss something about your implementation?

What still confuses me is the following:

It sounds like he didn't benefit from the similarity among all his backups…

Yes, precisely. The subdir was appended after that variable and was the same each time - the only dynamic part was handled by the symlink. So I suspect, as @uok said, restic just saw it as the same directory structure.
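
In other words, the backup line effectively looked something like this ("data" stands in for my real subdirectory name):

# /mnt/rsnapshot is the symlink that is re-pointed on each iteration; because restic
# is given a path *through* the symlink, it archives the real files behind it
time $RESTIC backup --tag "$RESTIC_TAG" --time "$RESTIC_TIME" "$RESTIC_SYMLINK_PATH/data"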

I didn't benchmark my run at the time, so it's dangerous to comment too confidently on performance. After the first snapshot was imported, I remember subsequent ones were definitely faster. I don't know whether that was purely because deduplication meant less data was written (which in itself is a nice speedup), or whether there was more intelligent scanning of the source. I was probably also being impatient (a watched progress bar never completes). On balance, I would trust @uok's current experience a lot more than my own recollection :upside_down_face:.