Trying to understand how hard links are handled by restic

I’ve been using restic for years on my computer and I’m impressed by how well it works. Now I have come across a situation where restic could help me again, but so far I haven’t been able to figure out whether I’m using it correctly.

For context, the new situation is a ClickHouse instance with a few TB of data. ClickHouse uses hard links extensively to create backups, which we then rsync -Hav --delete to an NFS share. So far so good.

Now we’d like to send the backups to S3 as well, and that’s where we thought of restic. I have looked at issues and other forum questions, so I am pretty sure restic supports hard links. I tried backing up a hard link on my computer, and it worked as advertised. By the way, we’re running the latest version of restic.

Unfortunately, when I tried to run restic against our ClickHouse backups it showed a grand total of 35 TB to process, which is way more than what we have (I did the math a few times, and with a colleague, so I am confident we do not have 35 TB of data :sweat_smile: ).

Anyway, I let restic run for a while, hoping the 35 TB were just a display fluke. However, after a few hours restic had uploaded more data to S3 than we have, so at that point I was sure the 35 TB were not a display fluke, but rather hinted at an issue with our restic usage. I stopped the backup.

The backup command is fairly simple and the target repo is created from scratch just for this purpose:

restic -r s3:s3.amazonaws.com/our-bucket/clickhouse --verbose backup /var/lib/clickhouse/backup

I have seen --ignore-ctime suggested in another post, but this is the first backup, so I don’t think it will have any effect in my case.

What am I doing wrong, or am I trying to use restic in a situation it isn’t suited for?

Thanks in advance,
Umberto

I am surprised by the behavior you described.

I have a server with a lot of hard links; I would say that 200 GB of files have hard links. Restic handles that quite well. In fact, when I watch the backup with -v, the amount of data is estimated way too high, just as you describe. But then, when restic reaches the position where the hard links are, it does a “200 GB jump” in a second. The amount of data in the repo also fits the picture.

So it would be interesting to know what the difference is in your use case.

Very interesting. So maybe I should’ve just let restic continue, ignored the fact that it reported transferring more data than we have, and checked the size of the S3 bucket instead? Perhaps I jumped the gun… I’ll run another test tomorrow. Thanks for taking the time to reply :heart:

You are welcome.
Unfortunately I don’t have as much knowledge as others here about what restic is doing under the hood, so I can only describe what I experience.
Plus: even if restic were to follow the hard links and treat them as separate files, the content should be perfectly deduplicated. So I can’t see any reason why your repo should get bigger than your source.

Let us know how your further tests go.

I decided to run a test on my computer to better understand restic’s behavior. TL;DR: it works as advertised, minus the confusing total estimate, which does not take hard links into account.

I created a bunch of hard links on my computer:

➜  Desktop mkdir hardlinks
➜  Desktop cd hardlinks 
➜  hardlinks dd if=/dev/urandom of=file1 bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.12821 s, 49.3 MB/s
➜  hardlinks for i in {2..10} ; do ln file1 file1_hardlink_$i ; done
➜  hardlinks ll
total 1001M
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_10
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_2
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_3
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_4
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_5
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_6
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_7
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_8
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_9

I’ll now try to upload them to S3 and observe the behavior. If all goes as intended, the final S3 bucket size should be ~100 MB. Let’s see.

➜  hardlinks restic -r s3:s3.amazonaws.com/mybucket/umberto/hardlinks init                               
created restic repository d9083000f7 at s3:s3.amazonaws.com/mybucket/umberto/hardlinks
                                                                                                                      
Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.

Here is the backup output:

➜  hardlinks restic -r s3:s3.amazonaws.com/mybucket/umberto/hardlinks --verbose --verbose backup $PWD 2>&1 | tee ../restic_backup.log
open repository
created new cache in /home/umberto/.cache/restic
lock repository
load index files
no parent snapshot found, will read all files
start scan on [/home/umberto/Desktop/hardlinks]
start backup on [/home/umberto/Desktop/hardlinks]
scan finished in 0.550s: 10 files, 1000.000 MiB
new       /home/umberto/Desktop/hardlinks/file1, saved in 680.670s (10.205 MiB added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_2, saved in 681.028s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_3, saved in 0.738s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_4, saved in 0.725s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_5, saved in 0.689s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_6, saved in 0.694s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_7, saved in 0.689s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_8, saved in 0.738s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_9, saved in 0.752s (0 B added)
new       /home/umberto/Desktop/hardlinks/file1_hardlink_10, saved in 843.410s (89.795 MiB added)
new       /home/umberto/Desktop/hardlinks/, saved in 843.412s (0 B added, 45.666 KiB metadata)
new       /home/umberto/Desktop/, saved in 843.412s (0 B added, 384 B metadata)
new       /home/umberto/, saved in 843.412s (0 B added, 381 B metadata)
new       /home/, saved in 843.412s (0 B added, 380 B metadata)

Files:          10 new,     0 changed,     0 unmodified
Dirs:            4 new,     0 changed,     0 unmodified
Data Blobs:     65 new
Tree Blobs:      5 new
Added to the repo: 100.046 MiB

processed 10 files, 1000.000 MiB in 14:19
snapshot cd863199 saved

So it looks like the backup worked as expected; the S3 console confirms the same size :+1:
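
For anyone repeating this test: the repository size can also be checked from the client side, without the S3 console. A minimal sketch using restic’s stats command (same repository as above):

# Reports the size of the blobs actually stored in the repository,
# i.e. the post-deduplication size.
restic -r s3:s3.amazonaws.com/mybucket/umberto/hardlinks stats --mode raw-data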

While I was there I ran a restore as well:

➜  Desktop restic -r s3:s3.amazonaws.com/mybucket/umberto/hardlinks --verbose --verbose restore latest --target hardlinks_restore
repository d9083000 opened successfully, password is correct
restoring <Snapshot cd863199 of [/home/umberto/Desktop/hardlinks] at 2021-04-08 08:10:19.442100909 +0200 CEST by umberto@ultra> to hardlinks_restore
➜  Desktop ll hardlinks_restore/home/umberto/Desktop/hardlinks      
total 1001M
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_10
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_2
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_3
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_4
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_5
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_6
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_7
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_8
-rw-rw-r-- 10 umberto umberto 100M Apr  8 08:02 file1_hardlink_9
➜  Desktop du -hs hardlinks_restore/home/umberto/Desktop/hardlinks
101M    hardlinks_restore/home/umberto/Desktop/hardlinks
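
As an extra sanity check (a sketch, not part of the original transcript): hard links share an inode, so every restored name should report the same inode number and a link count of 10.

# GNU stat: %i = inode number, %h = number of hard links, %n = file name.
stat -c '%i %h %n' hardlinks_restore/home/umberto/Desktop/hardlinks/*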

All looks good! I’m going to retry and be more patient next time :crossed_fingers:

Update: the restic backup has been running for a few hours and it appears that my previous concerns were greatly exaggerated. While restic says it has uploaded > 4 TB, S3 shows a much lower number, which is within the actual data volume we have.

Thanks!

Note that even if hard links were not supported, there is deduplication and so the contents would only be saved once. Hard link support is important during restore so that the hard-linked status of the files can be properly reconstructed, but is not required to realize space savings of duplicate files.
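
To illustrate with a hypothetical variation of the test above (a sketch, not something that was actually run): replace a hard link with a regular copy, which gets its own inode but has identical content, and deduplication should still add almost nothing new to the repository.

# file1_copy is a hypothetical duplicate with a different inode.
cd ~/Desktop/hardlinks
cp file1 file1_copy
# The copy's content chunks already exist in the repository, so the backup
# should report roughly 0 B of new data for it (only metadata is added).
restic -r s3:s3.amazonaws.com/mybucket/umberto/hardlinks --verbose backup $PWD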

Understood, that would explain the behavior. Thanks!

Slightly OT: could deduplication be the reason why the upload is going so slowly? I mean, does restic have to go back and forth to S3 to fetch the files’ hashes? (We have a 1 Gb direct connect to S3, so network is not an issue.)

edit: we have ~ 500000 files

No. In the repo, every file is named after its hash. And that information is in the cache.
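
In other words (a rough sketch of the idea): blobs are packed into files whose names are SHA-256 hashes of their contents, and the index describing which blobs already exist is kept in the local cache, so restic does not need round trips to S3 to decide whether a chunk is already stored.

# The local cache lives under ~/.cache/restic by default, with one directory
# per repository ID; the files inside are hex-named, i.e. content-addressed.
find ~/.cache/restic -type f | head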

The most likely bottleneck is local I/O. Second most likely is network latency (not throughput).
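
If you want to rule out local I/O first, here is a rough sketch (using the backup path from the opening post) of a sequential-read throughput test against the NFS mount restic reads from:

# Picks one large file from the backup directory and measures how fast it can
# be read sequentially over NFS.
dd if="$(find /var/lib/clickhouse/backup -type f -size +100M | head -n 1)" of=/dev/null bs=1M status=progress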

I thought about that; for the record, restic is reading from NFS. rsync takes 5 minutes to compare the copy on local storage vs. the replica on NFS (the same NFS share restic later reads from). Latency to AWS is <1 ms. I suppose S3 will add some latency of its own, but then again most (90%) of the files should not be uploaded at all since they are hard links :thinking:

I’ll try running some trace next week. Thanks again for chiming in :slight_smile:

Most likely, rsync is able to compare only metadata. If you have not completed a backup with restic yet, restic has to read and hash the contents of all files. Subsequent backups can use a metadata comparison similar to rsync’s. It’s expected that the first backup run takes a bit longer.
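
A sketch of how to see this in practice (same repository and path as in the opening post): once the first snapshot exists, the next run picks it up as the parent and skips re-reading files whose metadata has not changed.

# List existing snapshots; the newest matching one becomes the parent.
restic -r s3:s3.amazonaws.com/our-bucket/clickhouse snapshots
# A second run should mostly report files as unmodified instead of re-reading
# and re-hashing them (unless their metadata, e.g. ctime, has changed).
restic -r s3:s3.amazonaws.com/our-bucket/clickhouse --verbose backup /var/lib/clickhouse/backup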

That makes sense. Actually, in the first run, when the target is empty, rsync will just send everything without comparing or traversing the whole tree first. If restic needs to read and hash every file, that would explain why it takes its time working through several dozen (apparent) TB of files.

Before I opened this thread I saw: Randomly Needs to Rescan All Data - #35 by fd0. Should I supply --ignore-ctime to avoid the random rescan?

I’d probably only do this if it becomes apparent that you need to. In the absence of weirdly-behaving applications or filesystems (such as FUSE) it’s not necessary.

Thanks. The first backup took ~30 hours. I’ve now launched another backup; this one should be incremental and I expected it to finish much faster, but restic gives me another 31 h estimate to completion :cry:

There were 5 days between the two restic backups and not a lot of changes, certainly not worth another 30 h (like the first full sync).

I stopped restic and re-ran it with --ignore-ctime: we’re at 60% in 90 seconds :tada:

I guess I’ll want to keep this option
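
For reference, a sketch of the full command with that flag (same repository and path as in my first post):

# --ignore-ctime excludes ctime from change detection; mtime, size and inode
# are still compared against the parent snapshot.
restic -r s3:s3.amazonaws.com/our-bucket/clickhouse --verbose backup --ignore-ctime /var/lib/clickhouse/backup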