Restic indexing extremely slow on external 4TB disk

ezzra · September 25, 2024, 7:26am

I have an external 4TB HD and started to back up the disk. Of course everything is going to be slower, but just the indexing of the data takes 10h! And not only the initial indexing: every time the backup is disrupted and I need to start it again, it takes another 10h just for indexing. I wonder why caching does not work here? I am still in the initial backup, at 50% now.

But I am afraid that every time I want to create new snapshots, even after the initial backup is done, I will first have to index for 10h.

akrabu · September 25, 2024, 5:34pm

Just throwing this out there but… could it be you have it connected via a USB 2.0 port / cord? One way to tell would be to do a read speed test, and if it’s ~37MB/s or so, that’s likely the issue.
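A rough way to run such a read speed test is sketched below. It is only an illustration: the function name `read_throughput_mb_s` and its parameters are made up for this example, and the result is only meaningful for a large file on the external disk that has not been read recently (otherwise the operating system's file cache inflates the number).

```python
import time

def read_throughput_mb_s(path, block_size=4 * 1024 * 1024, max_bytes=1024 ** 3):
    """Read up to max_bytes sequentially from path and return the rate in MB/s.

    Hypothetical helper for illustration; not part of restic.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            block = f.read(block_size)
            if not block:
                break  # end of file
            total += len(block)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total / (1024 * 1024) / elapsed
```

Calling it on a multi-gigabyte file stored on the external disk and getting a value in the 30-40 MB/s range would be consistent with a USB 2.0 bottleneck.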

That said, until the initial snapshot is created, it has to re-read everything. Only when you already have a full snapshot, can it “skip” what has already been backed up. So… try to not interrupt it, and once it’s finally done… it will be MUCH faster going forward, from that point on.

ezzra · September 26, 2024, 7:11am

akrabu wrote: “That said, until the initial snapshot is created, it has to re-read everything.”

Why is that? What is different with a snapshot? Before that there are only data chunks, and then they are organised together into files?

kapitainsky · September 26, 2024, 9:22am

All files have to be read for the first snapshot. And if you have many small files, the result is disastrous speed - that is how mechanical HDDs are.

Subsequent snapshots (assuming that only a small amount of data is added or changed) require reading only the new changes. This is why they will be much faster.

Only metadata is cached. If the data itself has to be read, it will only ever be as fast as your full hardware stack allows.

So, as mentioned, be patient. Do not disrupt the first snapshot creation.

akrabu · September 27, 2024, 3:01pm

Sort of, yes. Files aren’t indexed / referenced until a snapshot is created. Until then, it’s just a bunch of chunks described by a hash. Any one of those chunks could belong to numerous files all by itself (this is how the deduplication happens). Without a snapshot, you have to read everything, compare the chunk hashes, and only write the ones that are missing.

With a snapshot describing what the hashes actually belong to, it can see very quickly what files have been added, deleted, or modified, and only do a full read on those files. Otherwise, you’re just throwing more hashes in the pile and never describing what they actually are.
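The "read everything, compare the chunk hashes, and only write the ones that are missing" part can be sketched roughly as follows. This is a toy model under stated assumptions: `backup_blob` and the dict-based `repo` are hypothetical names, and the chunks are cut at a fixed size to keep the example short, whereas restic cuts chunks at content-defined boundaries.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, for illustration only

def backup_blob(data, repo):
    """Split data into chunks and store only chunks the repo does not have yet.

    repo is a plain dict mapping chunk hash -> chunk bytes (a stand-in for
    the repository index). Returns (list of chunk ids, number of new chunks).
    """
    ids = []
    written = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in repo:
            repo[chunk_id] = chunk  # "upload" the missing chunk
            written += 1
        ids.append(chunk_id)
    return ids, written

repo = {}
first_ids, first_written = backup_blob(b"aaaabbbbaaaa", repo)   # 2 new chunks
second_ids, second_written = backup_blob(b"bbbbaaaa", repo)     # 0 new chunks
```

Note that the second call writes nothing, but it still had to read and hash all of its input to find that out. That is the cost a snapshot lets restic avoid for unchanged files.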

ezzra · October 3, 2024, 10:28am

I still don't understand: how does restic check everything once the snapshot is created? I can't imagine that it creates chunk hashes of all local data again (I guess that is what made it so slow in the beginning). So once the first snapshot is created, does it check metadata like file paths, modification dates, size etc. and only then check the specific chunks?

So if I had a file with the same size, same path, same (manipulated) mtime, but different content, would restic even recognize that as a change?

ezzra · October 3, 2024, 10:40am

OK, found the answer myself here: “Restic uses `mtime` to detect file changes, which can miss changes.” (Issue #2179, restic/restic, GitHub). Restic actually does it like other backup tools: it checks “only” the mtime, and if that does not match, it rereads the file. Good to know. I missed that in the documentation even though I was looking for it; I only found info about the chunk hashes etc. Is there documentation about the specific procedure that restic uses to check and create snapshots etc.?
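A minimal sketch of this kind of metadata check, and of how it can miss the manipulated-mtime case asked about above, might look like this. The helper names `snapshot_entry` and `looks_unchanged` are invented for the example, and only size and mtime are compared, as discussed in this thread; restic's real check may consider further attributes.

```python
import os

def snapshot_entry(path):
    """Record the metadata a (hypothetical) parent snapshot would remember."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}

def looks_unchanged(path, parent_entry):
    """True if size and mtime match the parent entry, so the file is not reread."""
    st = os.stat(path)
    return (st.st_size == parent_entry["size"]
            and st.st_mtime_ns == parent_entry["mtime_ns"])

def overwrite_keeping_mtime(path, new_content):
    """Replace the file content, then restore the old timestamps (the 'manipulated mtime' case)."""
    st = os.stat(path)
    with open(path, "wb") as f:
        f.write(new_content)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

If a file is rewritten with `overwrite_keeping_mtime` using content of the same length, `looks_unchanged` still returns `True`, so a tool relying only on these two attributes would skip it. Any change in size or mtime makes it return `False` and triggers a full reread.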

ezzra · October 5, 2024, 4:20pm

And me again. I had to move a lot of stuff on the external disk and have now started to back up a new folder: same data, but everything renamed. Restic needs to index everything again for about 10h; that's OK, I expected that. But after that it is still taking a lot of time with… I don't know, uploading? I have the “scan finished” line and below it a percentage, a number of files and how many terabytes, and those numbers again as totals.

So, what did restic actually do in the 10h scan? I thought it would hash the chunks and would now know that all the chunks are already available on the remote, so it would “only” have to rebuild the file structure. But as it takes quite some more time and the ETA is at 18 hours, what is it actually doing now? Maybe the “scan” is not the hashing, but just scanning metadata to check which files need to be hashed?

Again, is there more documentation about the process/algorithm that runs during a backup? I have seen the restic Design Document in the restic documentation, but as it explains the repo structure, it does not give much detail about the steps taken while running a backup.

damoclark · October 6, 2024, 5:32am

Hi @ezzra

Without providing Restic output and specific commands used, you make it difficult for people to answer your questions.

If you mean that the folder where your files were located last time you did a backup changed for your 10h backup, then I’d say that a parent snapshot could not be determined, and so restic had to read every file and compare its chunks with the pack indices. This is what I expect took 10hrs, although without the output, I can’t see whether it specified “no parent” or not.

In this case, if a chunk already exists in the repo, based on the index, then it is not uploaded again. But it still needs to be read, and hashes calculated to check.

ezzra · October 6, 2024, 12:13pm

Unfortunately I could not copy the restic output (I guess because it's always overwritten), which is why I tried to explain what I was seeing on the console (I will make a screenshot next time).

However, that restic needs to rescan because it cannot determine a parent snapshot is fine; I expected that. And the scan was finished after 10h, so I got the “scan finished” output.
What is unclear to me: what exactly does restic do when “scanning”? Is it only listing files and checking metadata and things like hardlinks etc.? And does it hash chunks, or is this not part of “scanning”?

Then, what does it do after the scan is finished? Does it only now start to hash all the chunks from files it considers new (which would have been every file in my case)? Or did it do that already, while scanning?

BTW, my config is quite boring:

#resticprofile
disks:
  initialize: false
  repository: repo-path
  password-command: password-command
  verbose: true

  backup:
    no-error-on-warning: true
    exclude-caches: true
    exclude-file: path-to-exclude-file

damoclark · October 7, 2024, 12:14am

This is exactly what we need to help answer your questions. Not boring at all.

Now I understand why you are asking about “scanning”. You have verbose output turned on.

Here are the abridged steps that Restic performs when doing a backup. This is oversimplified, to focus on your question about scanning.

When restic starts:

1. It loads an index of all the data chunks stored in the repository.
2. It identifies (if it can) a parent snapshot, and loads an index of all the files and folders that are part of that snapshot, along with their sizes and last modified timestamps.
3. Then, it “scans” your source drive to generate a list of all its current folders and files, their sizes and their last modification timestamps.
4. It compares each file and folder from your source drive with those from the parent snapshot index to identify changes. If:
   a. it’s a new file; or,
   b. its size and/or modification timestamp differ,
   then it gets added to a list of files to be backed up.
5. Files and folders flagged in this list are then chunked and compared with the chunk index. If:
   a. the chunk already exists in the repo, it is skipped;
   b. otherwise, the chunk is added to a pack file, stored in your restic cache.
6. Once a pack file reaches the desired size, it is uploaded to the repository.

Because you changed the location of the files you are backing up, Restic sees it as a new backup - the new path doesn’t match any previous snapshots.

Without a parent snapshot, step 4 is skipped, and all files from step 3 are flagged for chunking and comparison with the chunk index. This is why your backup took so long. It has to read every single file on your disk drive.
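To make the effect of a missing parent concrete, here is a small sketch of the selection logic from steps 3 and 4. It is a simplification: the function `files_to_read` and the `(size, mtime)` tuples are hypothetical stand-ins for what restic keeps internally.

```python
def files_to_read(scanned, parent=None):
    """Return the paths that must be fully read and chunked.

    scanned: dict mapping path -> (size, mtime) from the scan of the source drive.
    parent:  the same mapping taken from the parent snapshot, or None if no
             parent snapshot could be determined.
    """
    if parent is None:
        # No parent: the comparison is skipped and every file is flagged.
        return sorted(scanned)
    # With a parent: flag only new files and files whose metadata differs.
    return sorted(path for path, meta in scanned.items()
                  if parent.get(path) != meta)

parent = {"/old/a": (10, 100), "/old/b": (20, 200)}

# Same location, one file modified: only that file is reread.
same_place = {"/old/a": (10, 100), "/old/b": (25, 300)}
changed = files_to_read(same_place, parent)   # ["/old/b"]

# Same data under a new path, no matching parent: everything is reread.
moved = {"/new/a": (10, 100), "/new/b": (20, 200)}
all_files = files_to_read(moved)              # ["/new/a", "/new/b"]
```

Even if the old snapshot were used as the parent, the renamed paths would not match any of its entries, so every file would still be flagged as new.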

So you were asking about “scanning”. This is step 3. How long step 3 takes, depends on how many files and folders you have. But Restic doesn’t wait until the scanning is complete before proceeding.

Restic is written in a programming language called “Go”, which has excellent support for concurrent execution. What this means is that Restic does multiple things at the same time. In fact, Restic is performing step 3, step 4, step 5, and step 6 all at the same time. The lists that are created in step 3, and step 4 are actually queues. Step 3 adds files and folders to the end of the queue, and step 4 pulls items from the beginning of the queue. Similarly, Step 4 then uses another queue to pass files to Step 5 so Step 5 can chunk them up.
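The queue idea can be illustrated with a small sketch. This is not restic's code: restic is written in Go, while the example below uses Python threads and `queue.Queue` purely to show the pattern, and all names (`scanner`, `comparer`, `reader`, `run_pipeline`) are invented for the illustration.

```python
import queue
import threading

DONE = object()  # sentinel that tells the next stage there is no more work

def scanner(items, out_q):
    """Step 3: emit (path, metadata) pairs as they are discovered."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)

def comparer(in_q, out_q, parent):
    """Step 4: forward only new or changed files to the next stage."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)
            return
        path, meta = item
        if parent.get(path) != meta:
            out_q.put(path)

def reader(in_q, processed):
    """Step 5 (stand-in): 'read and chunk' each flagged file."""
    while True:
        path = in_q.get()
        if path is DONE:
            return
        processed.append(path)

def run_pipeline(scanned, parent):
    # Small bounded queues: each stage starts working before the previous
    # one has finished, instead of waiting for a complete list.
    scan_q = queue.Queue(maxsize=2)
    work_q = queue.Queue(maxsize=2)
    processed = []
    threads = [
        threading.Thread(target=scanner, args=(list(scanned.items()), scan_q)),
        threading.Thread(target=comparer, args=(scan_q, work_q, parent)),
        threading.Thread(target=reader, args=(work_q, processed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

With `scanned = {"a": (1, 1), "b": (2, 2), "c": (3, 3)}` and `parent = {"a": (1, 1)}`, `run_pipeline` returns `["b", "c"]`. The scanner can finish long before the reader does, which matches seeing “scan finished” while the backup keeps running.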

Once step 3 is complete, and all files and folders have been “scanned”, that task ends, and only steps 4, 5, and 6 continue until everything is complete.

And the magic ends!