Suggestion: Add file attribute with hash to avoid rescanning moved files

arberg · July 17, 2025, 10:01am

If I move or restructure my files, restic has to rescan them, which can take a long time. If restic wrote a file attribute containing the file's hash and its last modification time (mtime), we could avoid rereading files whose mtime still matches. I assume restic can look up the needed block hashes from the full file hash.
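A minimal sketch of the idea in Python, not restic code. The attribute name `user.restic.filehash`, the JSON layout, and the extra size check are all assumptions made for illustration. `os.setxattr` and `os.getxattr` are Linux-only.

```python
import hashlib
import json
import os
from typing import Optional

# Hypothetical attribute name; restic defines no such attribute today.
XATTR_NAME = "user.restic.filehash"


def sha256_of(path: str) -> str:
    """Hash the whole file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path: str) -> bytes:
    """Serialize hash, mtime and size. The stat is taken before hashing, so a
    write that lands during hashing leaves a stale mtime in the record."""
    st = os.stat(path)
    record = {"sha256": sha256_of(path), "mtime_ns": st.st_mtime_ns, "size": st.st_size}
    return json.dumps(record, sort_keys=True).encode()


def cached_hash(raw: Optional[bytes], path: str) -> Optional[str]:
    """Return the stored hash if mtime and size still match, else None."""
    if raw is None:
        return None
    st = os.stat(path)
    try:
        record = json.loads(raw)
        if record["mtime_ns"] == st.st_mtime_ns and record["size"] == st.st_size:
            return record["sha256"]
    except (ValueError, KeyError, TypeError):
        pass  # malformed or foreign attribute: treat as a cache miss
    return None


def write_record(path: str) -> None:
    """Linux-only: store the record as an extended attribute."""
    os.setxattr(path, XATTR_NAME, build_record(path))


def read_record(path: str) -> Optional[bytes]:
    """Linux-only: return the raw attribute, or None if absent or unsupported."""
    try:
        return os.getxattr(path, XATTR_NAME)
    except OSError:
        return None
```

A scanner would call `cached_hash(read_record(path), path)` and read the file in full only when that returns `None`.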

It would also make the initial scan much smoother when a full scan is interrupted. Scanning large multi-TB collections can be quite a pain, with the risk of power loss or the process otherwise dying, or the user realising during the scan that the filters are set incorrectly, any of which may happen several times during an initial scan. On each restart restic has to rescan from the beginning and read every file completely, as it does not have a snapshot yet. This problem would be solved, or at least greatly alleviated, by adding hashes to file attributes.

This could be added as an optional feature, like with a new ‘--use-file-attribute-cache’ alongside related options:
--force
--ignore-ctime
--ignore-inode
--with-atime

Since updating file attributes changes ctime but not mtime (on Linux), it would have to imply ‘--ignore-ctime’.
On Windows, mtime is updated when file attributes are updated, so we would have to reset mtime after writing file attributes. This would add a small risk of a race condition, which would have to be documented.
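The reset step could be sketched as follows. `write_metadata` is a hypothetical callback standing in for whatever actually writes the attribute. The comment marks where the race window mentioned above sits.

```python
import os
from typing import Callable


def write_preserving_mtime(path: str, write_metadata: Callable[[str], None]) -> None:
    """Run write_metadata(path), then restore the timestamps seen beforehand."""
    before = os.stat(path)
    write_metadata(path)  # may bump mtime as a side effect
    # Race window: a genuine modification made by another process between the
    # stat above and the utime below would be hidden by restoring the old mtime.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
```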

Of course, when --force is used along with --use-file-attribute-cache, existing file attributes should be ignored, but updated and rewritten.

Additionally, it would make sense to write file attributes only to larger files: scanning a 1 kB file is fast, so it is not worth the time to write and check the file attribute.

–
Additionally, we might also add an option that writes .restic.sha256 files to folders, containing the mtime in a comment like this:

#>1032561628000;40658
5d3835fb350d770816f954246aad31e522cba0291dce306c19fc7225146863cc *syslog-20250714_1752460801

I suspect few would use that, but the advantage is that the user can use these files to verify files directly in their directories, and the checksums are preserved when moving directories across filesystems via Samba. Also, it is very fast to read checksums from one file per directory rather than opening every single file in the directory. If it is an option, I don’t think it would hurt even if few use it.
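For illustration, a small Python parser for the sidecar format shown above. It assumes each `#>` comment belongs to the hash line directly after it and that the hash lines follow the usual `sha256sum` layout. The post only says the comment contains the mtime, so the second number is kept as an uninterpreted field (it may be the file size, but that is a guess).

```python
import re
from typing import Dict, NamedTuple, Optional


class Entry(NamedTuple):
    sha256: str
    mtime: Optional[int]         # first number of the "#>" comment, unit as written
    second_field: Optional[int]  # meaning not stated in the post


_META_LINE = re.compile(r"^#>(\d+);(\d+)$")
_HASH_LINE = re.compile(r"^([0-9a-f]{64}) [ *](.+)$")


def parse_sidecar(text: str) -> Dict[str, Entry]:
    """Map each file name to its hash and the metadata comment preceding it."""
    entries: Dict[str, Entry] = {}
    pending = None
    for line in text.splitlines():
        meta = _META_LINE.match(line)
        if meta:
            pending = (int(meta.group(1)), int(meta.group(2)))
            continue
        hashed = _HASH_LINE.match(line)
        if hashed:
            mtime, second = pending or (None, None)
            entries[hashed.group(2)] = Entry(hashed.group(1), mtime, second)
        pending = None  # a comment only applies to the very next line
    return entries
```

Because the hash lines stay in `sha256sum` format, the same file should remain checkable with standard tools, which skip lines they cannot parse.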

As virvum mentioned in his earlier suggestion, we should (probably) omit restic’s own file attribute when restic stores file attributes, or at least update the file attribute before storing them in the restic backup.

I have been doing this for years with an md5sum tool I created, and it works great. After I renamed a local share containing 8 TB of data, restic had to rescan everything (twice, because my data is linked via both /mnt/disk and /mnt/user), and it took several days, whereas my md5sum tool did it in minutes.

If there is interest in this feature, I could probably give it a try and add it to restic myself.

colorimeter · July 21, 2025, 9:27am

How do you ensure that the contents of the file cannot change once you have saved the hash? Would you also store the hash calculation timestamp?
Even then, you still have to rely on the validity of the metadata. And you have to litter the filesystem with them, plus be careful not to back them up as extra data.

arberg · July 21, 2025, 11:30am

[quote=“colorimeter, post:2, topic:9918, full:true”]
How do you ensure that the contents of the file cannot change once you have saved the hash? Would you also store the hash calculation timestamp?[/quote]

Restic already does this kind of fast scanning and relies on timestamps to avoid reading files fully. That is the whole point: it does not rescan every single file, but instead relies on a file being unchanged if its mtime and ctime are identical to the values in the snapshot. Here I store the same info, so essentially each large file would have a (mostly invisible) file attribute containing the info restic needs to verify that it has already been scanned. As mentioned, we would have to omit ctime, since writing a file attribute updates ctime. See restic’s current scanning documentation: “Backing up” in the restic 0.18.0 documentation.

Yes, in the same way that we rely on the validity of the restic repository. If we wanted to be ridiculously safe, we could sign the file-attribute hashes. I find that ridiculous, as a user is unlikely to tamper with their own data and other programs are unlikely to write a restic file attribute in the proper format, but it is indeed an option if so desired. And I do like the extra security, from a provable-security standpoint.

I would not call writing file attributes littering the filesystem, as they are mostly invisible. Indeed, I would suggest not backing them up, as mentioned. But that’s easy; it would just have to be part of the job.

It does take time to write the file attributes, hence the suggestion to skip writing them on small files. But it is a HUGE time saver on a GB file, and on 10 MB files as well, when it saves us a full scan.

As mentioned, I’ve done this for years in another program quite successfully. I do a yearly forced full scan and log all inconsistencies, and there are almost never any. Well, not since I learned to make TrueCrypt update mtime on save, but that is also why I include logging with the forced scan, so I can see which mismatches there were. Restic could do this too: log any file encountered during a --force rescan whose file-attribute metadata or snapshot metadata indicates the file should be unchanged, yet whose content was modified. Possibly restic already does this on --force, I don’t know.
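A rough Python sketch of that consistency check, assuming the cached metadata is available as a mapping from path to `(mtime_ns, sha256)`. That layout and the function names are hypothetical; this does not describe what restic does on --force.

```python
import hashlib
import os
from typing import Dict, List, Tuple


def sha256_of(path: str) -> str:
    """Hash the whole file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_silent_changes(cache: Dict[str, Tuple[int, str]]) -> List[str]:
    """Return paths whose mtime matches the cache but whose content does not."""
    suspicious = []
    for path, (mtime_ns, old_hash) in sorted(cache.items()):
        try:
            current_mtime = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            continue  # deleted files are not a silent change
        if current_mtime != mtime_ns:
            continue  # metadata already signals a change; a normal scan sees it
        if sha256_of(path) != old_hash:
            suspicious.append(path)  # content changed without an mtime update
    return suspicious
```

Every path returned is a case where the timestamp shortcut would have missed a change, which is what the yearly log is meant to surface.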

MichaelEischer · July 21, 2025, 7:39pm

I don’t particularly like the idea of modifying the dataset that should be backed up.

No it won’t. The problem with the initial scan is that the reference information from the snapshot, which is used to skip unchanged files, simply doesn’t exist. This will only change if there’s some kind of partial snapshot that gets stored somewhere. And just to be clear: that information is too large for file attributes.