Unneeded rescanning of files

Hi everyone,

These are my very early days of using restic. I came across an unexpected behaviour and wanted to share my observations.

I was backing up some subdirectories of the main directory I intend to back up, just to gain confidence and check performance. Eventually, I moved on to backing up the main directory itself. What I discovered was that the files from the previous backups were unexpectedly scanned again.

Steps to reproduce this behaviour:

1. restic -r C:\restic\repo1 --verbose backup C:\main_backup_dir\subdir1\
2. restic -r C:\restic\repo1 --verbose backup C:\main_backup_dir\subdir2\
3. restic -r C:\restic\repo1 --verbose backup C:\main_backup_dir\subdir3\
4. restic -r C:\restic\repo1 --verbose backup C:\main_backup_dir\

In step 4, restic will rescan all the files already backed up in backups 1, 2 and 3, which is suboptimal, in my view. Perhaps there is a good reason for this, in which case I would appreciate if someone could point it out. Thank you.

Version used: restic 0.18.0 compiled with go1.24.1 on windows/amd64

See Backing up — restic 0.18.0 documentation, then use the --parent option to the backup command; in your example above, give it the snapshot ID of the largest of the existing snapshots.
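
For example, using your repository and paths, with a made-up snapshot ID (1a2b3c4d) standing in for the snapshot you pick as parent:

restic -r C:\restic\repo1 --verbose backup --parent 1a2b3c4d C:\main_backup_dir\

With --parent set, restic compares each file it encounters against that snapshot's metadata and skips re-reading files that appear unchanged.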


@rawtaz - thank you for your kind and informative comment. I think this means that there is no “universal” knowledge (across all snapshots) of file metadata. If the backup path is a superset of existing snapshots, restic has no reference point(s) and treats each file it encounters as something it sees for the first time.

The ability to look up each file being backed up against all existing snapshots to check if it has been seen before sounds like a departure from restic’s philosophy.

Wrong. There is, but it is not enough to decide whether a given file is already in the repository without scanning it first, unless you (or restic itself) provide a “hint” in the form of a parent snapshot.

Restic can NOT blindly assume that files with the same path, size and mtime are identical. That could easily lead to restic not backing up some data, or even to data loss. You have to add a checksum to be sure.

Using parent snapshots provides a mechanism to make such decisions faster (without a checksum), and it is good enough for real-life backup behaviour, where the same path leads to the same data.

So either use parent snapshots or let restic do its job and scan what it needs.
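
For completeness: if you ever do want restic to re-read file contents even when path/size/mtime match the parent, the backup command has a --force flag that overrides the parent-based shortcut. Using the repository and path from the first post:

restic -r C:\restic\repo1 --verbose backup --force C:\main_backup_dir\

Every file's contents are then read and hashed again, regardless of what the parent snapshot says.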

I'm not sure what, in your opinion, this philosophy is, or when restic departed from following it… or rather, what you imagine restic's philosophy to be :)

Hi @kapitainsky, thank you for your comment. I appreciate your time and opinion.

Let’s suppose there are 3 equally-sized large subdirectories (say, 400GB each - my situation) of the main backup directory, plus some smaller subdirectories. Suppose I back the large subdirectories up individually, as I do not feel comfortable kicking off one large 1.2TB+ backup. Now, I have gained enough confidence, and I am ready to back up the entire main 1.2TB directory. According to the documentation shared by @rawtaz, I have a choice of specifying ONE parent snapshot. This invariably means that I am making restic rescan 800GB (1.2TB total minus the 400GB covered by the one snapshot used as parent) worth of files that were backed up hours ago and have not changed since.

I do wholeheartedly agree that a hash of an entire file is critical as the final authority on detecting changes. The point here is a little different. Calculating a checksum on a file restic has already seen (the same path, size, mtime/ctime, inode) would be prohibitively expensive on the I/O and compute side. According to @rawtaz’s link, restic does not fall back to a checksum in that case, just like many other backup/copy-related tools, such as robocopy. It is a defensible tradeoff between conserving time/resources and detecting file content changes.

Agreed

I never said it has departed from its own philosophy. What I had in mind is that restic seems to follow a strict one-to-one relationship between a “parent” and a “child” snapshot. In cases where there are several large “parents” to choose from, this leads to an unavoidable tradeoff that I have described at the top of this reply. I am perfectly happy to accept this, as I respect that approach, even though I think it could be further optimised. Maybe there is no real need for that in the community, and I am OK accepting that, too.

You make valid observations and I agree with your reasoning, but it is an academic discussion :) There are many ways to skin a cat, and the restic authors decided on their approach.

From a practical point of view, you could use tagging and group snapshots by tag. This way you could incrementally add more paths and always maintain parent/child inheritance, avoiding rescanning files that are already backed up.
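
Something along these lines, using the repository and paths from the first post (the tag name “mainset” is made up; --group-by on the backup command controls which group of earlier snapshots is searched for the parent, so grouping by host and tags instead of by paths lets each run find the previous one as parent even though the path set changes):

restic -r C:\restic\repo1 --verbose backup --tag mainset --group-by host,tags C:\main_backup_dir\subdir1\
restic -r C:\restic\repo1 --verbose backup --tag mainset --group-by host,tags C:\main_backup_dir\subdir1\ C:\main_backup_dir\subdir2\
restic -r C:\restic\repo1 --verbose backup --tag mainset --group-by host,tags C:\main_backup_dir\subdir1\ C:\main_backup_dir\subdir2\ C:\main_backup_dir\subdir3\
restic -r C:\restic\repo1 --verbose backup --tag mainset --group-by host,tags C:\main_backup_dir\

Each run adds paths on top of what the previous tagged snapshot already contains, so only the newly added subdirectory needs to be read; nothing already backed up should be rescanned.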

What is the actual practical problem here? Normally you’d do this only once: when you want to create the first backup of some data set, you risk having the backup process interrupted by external factors, and for that reason you want to do it in chunks so that restic does not have to re-scan all the files in the backup set before the first backup has completed. Unless you have some specific reason, this is not something you need to do very often.


Hi @rawtaz, there is no practical problem here at this point. It was more of a general observation/question, or, as @kapitainsky rightly put it, an academic discussion.
