Avoid duplication when moving folders/directories

This question is similar to this post. I want to make sure I have understood the answer correctly before acting.

For some reason, I want to move the contents of a folder to a different folder. I am afraid restic would think that the files from the SOURCE directory were deleted and that new files appeared in the TARGET directory.

If that happens, my backup location would use double the space, because restic would not know that the files from the SOURCE directory are the same as the ones now in the TARGET directory.

My understanding of the answer in the post mentioned above is that restic would somehow detect that the files from SOURCE are the same as the ones in TARGET, so there would be no duplication and only new and modified files would increase the size of my backup storage.

Is my understanding of the answer in that post correct? Can I move files and trust restic to work out by itself that the files are the same, so that they are deduplicated as usual and not stored twice?

restic does not understand anything - it just checks whether the file content you are backing up is already in the repo or not. If it is, it only saves new metadata and that’s it: the metadata for SOURCE will be updated, and the metadata for DESTINATION will be updated.

This is what content deduplication is about.

So, long story short: you can move/copy your files around without any worry that they will be backed up twice (apart from metadata).
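For example, a minimal way to check this yourself after a move, with placeholder paths and repository settings (the exact wording of the summary printed at the end varies between restic versions):

```sh
# Placeholder repository location and password file.
export RESTIC_REPOSITORY=/mnt/backup/restic-repo
export RESTIC_PASSWORD_FILE=~/.restic-password

# Move the folder locally (placeholder paths).
mv ~/data/SOURCE ~/data/TARGET

# Back up the parent directory again. The summary at the end reports
# how much data was actually added to the repository; after a move it
# should be a small amount of metadata, not the size of the moved files.
restic backup ~/data
```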

In fact “understand” was not a good word in this context. But this is a very interesting feature I was not expecting restic to have.

Thank you for your answer.


My understanding is slightly different from yours. Restic checks whether a file with the same directory, name and size (and perhaps last change time) has already been backed up. If not, it will compress and encrypt each chunk and then try to store it. If that chunk already exists in the repository, it does not need to be stored again; restic just uses a reference to the existing storage.
This means that if you move files from directory1 to directory2, on the next backup the files look like they have not been backed up yet, so restic will compress and encrypt them, and only then will it find them in the repository and avoid duplicating the space. This may matter if you have a lot of files affected by directory changes.
This is different from the other explanation because restic still does the compression and encryption for the moved files, it just does not store them again. The backup after that will find the directory2 files in the repository and will not do any more compression, encryption or storage.
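If you want to see this in practice, you can compare the snapshots taken before and after the move with restic diff (the snapshot IDs below are placeholders):

```sh
# List snapshots to find the IDs from before and after the move.
restic snapshots

# Compare the two snapshots (abc12345 and def67890 are placeholders).
# The moved files show up as removed in the old location and new in
# the new one, but the amount of added data should be tiny compared
# to the size of the moved files (mostly new tree/metadata blobs).
restic diff abc12345 def67890
```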


This is exactly what I meant :) Sorry for not being clear enough. No data will be duplicated, only the information about the file name, its location, etc. (metadata).

Thank you for your clarification. As I understand it, the conclusion is that the first backup after moving the files will take time, but not much new space will be used.

Yes. It will take more or less the time needed to read these files from your local disk, plus some tiny new data written to the repository (metadata). So it is many times faster than if the files were something entirely new.
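If you want actual numbers, restic stats in raw-data mode reports the raw size of the data stored in the repository; comparing it before and after the post-move backup shows how little was added (paths are placeholders, and the output format may vary between versions):

```sh
# Raw size of all data stored in the repository, before the post-move backup...
restic stats --mode raw-data

# ...run the backup after moving the files...
restic backup ~/data

# ...and check again. The difference should be a small amount of
# metadata, not the size of the moved files.
restic stats --mode raw-data
```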

Hello @kapitainsky.

If it is, it only saves new metadata and that’s it: the metadata for SOURCE will be updated, and the metadata for DESTINATION will be updated.

Thank you for the explanation, that’s helpful. Out of curiosity, is the amount of duplicated information (metadata, as you call it) proportional to the size of the directory tree (not its file contents)? I.e. would a deep and complicated directory tree (with zero-size files) cause more overhead when backed up after a move than a fairly simple tree?

more objects = more metadata

But unless you have millions of files and directories, the metadata size is very small.

I would not waste time trying to reduce the number of directories in order to save a tiny amount of backup space.
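As a rough sanity check, you can count the objects in the tree you moved; the metadata written after a move scales with these counts, not with the total size of the file contents (the path is a placeholder):

```sh
# Count files and directories in the moved tree (placeholder path).
find ~/data/TARGET -type f | wc -l
find ~/data/TARGET -type d | wc -l
```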

Thanks. Basically, after I move the folder and back it up, my fans start spinning faster and restic goes off and starts doing a lot of stuff (at this point it’s been 10 minutes). So I start worrying how much more data will be added at the end, and whether I should resize the backup partition, etc. (Kilobytes? Megabytes? Gigabytes? Does it depend on the complexity of the file tree?)

If I understand you correctly, the file tree metadata is duplicated, but it’s negligible overall (even in, e.g., a tree with lots of git repos and build artifacts).

I just realized I’m rehashing what @punchcard discussed. :slight_smile: Sorry for not reading the thread carefully.

Your high CPU usage is caused by the need to read all the data again (and do the hashing and all the dedup magic), not by data transfers.

It depends on the number of objects. It will be very different for 10 files than for 10 million files.
