Implications of different hosts backing up the same folder


#1

Hi All :slight_smile:

I’m building a prototype of a backup system using restic. I need to back up something like this:

/mount1/subfolderA
/mount1/subfolderB
/mount2/subfolderC
/mount2/subfolderD

Due to the high volume of folders/files to back up, I’ve set up N nodes running restic, and each one randomly picks a different path each time (I track the status of each subfolder in a centralized database to know which ones need to be backed up). All mount points are present on all the restic nodes.

I wanted to make the system more scalable, so for now I’ve disabled the cache to make the restic nodes stateless (the results are pretty good even with the cache disabled!), but I have two questions related to this environment:

  • In case I enable the cache, what happens in the following situation?

node1 --> first backup of /mount1/folderA; creates the cache (local)
node2 --> backs up /mount1/folderA; it’s the first time on this node, so it creates its own cache (local)
node1 --> backs up again; there is a local cache, but it doesn’t reflect the latest snapshot. Does this offer any benefit or cause any problems?

The second question is about the hostname, which differs depending on the node that launches the backup, so I get this kind of output from restic snapshots:

e2ceec93  2019-01-16 16:56:51  cbox-restic                /mount1/folderA
d4f85657  2019-01-17 11:27:12  cbox-restic-2             /mount1/folderA
de811864  2019-01-18 09:20:35  cbox-restic               /mount1/folderA

What implications does this have when restoring or applying retention policies? Should I create a tag or manually specify the hostname? I am using one repository per subfolder, so the path will always be the same in each repo.

Thanks a lot for your advice, and congrats on this beautiful work!!!

Roberto


#2

node2 will notice the index files in the repository and will only upload new data, so there should not be any duplicate objects.

If node2 completed the backup, node1 should download the new index files created by node2 and, as before, will not upload duplicate data.

Even if you were using a single repository, the answer would be the same. If the host is not relevant, you can use restic forget --group-by paths (the default is host,paths), so that only the paths are considered when grouping snapshots before the retention policies are applied.
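To make the grouping effect concrete, here is a minimal Python sketch using the three snapshots from the listing in post #1. It is a toy model and not restic's code: the keep_last function is a hypothetical helper that groups snapshots by the chosen fields and keeps the newest n in each group, which is roughly what a keep-last style policy does after grouping.

```python
from collections import defaultdict

# Snapshots from the listing in post #1: (id, time, host, paths)
SNAPSHOTS = [
    ("e2ceec93", "2019-01-16 16:56:51", "cbox-restic", ("/mount1/folderA",)),
    ("d4f85657", "2019-01-17 11:27:12", "cbox-restic-2", ("/mount1/folderA",)),
    ("de811864", "2019-01-18 09:20:35", "cbox-restic", ("/mount1/folderA",)),
]


def keep_last(snapshots, group_by, n=1):
    """Toy policy (hypothetical helper): group first, then keep the newest n per group."""
    groups = defaultdict(list)
    for snap_id, time, host, paths in snapshots:
        fields = {"host": host, "paths": paths}
        key = tuple(fields[name] for name in group_by)
        groups[key].append((time, snap_id))
    kept = []
    for members in groups.values():
        # Timestamps in this format sort chronologically as plain strings.
        kept.extend(snap_id for _, snap_id in sorted(members)[-n:])
    return sorted(kept)


# Grouping by host and paths: each host keeps its own newest snapshot.
print(keep_last(SNAPSHOTS, group_by=("host", "paths")))
# Grouping by paths only: a single group, so only the newest snapshot survives.
print(keep_last(SNAPSHOTS, group_by=("paths",)))
```

With the default grouping, every node that ever ran a backup keeps its own set of snapshots, so a retention policy retains more than you might expect. Grouping by paths treats all nodes as one history.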

Tags are also an option, but don’t seem necessary here.


#3

Thank you a lot for your answer !

So does this mean I would get some benefit from enabling the cache in my environment?

Thanks!


#4

Today I realized that the hostname is also important, because if I back up the same path from another host, it does a full backup again. Passing a fixed hostname with backup --host seems to do the trick, and the backup is treated as incremental. I’m not sure this is the best thing to do, though, because it seems a bit ugly.


#5

No, it doesn’t do a full backup again, but it might look like it does.

No, passing a fixed --host is not the best thing to do, and it’s entirely unnecessary.

All data is deduplicated, even across snapshots from different hosts. Restic does have the concept of a “parent snapshot”, which is used as an optimization to detect files that haven’t changed and avoid processing them at all, but processing an unchanged file doesn’t add data to the repository. Without a parent, restic just spends time chunking and hashing the contents, only to determine that the data is already in the repository and nothing needs to be uploaded.
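A toy illustration of why the hostname doesn't affect what gets stored: the sketch below models a content-addressed repository in Python. Everything here is an assumption for illustration (the ToyRepo class, the tiny fixed-size chunks); restic itself uses content-defined chunking and a much richer repository format. The point is only that chunks are keyed by their hash, so the host that produced them never enters the lookup.

```python
import hashlib

CHUNK_SIZE = 4  # assumption for the demo: tiny fixed-size chunks


class ToyRepo:
    """Hypothetical content-addressed store; not restic's actual format."""

    def __init__(self):
        self.blobs = {}       # sha256 hex digest -> chunk bytes
        self.snapshots = []   # one record per backup run

    def backup(self, host, path, data):
        """Store data as chunks; return how many chunks were newly uploaded."""
        uploaded = 0
        chunk_ids = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in self.blobs:  # the host plays no part in this check
                self.blobs[chunk_id] = chunk
                uploaded += 1
            chunk_ids.append(chunk_id)
        self.snapshots.append({"host": host, "path": path, "chunks": chunk_ids})
        return uploaded


repo = ToyRepo()
first = repo.backup("cbox-restic", "/mount1/folderA", b"aaaabbbbcccc")
second = repo.backup("cbox-restic-2", "/mount1/folderA", b"aaaabbbbcccc")
third = repo.backup("cbox-restic", "/mount1/folderA", b"aaaabbbbccccdddd")
print(first, second, third)  # 3 0 1
```

The second run comes from a different host and still uploads nothing, because every chunk is already present. It does hash all the data again, though, which is the time cost a parent snapshot avoids.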

Restic chooses the parent snapshot automatically by looking for the most recent snapshot with the same hostname and path set, but you can force it to pick a different parent snapshot with --parent, which is the right way to do what you’re trying to do. This gives you the optimization without putting bad data (incorrect hostnames) in your repository.
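The selection rule described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not restic's implementation: pick_parent and the snapshot dictionaries are hypothetical, and forced_parent stands in for what you would pass with --parent.

```python
def pick_parent(snapshots, host, paths, forced_parent=None):
    """Return the id of the parent snapshot, or None if there is no match."""
    if forced_parent is not None:
        # An explicitly chosen parent overrides the automatic lookup.
        return forced_parent
    candidates = [
        s for s in snapshots if s["host"] == host and s["paths"] == paths
    ]
    if not candidates:
        return None  # no parent: every file is read and hashed again
    # Timestamps in this format sort chronologically as plain strings.
    return max(candidates, key=lambda s: s["time"])["id"]


snapshots = [
    {"id": "e2ceec93", "time": "2019-01-16 16:56:51",
     "host": "cbox-restic", "paths": ["/mount1/folderA"]},
    {"id": "d4f85657", "time": "2019-01-17 11:27:12",
     "host": "cbox-restic-2", "paths": ["/mount1/folderA"]},
]

# Same host as an earlier run: its own latest snapshot is used,
# even though another host made a newer one.
print(pick_parent(snapshots, "cbox-restic", ["/mount1/folderA"]))
# A host that has never backed up this path finds no parent.
print(pick_parent(snapshots, "cbox-restic-3", ["/mount1/folderA"]))
# Forcing the parent gives any node the optimization.
print(pick_parent(snapshots, "cbox-restic-3", ["/mount1/folderA"],
                  forced_parent="d4f85657"))
```

In a setup with a centralized database, storing the id of the last snapshot per subfolder and passing it as the parent would give every node the speedup while keeping the real hostnames in the repository.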

But no matter what you do, restic is going to deduplicate any new data. In other words, the end result whether restic uses a parent snapshot or not will be the same, but with a parent snapshot the backup might complete faster.