Hi! New to restic and using it with the B2 backend. I’m interested in the data deduplication, as I’ll be backing up client computers that sync data amongst themselves (not a traditional client-server setup).
For example: backing up “google drive” accounts where a large number of files are in common via shared folders, but not all of them (users still have some folders just to themselves). And, in this example, Google Drive is used to store PDF and OpenOffice files (not just the links to ‘google calc’ instances).
B2 costs money per GB, so keeping data smaller if possible is advantageous.
I can’t tell from my research whether it is OK and/or advantageous, from a “data storage” perspective, to have multiple clients backing up to the same B2 bucket repository. Will “test.pdf” that exists on multiple computers be stored once because someone else already committed it to the backup? Or will this get too messy amongst clients, not work, and should be avoided?
Last point in my circumstance - I’m rolling out to 5 or 6 users - so not a major enterprise. We do have growing data storage needs, but still are very “small-office.”
We are mixed OS, but with macOS the most prevalent.
Many thanks for any pointers, advice, or further reading.
Hi, and very much welcome to the restic community!
Your suspicion is correct - restic will only store the contents of that identical file once, even if you back it up from multiple clients/sources. Even if only parts of the file would be identical, those parts would be stored just once, as restic inspects chunks of the files when deduplicating.
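To make the chunk-level deduplication a bit more concrete, here is a toy sketch in Python. It is not restic’s code: restic’s chunker and storage format are more sophisticated, and the names ToyRepo and split_chunks, the window size, and the adler32-based cut rule are all assumptions made up for illustration. It only demonstrates the idea that chunks are cut based on content and stored under their hash, so identical chunks from different hosts are kept once.

```python
import hashlib
import random
import zlib

WINDOW = 16  # bytes examined when deciding on a cut point (toy value)
MASK = 0xFF  # cut when the low 8 bits of the window hash are zero


def split_chunks(data):
    """Cut data wherever a hash of the last WINDOW bytes matches a pattern.

    Cut points depend only on nearby content, so inserting bytes at the
    front of a file shifts the later chunks without changing them.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if zlib.adler32(window) & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ToyRepo:
    """Hypothetical stand-in for a shared repository."""

    def __init__(self):
        self.blobs = {}      # chunk hash -> chunk bytes, stored once
        self.snapshots = []  # (host, file name, list of chunk hashes)

    def backup(self, host, name, data):
        ids = []
        for chunk in split_chunks(data):
            blob_id = hashlib.sha256(chunk).hexdigest()
            self.blobs.setdefault(blob_id, chunk)  # skip chunks already stored
            ids.append(blob_id)
        self.snapshots.append((host, name, ids))
        return ids

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.blobs.values())

    def restore(self, ids):
        return b"".join(self.blobs[i] for i in ids)


rng = random.Random(0)
pdf = bytes(rng.getrandbits(8) for _ in range(20000))  # stand-in for test.pdf

repo = ToyRepo()
repo.backup("laptop-a", "test.pdf", pdf)
size_after_first = repo.stored_bytes()

# Same file from a second machine: nothing new needs to be stored.
repo.backup("laptop-b", "test.pdf", pdf)
size_after_second = repo.stored_bytes()

# A partly identical file: only the chunks that changed are added.
edited = b"new cover page" + pdf
edited_ids = repo.backup("laptop-b", "test-v2.pdf", edited)
added_for_edit = repo.stored_bytes() - size_after_second

print(size_after_first, size_after_second, added_for_edit)
```

In this toy, the second backup of the identical file adds no bytes at all, and the edited copy only adds the chunks around the inserted bytes, while each snapshot still lists its own host and file.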
It’s not a problem to use the same repository from multiple machines. Each snapshot will have that client’s files/folders listed, so it won’t be messy either, e.g. when you want to restore.
Restic stores the hostname of the client with each backup/snapshot (and you can override this with --hostname), and you can also tag the snapshots with your own tags if needed.
You can also create a separate password for each client (see the key command), so that you can easily remove access to the repo for just that client, should it be decommissioned or stolen.
Have a go with it, fire up a test repo somewhere and test backing up a couple of clients (for a quick test, you can back up just parts of their filesystem, and you can use a simple sftp repo as well)!
As you might know you can mount the repo via fuse, so you can simply browse the snapshots when restoring, if you want.
Can they back up concurrently too? (I know a lock is created but not sure how exclusive it is.)
Yes. All backup operations can be run in parallel. However, some repository maintenance operations (prune, check) require exclusive access (all backup operations will fail while the repository is locked exclusively).
Backups can run concurrently, but that may lead to duplicate data being written by different hosts. It’ll be cleaned up by the next run of restic prune though, so it’s not an issue.
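A small Python sketch of the shared-versus-exclusive locking described here, in case it helps picture it. This is a toy model, not restic’s lock implementation: ToyLockTable and RepoLockedError are hypothetical names, and the rule that an exclusive lock is refused while backups are still running reflects my understanding rather than anything stated above.

```python
class RepoLockedError(Exception):
    """Raised when a lock cannot be granted (hypothetical name)."""


class ToyLockTable:
    """Toy model of the locks held on one shared repository."""

    def __init__(self):
        self.held = []  # (host, operation, exclusive)

    def acquire(self, host, operation, exclusive=False):
        # Nobody gets in while an exclusive lock is held.
        if any(excl for _, _, excl in self.held):
            raise RepoLockedError("repository is locked exclusively")
        # An exclusive lock needs the repository to itself (assumption).
        if exclusive and self.held:
            raise RepoLockedError("other operations are still running")
        lock = (host, operation, exclusive)
        self.held.append(lock)
        return lock

    def release(self, lock):
        self.held.remove(lock)


locks = ToyLockTable()
a = locks.acquire("laptop-a", "backup")
b = locks.acquire("laptop-b", "backup")  # fine: both locks are shared

try:
    locks.acquire("laptop-a", "prune", exclusive=True)
    prune_started = True
except RepoLockedError:
    prune_started = False  # prune has to wait for the running backups

locks.release(a)
locks.release(b)
p = locks.acquire("laptop-a", "prune", exclusive=True)

try:
    locks.acquire("laptop-b", "backup")
    backup_started = True
except RepoLockedError:
    backup_started = False  # backups fail while prune holds the repository
```

So two backups coexist happily, while a prune and a backup never overlap.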
That’ll work just fine, the contents of the file will be stored just once. I’ve written a lot of background information in the restic blog here: https://restic.github.io/blog/2015-09-12/restic-foundation1-cdc
Thanks for the great feedback everyone.
I did a backup of a shared folder that was 6.87GB (most shared, some unique to that computer.)
I followed with another computer, same shared folder, and a total size of 7.48GB.
My repository size is 7.55GB after backing both computers up to the same repository. Very cool.
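For the curious, the savings from those three numbers work out as follows (plain arithmetic in Python; the variable names are mine, and the sizes are the ones quoted above):

```python
first_gb = 6.87   # shared folder as seen on the first computer
second_gb = 7.48  # same shared folder on the second computer
repo_gb = 7.55    # repository size after both backups

logical_gb = first_gb + second_gb  # what two separate copies would need
saved_gb = logical_gb - repo_gb    # space not used thanks to deduplication
ratio = logical_gb / repo_gb       # deduplication ratio

print(f"backed up {logical_gb:.2f} GB, stored {repo_gb:.2f} GB")
print(f"saved {saved_gb:.2f} GB ({saved_gb / logical_gb:.1%}), ratio {ratio:.2f}x")
```

That is roughly 6.80 GB saved out of 14.35 GB, or a little under half.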
I was also able to mount the repository with fuse, and browse the backed up files. I guess if I was looking for an older version of the file, I would first list the snapshots, then mount the snapshot I think has the version of the file I need (assuming I hadn’t “forgotten” and “pruned” it away already.) I will read and understand more about file versioning so I can keep some backup “sets” as needed.
I have not tried concurrent backups with multiple computers writing to the same repository - but the time will come!
I enjoyed the article on CDC - a lot of work is going on behind those little scrolling numbers on my command line. Thank you for figuring it all out, and making it available to us.
Additional question about keys: Do keys restrict access so that clients can’t see each other’s snapshots, or does any key give full repo access?
Any key gives full access to the repo, so it’s not a form of ACL in that sense.
@x572b - two points to consider regarding backing up to the same repo:
- When pruning that repo from one of the machines, no machine will be able to back up to it until the prune is finished (which can be time-consuming with B2)
- If you ever need to recover all the data of one machine, and it’s infeasible to mount over fuse, then keep in mind that you’d have to download the archive of the whole bucket which will be the size of all machines - rather than of just one machine. At your current size, this would be a negligible cost - however, for me, this was the reason I split all my servers off into their own buckets.
Thank you! About your second point… are you saying that restic restore has to download the entire repo and a fuse mount doesn’t? I always assumed restic restore would be more efficient.
BTW, my use case will be 2-3 machines and ~500GB total.
restic restore will download only the files it needs. Unfortunately, restore performance is currently not good enough, and a restore can’t be resumed if it is aborted for any reason.
However this will be improved, probably soon: