I’m setting up Restic, ideally to run multiple backup jobs at the same time on the same computer, all reading from the same filesystem. Each job will have its files listed explicitly with the --files-from flag, and I have a separate, unrelated script that populates those lists on each run (starting anew each day).
Since I currently have 7 TB to back up, growing at roughly 1.5× per year, and since I’m only getting about 800 KB/s to B2, the initial backup will take years to complete.
So my goal is to trade away “back up the entire thing in the shortest amount of time” (surely best achieved with a single job) in favor of three different but somewhat overlapping goals, in order of importance: “my most important [newest] data backed up soonest”, “the greatest number of files backed up soonest”, and “a random sample of the data backed up progressively”.
In other words, three jobs each backing up:
- Files in descending order of date (newest-first)
- Because my newest data is always the most important by a wide margin.
- Files in ascending order of size (smallest-first)
- To back up the greatest number of files as quickly as possible.
- My smallest files are often important office-type documents.
- Files in random order (which my unrelated script that creates the --files-from lists can accomplish)
- This might seem nonsensical, but having some data from each of my photography/videography session folders is vastly better than “all files from a few sessions but none from most”.
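For what it’s worth, the three orderings above can be sketched with standard tools. A minimal example, assuming GNU find, sort, and shuf are available; the function name and the output filenames (newest.txt, smallest.txt, random.txt) are my own placeholders:

```shell
# Sketch: build the three --files-from lists for a source tree.
# make_lists and the .txt filenames are placeholder names.
make_lists() {
  src=$1
  # Newest first: sort by mtime (epoch seconds), descending.
  find "$src" -type f -printf '%T@ %p\n' | sort -rn | cut -d' ' -f2- > newest.txt
  # Smallest first: sort by size in bytes, ascending.
  find "$src" -type f -printf '%s %p\n' | sort -n | cut -d' ' -f2- > smallest.txt
  # Random order.
  find "$src" -type f | shuf > random.txt
}
```

Each job would then point its --files-from at the corresponding list.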
While there will be some overlap between the criteria, it will be very low initially, given the nature of my data. But the whole point is that the overlap will grow to some subjective crossover point where the objectives are best served by killing the last two jobs for good and leaving the first job as the only one ongoing.
I’m not worried too much about grinding on the 15 disks in the array with three jobs, because B2 is the bottleneck. And if it matters, the cache[s?] will be on an SSD.
The machine is an old dual-CPU, 8-core (no hyperthreading) server with 16GB ECC RAM. I’m not worried about the CPUs probably running full-blast, as it’s not doing anything else. But RAM might be an issue.
It seems that this idea is possible, considering these posts:
- B2 - multiple hosts to single repository bucket?
- Giving up on multiple backups to same B2 repo, using one repo per backup
- Multiple (parallel) backups to the same repo. Good Idea?
Which leads to these questions:
- Is this a bad idea, in terms of the risk of munging up the repository, and/or performance?
- Will deduplication make the redundancy a moot point, in terms of data to upload and store? (A slight increase is tolerable, like a few %. But a 1:1 increase wouldn’t be OK.)
- When the “Newest First” job stumbles on files that have already been backed up by one or both of the other two, will it be able to mostly skip uploading them again? (Without which, this whole idea is moot.)
- Will that one “job” or newest-first snapshot (I don’t know the correct terminology here), once complete, be all that’s needed to restore everything?
- Would there be a benefit to deleting anything related to the other jobs that are rendered defunct once complete? (Benefit in terms of data storage and corresponding monthly fee.)
I’m also unclear on how to organize this from a Restic perspective. E.g. do I run multiple jobs against one repository, or a separate repo for each job, in the same B2 bucket?
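To make the single-repo variant of the question concrete (which, as I understand it, is what would let deduplication work across the jobs), I imagine something roughly like the following. The bucket name, password file path, and tag names are assumptions on my part, not a tested setup, and the B2 backend also needs B2_ACCOUNT_ID / B2_ACCOUNT_KEY in the environment:

```shell
# Hypothetical single-repo setup: all three jobs write to one B2 repo,
# so restic can deduplicate chunks across them. Names are placeholders.
export RESTIC_REPOSITORY="b2:my-bucket:restic-repo"
export RESTIC_PASSWORD_FILE="$HOME/.restic-password"

restic init   # run once, before the first backup

# Three concurrent jobs, distinguished by tag:
restic backup --files-from newest.txt   --tag newest   &
restic backup --files-from smallest.txt --tag smallest &
restic backup --files-from random.txt   --tag random   &
wait
```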
Thanks in advance.