What is the best way to run multiple backups of same filesystem to same B2 bucket?

I’m setting up Restic, ideally to run multiple backup jobs at the same time, on the same computer, reading from the same filesystem. Each job will have its files listed explicitly with the --files-from flag, and I have a separate, unrelated script that populates those lists on each run (starting anew each day).

Since I currently have 7 TB to back up, growing at roughly 1.5× per year, and since I’m only getting about 800 KB/s to B2, the initial backup will take a very long time to complete.
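
Back-of-the-envelope on the raw numbers (assuming the 800 KB/s could be sustained continuously, which it realistically can’t - overhead, latency, retries, and the data’s own growth all stretch this out):

```shell
# Idealized transfer time only: ignores restic overhead, B2 latency,
# retries, and the ~1.5x/year growth of the data set.
BYTES=7000000000000        # 7 TB
RATE=800000                # 800 KB/s, in bytes/sec
DAYS=$(( BYTES / RATE / 86400 ))
echo "$DAYS days"          # ~101 days in the ideal case
```

So even the ideal case is on the order of months, and real-world throughput is much worse.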

So my goal is to trade off “back up the entire thing in the shortest amount of time” (surely best achieved with a single job) for three different but somewhat overlapping goals, in order of importance: “my most important [newest] data backed up soonest”, “the most files backed up soonest”, and “a random sample of data backed up progressively”.

In other words, three jobs each backing up:

  1. Files in descending order of date (newest-first)
    • Because my newest data is always the most important by a wide margin.
  2. Files in ascending order of size (smallest-first)
    • To back up the greatest number of files as quickly as possible
    • My smallest files are often important office-type documents.
  3. Files in random order (which my unrelated script to create --files-from can accomplish)
    • This might seem nonsensical, but having at least some data from each of my photography/videography session folders is vastly better than “all files from a few sessions but none from most”.
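
For reference, the list-building part of my script boils down to something like this (assuming GNU findutils/coreutils for find’s -printf, touch -d, and shuf; demonstrated here on a throwaway directory rather than the real array):

```shell
src=$(mktemp -d)                                 # stand-in for the real data root
printf 'x'         > "$src/doc.txt"              # small, new file
printf '%04000d' 0 > "$src/video.bin"            # larger file...
touch -d '2019-01-01' "$src/video.bin"           # ...with an older mtime

# 1. Newest first: mtime (seconds since epoch), descending
find "$src" -type f -printf '%T@ %p\n' | sort -rn | cut -d' ' -f2- > newest.txt
# 2. Smallest first: size in bytes, ascending
find "$src" -type f -printf '%s %p\n'  | sort -n  | cut -d' ' -f2- > smallest.txt
# 3. Random order
find "$src" -type f | shuf > random.txt
```

Each job then just consumes its own list via --files-from.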

While there will be some overlap between the criteria, it will be very low initially, given the nature of my data. But the whole point is that the overlap will grow, to some subjective crossover point where the objectives are best served by killing the last two jobs for good and leaving the first job as the only one ongoing.

I’m not worried too much about grinding on the 15 disks in the array with three jobs, because B2 is the bottleneck. And if it matters, the cache[s?] will be on an SSD.

The machine is an old dual-CPU, 8-core (no hyperthreading) server with 16GB ECC RAM. I’m not worried about the CPUs likely running full-blast, as the machine isn’t doing anything else. But RAM might be an issue.

It seems that this idea is possible, considering these posts:

Which leads to these questions:

  • Is this a bad idea, in terms of the risk of munging up the repository, and/or performance?
  • Will deduplication make the redundancy a moot point, in terms of data to upload and store? (A slight increase is tolerable, like a few %. But a 1:1 increase wouldn’t be OK.)
  • When the “Newest First” job stumbles on files that have already been backed up by one or both of the other two, will it be able to mostly skip uploading them again? (Without which, this whole idea is moot.)
  • Will that one “newest first” job or snapshot (I don’t know the correct terminology here), once complete, be all that’s needed for restoring everything?
  • Would there be a benefit to deleting anything related to the other jobs that are rendered defunct once complete? (Benefit in terms of data storage and corresponding monthly fee.)

I’m also unclear on the specifics of how to organize this from a Restic perspective. E.g., do I run multiple jobs against one repository? Or a separate repo for each job, in the same B2 bucket?
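
In case it helps, here’s roughly what I have in mind, assuming a single repository is the right call (my understanding is restic only deduplicates within one repository, not across repos). Bucket name, repo path, and password handling are placeholders:

```shell
# Placeholders: bucket/path, password; the B2 backend also needs
# B2_ACCOUNT_ID / B2_ACCOUNT_KEY set for the bucket.
export RESTIC_REPOSITORY="b2:my-bucket:restic"
export RESTIC_PASSWORD="changeme"   # use a password file in practice

if command -v restic >/dev/null; then
  restic init                                            # once, up front
  # the three jobs, distinguished by tag, running concurrently:
  restic backup --tag newest   --files-from newest.txt   &
  restic backup --tag smallest --files-from smallest.txt &
  restic backup --tag random   --files-from random.txt   &
  wait
fi
```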

Thanks in advance.

Probably not the answer you’re looking for, but given the circumstances I’d just buy an HDD > 8 TB and store it somewhere safe. If you are very concerned about redundancy, get two HDDs and only ever connect one of them at a time to the source, leaving the other at a secure location. Far less headache than a backup taking years to complete IMHO (and financially attractive as well).

B2 Option
800 KB/s is 6.4 Mbps, which isn’t exactly slow, but it’s not fast either. I’m wondering if there’s a way to run multiple upload threads in parallel, which tends to scale quite well. Running multiple backup jobs might do that for you.
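
For what it’s worth, restic also exposes the number of concurrent connections its B2 backend uses (the default is 5, I believe), so raising that is another way to get parallel upload threads without multiple jobs:

```shell
# Assumption: the b2.connections backend option of restic's B2 backend
# (default 5 concurrent connections); newest.txt is a placeholder list.
CONNECTIONS=10

if command -v restic >/dev/null; then
  restic -o b2.connections="$CONNECTIONS" backup --files-from newest.txt
fi
```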

I upload to both S3 and B2; both hit my capped upload speed of 20 Mbps, but it takes three threads to max it out. Note that I’m in New Zealand backing up to a USA region, so latency limits my bandwidth.

Hard Drives
I agree with dhopfm that two hard drives could be a better idea. A good modern hard drive does 150 MB/s or more. I use HGST drives; Backblaze rates them highly.

My system
I have all my data on two hard drives, which I update every couple of months. Between those updates I store everything on S3. I actually use CloudBerry Backup on my PC to sync new files to S3, with S3 versioning and encryption, as it’s simpler and I can access the files from any web browser. I use Restic to back up my web server, my PC to a second hard drive, and my PC to my backup drives.

Even when I use Restic, I keep a plain copy of key data on the backup drive: while Restic seems good, it’s still not at version 1.0, and I want to mitigate that risk.

My primary array is Btrfs raid1 with 2×8 TB, 2×10 TB, and 1×4 TB drives. I already back up locally to another server with an 18-drive ZFS stripe of 3-way mirrors, mostly older/smaller drives, the largest among them 4 TB. While an unversioned snapshot could currently fit on a 10 TB drive (and in fact does), at a steady growth rate of about 1.5× per year it won’t for much longer. Drive capacities seem to jump in spurts, and while the largest single drives are sometimes big enough, that’s not always true.

I have few realistic worries about losing data to hardware failure and/or user error. (Though, believe it or not, multiple times over the years with the same or similar configuration, I’ve very nearly lost all of my data: to once-in-a-lifetime-level catastrophic hardware failure [twice], and to incomprehensibly idiotic user error from taking too many shortcuts [over-reliance on command-line completion being among the worst].)

The bigger issue is having everything local and losing it to fire, earthquake, and/or theft. All of my data used to be on CrashPlan (from which, I understand, a complete multi-TB restore can be a nightmare - though I’ve had good luck with individual file restores over the last ten years). After they canceled their home product and migrated my backup to Small Business - throwing away some 6 TB or so in the process - I’m only back to about 1/3 (1/2 at best) backed up. But I’ve lost confidence that Code42 will remain a going concern for the long haul, if even for much longer.

Anyway, most of my older stuff is already backed up multiple ways, including offsite. And even if it wasn’t, my newest data is always the most important. So as long as I can back up newest first (as most all-in-one cloud-based backup services do by default and most everything can be tricked into doing), I’m OK with not having perfect offsite coverage.

And actually, transporting an HDD back and forth isn’t a great idea. Sure, I know many people get away with it for years. I did too, for a good two years or so, a long time ago - I had one of those removable hot-plug caddies (before USB was even a thing), and stored the drive in a cushioned hard case. I’ve also done the same with multiple pairs of external USB drives configured as ZFS mirrors. But even though HDDs are designed to take a very high G load when powered off, the reality is that even the light bumping of a heavily cushioned case handled delicately eventually does the drives in. I’ve lost countless drives that way. (But with mirroring, no actual data.) It’s just not an option I consider anymore. There are ways to all but eliminate even that risk (e.g. multiple mirrored USB sets in constant rotation), but at that point it starts to get pretty impractical, at least as a long-term solution - especially considering that my data doesn’t always fit on the largest reasonably priced, consumer-grade drives. I can’t really justify risking moving two of the “largest drives available at any price” around every other day or so.

That said, it might not be a bad idea until a new cloud backup is 100% seeded.

Thanks for the feedback.

HGST are among my favorites too. But I try to avoid putting drives from the same manufacturer and time period in the same array, so I never buy all HGSTs. (WD, which now owns HGST, has been another long-running favorite, and occasionally I’ll risk a Seagate if a particular model fared well in Backblaze’s data.) My own experience matches Backblaze’s published data pretty well, albeit with a sample size of several dozen HDD deaths rather than thousands.