Is it safe to sync a local repo to another machine using Dropbox, Google Drive, etc.?

Our current backup scenario includes restic performing backups directly to a local repo (on an external drive) so that the backup operation itself takes as little time as possible.

Would syncing that repo directly to another machine using Dropbox be safe? (not using Dropbox directly as a backup destination in restic, but syncing the repo’s files themselves through Dropbox)

I’m wondering about this scenario specifically:

  1. Machine A backs up to a local repo (on an external drive)
  2. Dropbox syncs that repo’s files to Machine B (after that, all repo files are the same on A and B)
  3. A new snapshot is taken in Machine A’s local repo
  4. Dropbox starts syncing the new / changed repo files from Machine A to Machine B.
  5. The connection drops midway through the transfer and syncing the new snapshot’s files never completes (or a restore is performed on the destination while Dropbox is downloading new repo files).

In that scenario, is the restic repo on the destination machine (which received the repo files through Dropbox) corrupted? Is there potential for the repo to become inaccessible / corrupted in the time period where Dropbox is not done syncing?

Would that be problematic somehow, or does the way the repo is structured on disk make it resilient in that scenario?

If you are just adding backups to your repo, restic only adds new files; it doesn’t change existing ones. So I would say it’s fine; worst case, you’ll just not have that new snapshot on machine B.

If OTOH you do things that change files in the repo, such as pruning the repository, and those changes aren’t completely synced, then I wouldn’t expect the repo on machine B to be usable until the sync has completed.

All this said, it would be over my dead body that I’d trust the combination of e.g. a USB-attached external drive and Dropbox for syncing stuff. What problem are you trying to solve that you can’t solve by just running two backups/repos separately instead? Or use something other than Dropbox, out of all the things in this world, to sync the files (e.g. rsync or rclone).

Ok, this confirms my high-level understanding of how restic stores things on disk. Thanks.

Or use something other than Dropbox, out of all the things in this world, to sync the files (e.g. rsync or rclone).

rsync is not really an option on Windows, but rclone certainly is (I’ve already tried it for something else). Syncing the repo through Dropbox (or another syncing service), if reliable, could have been a simple way to get a copy of the repo onto one or more machines without requiring special setup on the target machines or their networks. As for using a second offsite repo (which is understandably much preferred), see below.

What problem are you trying to solve that you can’t solve by just running two backups/repos separately instead?

The thing is, our current backup strategy involves taking a website and its database down so that all files are closed (especially the SQL Server DB files), then backing everything up as one coherent chunk, including the many user-uploaded files that map to paths in the database (the files on disk must match the contents of the DB). Minor downtime (a few minutes in the middle of the night) is acceptable for our service.

To minimize downtime as much as possible, we’d like the site backup step itself to be performed as quickly as possible, which is why we favour a local repo. Many of our sites contain multiple gigabytes of data, and multiple gigabytes can sometimes be added daily, depending on the site. Adding a second (offsite) backup destination for each site backup task would imply much more downtime per site. This is why I was exploring what was possible to do with the “raw” restic repo itself.

Which makes me think: is cloning a restic repo in a more “intelligent” fashion something that restic itself supports or could support? Something like a hypothetical restic clone /my/local/repo b2:bucketname:path/to/repo which would only transfer known snapshots (ignoring in-progress backups) in a safe way?

In any case, suggestions are welcome. Thanks for your help!

It should be OK, but whatever third party you’re using to send the repo data, it would be wise to verify the data at the far end. A frequent scheduled restic check on Machine B, plus an occasional restic check --read-data (or --read-data-subset to check x% of the total repo), would mean you can be sure the repo has been copied across identically.
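As a sketch of what those occasional read-data checks could look like, here is a hypothetical helper (the repo path, the use of the ISO week number, and the ten-way split are all my own assumptions) that rotates through the repo one tenth at a time, so ten scheduled runs read-verify everything:

```shell
# Hypothetical helper: pick which tenth of the repo to read-verify on this
# run, based on the ISO week number, so ten scheduled runs cover the whole repo.
subset_for_week() {
    week=${1#0}                   # strip a leading zero ("08" is not octal)
    echo "$(( week % 10 + 1 ))/10"
}

# The scheduled job on Machine B could then run something like:
# restic -r /path/to/repo check --read-data-subset "$(subset_for_week "$(date +%V)")"
```

restic’s `--read-data-subset n/t` form reads the nth of t groups of pack files; check the exact syntax against your restic version.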

Somewhat relatedly, and depending on your setup, you might want to look at Syncthing to sync the data over. It saves relying on a third-party service and avoids storage restrictions, privacy/security implications, etc.

It’s unlikely to be corrupted, but be aware that it might look damaged depending on the order in which the files get synchronized. In particular, if a snapshot file is synchronized before all of the data and index files it references, you will get errors from restic check complaining about the missing data. The repo can be “fixed” from that state by deleting the offending snapshot files – however, Dropbox might synchronize the deletion back to the original system!

Using Dropbox to sync also does not really count as having an off-site copy because machine A can damage the off-site copy on machine B. If the Dropbox account has sufficient access, machine A could even delete file history from Dropbox.

I would instead strongly recommend using rsync from machine B to pull the files from machine A, and configuring permissions so that the account machine B uses to connect to machine A does not have write access to the source. This will prevent machine B from being able to tamper with the authoritative copy of the repo on machine A.

Machine B could also use rclone copy --immutable instead to only accept new files from machine A (not deletions or changes). This is a great way to maintain a known-good offsite copy that cannot be corrupted by the primary copy. (Note that this cannot be run concurrently with any backup/prune operation on machine A, or a partial pack might be copied which will never be updated later. restic check would detect this case, however, and the partial pack could be deleted by hand and the rclone command run again to obtain the final pack.)

+1 for Syncthing (disclaimer: I’m a contributor). In this particular scenario, you can set up machine A’s folder as “send only” and machine B’s folder as “receive only.” This would also prevent machine B from being able to tamper with machine A’s copy.

This has the same “machine A can destroy the off-site copy” caveat as Dropbox, however, unless you enable file versioning on machine B’s Syncthing folder. It might be a bit of work to find the correct version of each file and restore them as appropriate.

You could also script an export of the database, which should be able to run concurrently with your application.

If you’re using SQL Server on Linux, you can do what we do with MySQL: take an LVM snapshot, mount that, and back that up using restic. The LVM snapshot is atomic so restoring those files will look like the power cable was yanked at that exact moment. SQL Server should be able to replay its journal to make the data files consistent.
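A rough sketch of that LVM flow, with made-up VG/LV names, sizes and mount points (RUN=echo keeps every step a dry run; clear RUN to actually execute):

```shell
# Hedged sketch of the LVM-snapshot backup flow. All names and sizes below
# are assumptions; RUN=echo makes every step print instead of run.
RUN=${RUN:-echo}

lvm_backup() {
    $RUN lvcreate --snapshot --name dbsnap --size 5G /dev/vg0/data
    $RUN mount -o ro /dev/vg0/dbsnap /mnt/dbsnap
    # Back up the frozen, crash-consistent view of the filesystem:
    $RUN restic -r /mnt/backup/repo backup /mnt/dbsnap
    $RUN umount /mnt/dbsnap
    $RUN lvremove -f /dev/vg0/dbsnap   # drop the snapshot promptly; it costs write performance
}
```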

@cdhowie Great info, thanks!

In particular, if a snapshot file is synchronized before all of the data and index files it references, you will get errors from restic check complaining about the missing data.

Note that this cannot be run concurrently with any backup/prune operation on machine A, or a partial pack might be copied which will never be updated later.

OK, so if I understand correctly, in all the suggested scenarios (regardless of the syncing method) the repo on machine B might have trouble restoring without errors, even if it’s not corrupted per se, especially if the sync or the restore occurs at the “wrong time”. This is not ideal for time-critical restores, even if it can be worked around. I thought restic might have been able to automatically restore from the last known-good snapshot for a specific path if it encounters a “broken” snapshot.

You could also script an export of the database, which should be able to run concurrently with your application.

I was going to tackle that next. It seems like that might be the only option (combined with a second, offsite restic repo) to keep a reliable offsite copy.

If you’re using SQL Server on Linux, you can do what we do with MySQL: take an LVM snapshot, mount that, and back that up using restic. The LVM snapshot is atomic so restoring those files will look like the power cable was yanked at that exact moment. SQL Server should be able to replay its journal to make the data files consistent.

Interesting, but unfortunately it’s a Windows box. For the record, I researched Volume Shadow Copy (aka Volume Snapshot Service / VSS), which seemed somewhat similar on Windows, but it seems like it might not always play nice with SQL Server (even though SQL Server officially supports VSS).

The simplest way to work around this is to always sync $REPO/snapshots first, then sync everything else. In this situation, a broken snapshot is impossible after that specific sync. (If using rclone copy --immutable then you need to address partially-copied packs before the next copy operation. If you’re not doing an immutable sync then this is not an issue, but it does allow a malicious actor on machine A to trick machine B into overwriting good data with bad data.)
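A minimal sketch of that ordering (paths are placeholders; COPY stands in for the real transfer command, e.g. COPY="rclone copy --immutable", and defaults to a plain local cp so the sketch runs anywhere):

```shell
# Sync the snapshots/ directory first, then the whole repo. Any snapshot
# copied in step 1 was already finished on the source, so every pack and
# index it references exists there and gets picked up by step 2.
COPY=${COPY:-"cp -a"}

sync_repo() {
    src=$1
    dst=$2
    mkdir -p "$dst/snapshots"
    $COPY "$src/snapshots/." "$dst/snapshots/"   # 1. snapshot files first
    $COPY "$src/." "$dst/"                       # 2. then config, keys/, index/, data/
}
```

With rclone the two arguments would of course be remote/local paths in rclone’s syntax rather than local directories.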

There’s other approaches that also might work, such as: (this all takes place on machine B)

  • cp -al the repository to a temporary directory. This will take very little additional disk space since it uses hard links. You’ll sync to this temporary location to avoid potentially damaging the primary copy until it’s been verified.
  • rclone copy --immutable from machine A to the temporary repository twice, once for $REPO/snapshots then again for everything else.
  • Run restic check on the temporary repository.
    • If there are no errors, recursively delete the original repository and move the temporary repository where the original was.
    • If there are errors, don’t do anything and mail the sysadmin the restic check output for manual intervention.
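That flow could look something like this on machine B (a sketch under assumed paths; the rclone and restic invocations are left commented so the skeleton itself only needs coreutils):

```shell
set -eu

sync_and_swap() {
    repo=$1            # authoritative local copy on machine B
    tmp=$repo.tmp      # temporary hard-linked working copy

    rm -rf "$tmp"
    cp -al "$repo" "$tmp"   # hard-link copy: near-zero extra disk space

    # Pull snapshots first, then everything else (both immutable), e.g.:
    # rclone copy --immutable machineA:/repo/snapshots "$tmp/snapshots"
    # rclone copy --immutable machineA:/repo "$tmp"

    # Verify before promoting, e.g.:
    # restic -r "$tmp" check || { echo "check failed; leaving $tmp for inspection" >&2; return 1; }

    rm -rf "$repo"     # promote: replace the old copy with the verified one
    mv "$tmp" "$repo"
}
```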

This is not currently an option. Restic does not try to be too smart; rather it fails and lets the operator figure out what to do about the problem.

My backup flow would be thus, similar to scripts I use here and at customers:

  • Stop relevant services / wait until particular processes have exited
  • If OK…
  • restic backup
  • Restart services as necessary
  • If OK…
  • restic check (--read-data every x weeks / runs etc. as desired)
  • If OK…
  • Start off-site/off-system sync, be it a call to rsync or a Syncthing folder rescan request
  • Send / email logs as required
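As a sketch, the skeleton of such a script might look like this (service handling and paths are placeholders, and the restic and sync calls are commented out):

```shell
set -eu   # any failing step aborts the flow, giving the "if OK" gating

REPO=/mnt/backup/repo   # assumed path

stop_services()  { echo "stopping services"; }   # placeholder for your init system
start_services() { echo "starting services"; }   # placeholder

backup_flow() {
    stop_services
    # restic -r "$REPO" backup /srv/site
    start_services
    # restic -r "$REPO" check               # add --read-data every x weeks as desired
    # rsync -a "$REPO/" offsite:/backups/   # or request a Syncthing folder rescan
    echo "backup flow finished"
    # send / email logs as required
}
```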

I don’t want to sound prescriptive, but IMHO a core thing all backup tools should aim for is to facilitate restores as much as they can, including accounting for edge cases like this. The last thing you want to do when disaster strikes is fumble through documentation and commands trying to get your files back. If you have multiple snapshots (say, tagged snapshots or snapshots for different paths), the time required to do this manually could quickly get out of hand, unless one has the foresight to create (and maintain) scripts for this beforehand.

Maybe restic could warn about broken snapshots (which might occur for other reasons than a bad sync) and automatically fall back to the latest valid one that matches the provided criterion when using “latest”?

I do understand the Unix philosophy of keeping a light tool that does one thing well but my opinion is that a few smarts would be nice to ensure restores are as seamless as possible, as restore time can be critical to many businesses (including us).

@ProactiveServices It would indeed be nice if restic directly supported VSS in some way, if that’s possible without coupling the tool to Windows. Still, I would probably not trust VSS to back up SQL Server databases, as it seems like backing up *.mdf and *.ldf files is generally not recommended anyway. So what I do is this (as suggested by @cdhowie):

  1. Archive the database as a .bak file beside the mdf and ldf files (by logging into the server and performing a BACKUP DATABASE query)
  2. Back up the site using restic, ignoring mdf and ldf files

Doing it this way, in addition to being the recommended way to back up SQL Server databases, saves us from having to shut sites down and allows us to back up directly to a second offsite destination while the sites are still running.
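For illustration, the two steps might be driven like this (the server, database and path names are invented, and the helper only echoes the commands so it is safe to run as-is; drop the echoes to execute for real):

```shell
# Hypothetical wrapper for the two-step backup: dump the DB to a .bak with
# sqlcmd, then let restic back up the site while skipping the raw DB files.
backup_site() {
    db=$1
    site_dir=$2
    repo=$3
    echo "sqlcmd -S localhost -b -Q \"BACKUP DATABASE [$db] TO DISK = N'$site_dir\\db\\$db.bak' WITH INIT\""
    echo "restic -r $repo backup $site_dir --exclude '*.mdf' --exclude '*.ldf'"
}
```

On the actual Windows box this would live in a batch or PowerShell script instead, but the two commands stay the same.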

After all that was said in this thread, I’m much more comfortable doing that than trying to sync the repo in some hacky way afterwards.


This sounds a bit dangerous to me. I don’t want my backup tools guessing what to do, I want them to do what I asked them to do. If this logic is behind some optional flag then I would not oppose adding this feature. I would absolutely not want it to be the default behavior. (But I also pretty much never use “latest” with restic because I want to be precise when describing what I want restic to do.)

As I mentioned before, the correct way to avoid this problem when syncing repositories between sites (when you cannot guarantee the repository is not being written to) is to sync snapshots first and then the remaining directories. This still is not safe if a prune operation runs concurrent to the sync operation.

Fair enough. Maybe logging something like this when a snapshot restore fails would do:

“Failed to restore latest snapshot because xxxxxxx”
“Use the --xyz flag to restore the last valid snapshot”

Just thinking out loud. Seems like that would be a nice convenience to have.

Note that restic can’t tell whether a snapshot is damaged without actually crawling the whole thing, so we’d have to decide whether it should try to restore the snapshot and switch to a prior one on failure, or validate the snapshot before trying to restore.

Validating the snapshot in advance has the advantage that it won’t leave extra files around. Consider the case where the latest snapshot is damaged and has files that the prior snapshot doesn’t. If those files get restored before restic notices that the snapshot is missing data, we can either delete them (which seems like something a backup tool shouldn’t do) or ignore them (which leaves them around, which would probably not be desired).