Prune over low bandwidth links

I am setting up scripts to do nightly backups to a machine at another location over a relatively slow internet link, so I want to avoid unnecessary traffic.

I am doing the initial backup to an external hard drive and will carry that to the destination machine, then create a new snapshot daily. I plan to run a script weekly (or bi-weekly) to prune unnecessary snapshots and validate that the saved data on the server is correct.
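For context, here is a minimal sketch of the kind of script I have in mind (the repository URL, paths, and retention policy are made-up examples):

```shell
#!/bin/sh
# Example only: repo URL, paths and retention policy are placeholders.
export RESTIC_REPOSITORY="rest:https://backup.example.com:8000/myrepo"
export RESTIC_PASSWORD_FILE="/etc/restic/password"

# Nightly: take a snapshot; restic only transfers new/changed chunks.
restic backup /home /etc

# Weekly: drop snapshots outside the retention policy (metadata only,
# so cheap over a slow link; space is reclaimed later by prune).
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```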

I think restic forget is fast to run remotely, but what about restic prune? It seems that if it is rewriting whole packs, it would be re-transferring the bulk of the data, right? So should I run that on the destination server instead? Yes, I realize that means I need to break security and store the password on the server as well.

What about data scrubbing? Should the server just validate pack checksums or should I use restic check? And again check needs lots of bandwidth, right?

Does the rest-server put enough smarts at the destination of a connection to allow processing data for these maintenance tasks on the destination server?

You’re right on about everything. The operations would be much faster if done on the remote side, but then you need to trust that server with your password. It’s up to you to decide.

The caching layer available now in the restic master branch improves performance a lot. If you can spare some disk space on the source machine (typically 1-8% of the repo size, depending mostly on the number of files and snapshots), it might work well even without compromising security; you'll need to test it in your environment.
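For anyone testing this: on a build that has the cache, you can control where it lives via the global `--cache-dir` flag (the repo URL and path below are examples):

```shell
# Use a dedicated cache directory on the source machine; subsequent
# metadata-heavy operations (forget, check, snapshots) read from it
# instead of re-downloading index/tree data over the slow link.
restic -r rest:https://backup.example.com:8000/myrepo \
       --cache-dir /var/cache/restic snapshots
```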

rest-server fully supports all restic methods, so yes, you can run all usual restic commands with rest-server as a backend, no problem.


Thanks for the reply.

The question was, does the rest-server allow some processing to be done remotely?

For example, without breaking security, when running restic check the server side could verify that all the files on its disk are unmodified and have the correct checksums. The client could then complete the rest of the checks with just a list of the files on the server and the local metadata cache.
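For what it's worth, restic's check already lets you trade bandwidth against thoroughness from the client side (the subset syntax may require a newer restic version):

```shell
# Structural check only: verifies index consistency and that all expected
# files exist; with a warm local cache this is mostly metadata traffic.
restic check

# A full scrub downloads and verifies every pack. On a slow link the cost
# can be spread over several runs, one subset at a time:
restic check --read-data-subset=1/4   # verify the first quarter of packs
```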

However, could a ‘prune helper’ run remotely to help reshuffle the data without needing to decrypt it? Unfortunately, I don’t have enough of the model in my head to answer that. It seems the metadata should have the offset/length of each block of data in the packs, so perhaps the client could calculate which chunks should be kept, and then the helper on the server could do the actual data manipulation. Then only the metadata files would need to be retransmitted.

But for now, it seems I should do my weekly maintenance tasks on the server.

No, rest-server can’t do that, for obvious reasons. Being on the remote side, it doesn’t know the password to decrypt the blobs, so it won’t do any processing by itself. For example, if a pack needs to be rewritten, the restic client will pull it locally, rewrite it, then push it back. In all operations, rest-server is only a dumb transport layer; it doesn’t care about the data it’s transferring, nor should it.

If you really mostly care about performance, then doing maintenance operations on the remote side is your best option. As the repo layout is the same between the local and rest-server backends, you can at least do that. Others, using various cloud-based backends, are out of luck there. OTOH, they wouldn’t even want to do something like that, 'cause they host their backups with untrusted third-party providers.

You can learn more about restic threat model here: http://restic.readthedocs.io/en/latest/100_references.html#threat-model

Call me thick-headed, but that isn’t totally obvious to me. The data packs should be a series of encrypted blobs. My assumption is that the metadata contains an index that maps blob IDs to a (packfile, offset, length) tuple. The client could determine which packs are present and which could be removed, and then send instructions to a helper on the server describing how to construct a new pack by copying pieces from the existing packs. Afterwards, the unused packs could be deleted and new index files uploaded. That would allow only the metadata to be manipulated over the wire while the bulk of the data is just moved on the server. And since it is moving already-encrypted chunks, the data doesn’t need to be decrypted on the server.
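That assumption matches restic's repository design, as far as I understand it: index files really do map blob IDs to a pack/offset/length triple. You can inspect this yourself (the ID below is a placeholder):

```shell
restic list index          # print the IDs of all index objects
restic cat index 5a3f...   # dump one decrypted index as JSON; the
                           # simplified shape looks roughly like:
# {"packs": [
#   {"id": "<pack-id>",
#    "blobs": [{"id": "<blob-id>", "type": "data",
#               "offset": 0, "length": 4056}]}]}
```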

I totally get that @fd0 doesn’t need to do it this way, but I am just trying to understand if it would be possible.

Let’s call the ability to freely re-arrange pieces from old packs into new ones a “smart backend”. Currently, none of the existing backends can do that. What it requires is moving away from the object storage model where one blob is an atomic entity (which is one of the foundations of the simplicity of the restic.Backend interface).

It would be possible to extend rest-server so that it becomes a smart backend, since its logic is under the control of the restic authors. But the same is probably not true for third-party object storage providers (S3 etc.). While most of them are capable of answering range requests (i.e. fetching only part of an object), and some may even offer multi-part uploads, the data is still transferred to or from the client rather than re-arranged purely on the server side.

So rest-server would become the (or one of few) special cases that need to be taken care of on the client side. Restic would need to be aware of whether the backend is smart and react to that. I guess it would break many existing abstractions within restic that were made for the sake of simplicity (a good thing IMHO). The prune code would need to distinguish between smart and classic backends in many places, only to support a very specific and rare use case.

But is this the only approach? What if there were a small service on the server that somehow forwarded the authentication and called a local restic instance on the server?

This is a very good point and is a good argument against what I am suggesting.

Unfortunately, I don’t think this is possible while retaining the security model. If I don’t trust the server then any proxy running on that server also can’t be trusted. I can’t pass credentials to anything running on that server or trust that a restic binary hasn’t been tampered with.


Oh well, I guess I will embrace the no-security model and write some scripts to do weekly maintenance on my server.
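In case it helps anyone else, a sketch of such a server-side maintenance script (paths and retention policy are examples; the password file on the server is exactly the trade-off discussed above):

```shell
#!/bin/sh
# Runs on the backup server, directly against the directory that
# rest-server serves, so prune/check happen at local disk speed.
export RESTIC_REPOSITORY=/srv/restic/myrepo
export RESTIC_PASSWORD_FILE=/root/.restic-password  # the security trade-off

restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
restic prune                 # repacking no longer crosses the WAN
restic check --read-data     # full scrub without touching the slow link
```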

Thanks for the input.

In principle the described “re-shuffling” is possible; the pack files consist of a number of blobs which are encrypted separately. In order to verify that everything worked out well, with the current repo format, restic would need to download the data after re-shuffling. I have plans to experiment on that with a new backend, but no code has been written yet.

@fd0 Any updates?

If your storage is not owned by you, you could also look into getting a VPS, though you’d have to trust the VPS provider.

@alexweiss opened PR #2513, which (if I understood correctly) should help in this situation. With the newly added commands you can clean up your repository while omitting the re-writing of packs, which is the most time-consuming part because re-written packs need to be uploaded again.
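For later readers: if I recall correctly, this line of work eventually landed in released restic versions, where prune lets you limit or skip repacking entirely at the cost of some unused space (flag availability depends on your restic version):

```shell
# Reclaim what can be reclaimed by deleting fully-unused packs, but never
# repack (and thus never re-upload) partially-used ones:
restic prune --max-unused unlimited
```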

Would be great if you could test this PR on a testing repository.

Sadly I pruned a few days ago; it took 3 days total. I don’t have much to test with (otherwise I would copy one of my prod repos and prune on that).