Serve S3 backups to restic seamlessly, aka "kill the middleman"

Hello everybody,
I have been using restic for a long time, with multiple machines and a dedicated rest-server. All good so far.

For a while now an idea has been forming, so hear me out :slight_smile:

In the cloud native world, everything wants some sort of S3 frontend.
Now, I have a storage provider, for instance Longhorn, which wants to back up its data to S3 (or local storage, or NFS). So we have something like this:

longhorn —push backup—> S3 <—pull backup— restic

Why not just … kill the middleman? Or, well, give the middleman a direct wire to restic.

longhorn —push backup----> S3 to stdout | restic backup --stdin

So I propose a user-facing S3 interface that backs up directly to restic (not via --stdin, more like the rest-server implementation).
The repository unlocking would then need to happen through the S3 bridge, or the repository would just always be open. When a stream of data hits the S3 bridge, it starts a new backup; once it doesn't receive more data for x amount of time, it completes the snapshot and waits for more.

I experimented a little bit with backups from stdin and rclone serve s3/restic combinations, but realised it needs more than that. But maybe I am missing something …
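For reference, the "middleman" workflow I experimented with looks roughly like this: stream a backup object out of the S3 bucket and pipe it into restic. A minimal sketch, assuming the bucket, object, and repository names (which are illustrative, not real):

```shell
# Pull one Longhorn backup object out of S3 and feed it to restic via stdin.
# "longhorn-backups" and "volume1.tar" are made-up names for illustration.
rclone cat s3:longhorn-backups/backupstore/volume1.tar \
  | restic -r rest:https://rest-server:8000/repo backup \
      --stdin --stdin-filename volume1.tar
```

This works for a single object, but it is one snapshot per object and still requires the S3 bucket to exist in the middle, which is exactly what the proposal tries to avoid.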

What do you think? Is there maybe a solution to this already?

Let me see if I’ve understood this correctly.

  • Longhorn has an inbuilt backup system that expects to write to an S3 backend (or NFS, or local, but we’re not concerned with those right now).
  • Your current solution is to use restic to ingest the backup data from S3 after Longhorn has written it, and generate a snapshot from that.
  • What you'd like is for restic to provide something like a "virtual S3 backend" that Longhorn can write to directly, with restic creating a snapshot based on that data?

Assuming this was implemented, what would the restore process for a backup created in the manner you’re proposing look like?
I’m assuming Longhorn backups will put a bunch of s3 files into a bucket, but surely those aren’t restorable as-is to create Longhorn volumes?
Also, would Longhorn’s backup system need to “see” the files created by previous backups, when taking a new backup?

As an aside from the feature idea…
There are solutions that work with restic and kubernetes storage providers capable of CSI volume snapshots to take backups direct to S3. As far as I can tell, Longhorn is CSI volume snapshot capable. volsync (Restic-based backup — VolSync documentation) and k8up (https://k8up.io/) are the two backup solutions I’m currently aware of and use, although with Rook, not Longhorn.

Yes, you understood correctly.

“virtual S3 backend” is basically what I am looking for, yes.

The idea is not limited to Longhorn; many tools have some sort of "backup to S3" capability, etcd and Nextcloud to name a few. I looked up k8up; it's a nice tool which ticks some (maybe all) of the boxes. But k8up also wants to back up to S3, which is not so bad in this case, because you only have the data twice (source + restic repo) and not three times (source, S3 middleman, restic repo).

Reading data back could be done with a restic dump of the requested file, maybe? Or a restic mount to read from?
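Both read paths mentioned above exist in restic today; a sketch of how the bridge might serve a GET, assuming a local repository path and an example file name:

```shell
# Option 1: stream a single file straight out of the latest snapshot.
restic -r /srv/repo dump latest /volume1.tar > volume1.tar

# Option 2: expose the repository as a filesystem and read from it.
restic -r /srv/repo mount /mnt/restic &
cat /mnt/restic/snapshots/latest/volume1.tar > volume1.tar
```

`dump` fits a simple GET-by-key better, since it needs no FUSE mount on the server.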

Usually the restic client does all the heavy lifting: reading, encryption, repository communication, pruning, etc. With the proposed virtual S3 backend, the client would be dumb (a basic S3 PUT at minimum) and the server would need to do what the client usually does.

I imagine something like this for practical usage:

App Z

  • capable of backups to an S3 target
  • unaware of restic / encryption
  • has the default required S3 fields (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, URL)

We could misuse the required fields: for example, the access key could be used for authentication to the S3 bridge (like .htpasswd with rest-server) and the secret key could be used to open the repository.
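From App Z's point of view, that mapping would just look like ordinary S3 configuration. A sketch of what the client side might set, where all names and the endpoint are invented for illustration:

```shell
# Standard S3 fields, repurposed for the virtual S3 backend (vS3b):
export AWS_ACCESS_KEY_ID=appz            # bridge auth user (like rest-server's .htpasswd)
export AWS_SECRET_ACCESS_KEY=repo-pass   # doubles as the restic repository password
export S3_ENDPOINT=https://vs3b.example:9000/appz-bucket   # the bridge, not real S3
```

The nice part is that App Z needs zero changes; the semantics are hidden entirely in the bridge.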

The vS3b (virtual S3 backend, for now) authenticates the client, opens the repository, and accepts PUTs into a new snapshot, until the stream of data stops for more than x time.
It would also somehow need to ensure the uniqueness of every client (by host identifier, source IP or something else), so that if two clients happen to use the same key and back up at the same time, they still end up as two independent backups (locking could also achieve that more easily).
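One way the bridge could keep concurrent clients apart with restic's existing flags is to stamp each incoming stream with the client's identity. A hypothetical sketch (the variables are ones the bridge would derive from the request, not real restic features):

```shell
# Hypothetical server-side handling of one PUT inside the bridge:
# $OBJECT_KEY, $CLIENT_ID, and $SOURCE_IP would come from the S3 request.
restic -r /srv/repo backup \
  --stdin --stdin-filename "$OBJECT_KEY" \
  --host "$CLIENT_ID" \
  --tag "vs3b" --tag "$SOURCE_IP"
```

Distinct `--host` values give each client its own snapshot lineage, so two clients sharing one key still produce independent backups.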

… Well, the more I think about it, the more I realise that implementing this directly in restic would be hard, or not wanted.

But maybe there are some people out there with the same problem/requirement?