Inspiration for requirement "s3 to s3 backups"

lobsang · April 22, 2021, 8:17am

Hello everyone,

I am asking this question here in hope for inspiration.

I have the requirement to perform s3-to-s3 backups in two different flavours:

application data is kept in s3 bucket. backup strategy asserts using s3 storage for backups
backups are kept in s3 bucket. backups should be replicated into separate s3 bucket in different availability zone

Generally speaking I want the backups to be structured logically in terms of snapshots with application of retention policies. At-rest encryption for backup data is also a mandatory requirement. Optimization for minimal storage and data transfer is also a very strong requirement. Obviously Restic is the perfect fit. Unfortunately Restic can only operate on block storage input.

So what are my options here?

The following aspects are important to me when evaluating possible approaches:

s3 bucket space requirements
block storage space requirements
transfer-out traffic (source bucket)
traffic between availability zones
time requirement per snapshot creation

The following approaches refer to the application data scenario.

Naive approach #1 (not using Restic): use rclone and self-roll a lightweight logic for realizing snapshots

Using rclone, perform periodical full s3 to s3 copies, each into a pristine directory tree (let’s say daily). In smaller time periods (say hourly) perform syncs to the current time slot directory tree. Clean up directory trees moving out of a sliding window (i.e. last 14 days).

This approach is not good enough for several reasons: Space requirements, total loss of control regarding traffic and time (depending on how rclone handles s3 to s3 sync, does it download the whole source bucket?)

So, what can I do to bring Restic into play?

Naive approach using Restic #1: intermediary copy to block storage, then use Restic

Referring to the application data scenario.

For each backup-run I can transfer all data from the source bucket to a target block storage and use restic from there. This takes too long and causes worst-case transfer-out traffic from the source bucket.

Alternatively I can keep a permanent mirror of the application data on block storage and periodically sync the source bucket to block storage (for example using rclone).

Basically both alternatives are a no-go due to block storage space requirements alone.

Naive approach using Restic #2: expose source bucket using fusefs, use restic on top

To me this is only a theoretical possibility, because I lose control to what’s happening, in particular with regards to random seeks. Does the driver still download the whole object?

Is there any way that Restic can theoretically perform s3 to s3 backups in the future? Or is that out of the question due to first principles (like random seeks in input data?).

alexweiss · April 23, 2021, 3:09pm

I don’t know the details of the fuse FS. But in fact, restic only needs to sequentially read all files. More important is the change detection, i.e. which files restic needs to read at all. If restic can be sure that the file has not been changed it is not read at all. This decision is based on inodes, file name and time stamps. There are options like --ignore-inode which can help with your fuse FS. But best is you just run a backup, unmount, mount again an re-run the backup to see if your files in the fuse dir are re-read.

It is theoretically possible. But I think you can easily start with the proposed fuse FS.

tomwaldnz · April 25, 2021, 7:27pm

Have you consider S3 native replication with object versioning and lifecycle policies? It might not perfectly fit your use case but given it’s native it would be easier and likely cheaper.

lobsang · April 26, 2021, 6:44am

Thank you for your feedback.

I looked a bit more into fuse solutions over the weekend. Looks like there are several options to try, goofys looks good for a first try, it does not try to implement operations which are difficult to realize with s3 mechanics. If that works I can be reasonably sure, that there are no funny things happening below the surface.

lobsang · April 26, 2021, 6:46am

Thank you for the suggestion. Unfortunately our s3 storage service provider does not support native replication.

tomwaldnz · April 26, 2021, 7:31am

So, you’re not using AWS S3, you’re using a storage provider with an AWS S3 compatible interface? Does that provider offer backups as a service?

You’d probably be better off using a product made to handle this rather than trying to get Restic to do something it’s not really made for. CloudBerry Backup (now MSP 360 backup) is quite flexible and might be able to do this for you.

Nyxton · March 5, 2024, 11:24am

Did you manage to make this solution work? I am currently in the situation where I need to perform a backup of an S3 bucket to another S3 bucket, which are provided by Ceph.

Is there something else you did if this solution didn’t work out in the end?