I am asking this question here in the hope of getting some inspiration.
I need to perform S3-to-S3 backups in two different flavours:
- Application data is kept in an S3 bucket, and the backup strategy requires that the backups are stored in S3 as well.
- Backups are kept in an S3 bucket and should be replicated into a separate S3 bucket in a different availability zone.
Generally speaking, I want the backups to be structured logically as snapshots, with retention policies applied to them. At-rest encryption of the backup data is mandatory. Minimizing storage and data transfer is also a very strong requirement. Restic looks like the perfect fit. Unfortunately, Restic can only read its input from a filesystem on block storage, not directly from a bucket.
So what are my options here?
The following aspects are important to me when evaluating possible approaches:
- s3 bucket space requirements
- block storage space requirements
- transfer-out traffic (source bucket)
- traffic between availability zones
- time requirement per snapshot creation
The following approaches refer to the application data scenario.
Naive approach #1 (not using Restic): use rclone and roll my own lightweight snapshot logic
Using rclone, perform periodic full S3-to-S3 copies (say, daily), each into a pristine directory tree. At shorter intervals (say, hourly), sync into the directory tree of the current time slot. Clean up directory trees that move out of a sliding window (e.g. the last 14 days).
This approach is not good enough for several reasons: the space requirements (every daily snapshot is a full copy) and a total loss of control over traffic and time. Both depend on how rclone handles an S3-to-S3 sync: does it download the whole source bucket?
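To make the sliding-window idea concrete, here is a minimal sketch in Python of what such a self-rolled logic could look like. It only builds the rclone command lines and does not run them. The remote names (`src:app-data`, `dst:backups`), the one-directory-per-day layout with ISO date names, and the helper functions are my own assumptions for illustration. `rclone sync` and `rclone purge` are existing rclone subcommands; since the first sync of a day goes into an empty directory, it doubles as the daily full copy.

```python
from datetime import date, timedelta


def expired_snapshots(existing: list[str], today: date, keep_days: int = 14) -> list[str]:
    """Return snapshot directory names that fell out of the sliding window.

    Assumes (hypothetically) one directory per day, named YYYY-MM-DD.
    """
    # Oldest day that is still inside the window, counting today as day 1.
    cutoff = today - timedelta(days=keep_days - 1)
    expired = []
    for name in existing:
        try:
            day = date.fromisoformat(name)
        except ValueError:
            continue  # not a snapshot directory, leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


def rclone_commands(src: str, dst_root: str, existing: list[str],
                    today: date, keep_days: int = 14) -> list[list[str]]:
    """Build the rclone invocations for one (e.g. hourly) run."""
    current = f"{dst_root}/{today.isoformat()}"
    # Sync into today's directory; on the first run of the day this is a full copy.
    commands = [["rclone", "sync", src, current]]
    # Remove whole directory trees that moved out of the window.
    for name in expired_snapshots(existing, today, keep_days):
        commands.append(["rclone", "purge", f"{dst_root}/{name}"])
    return commands


if __name__ == "__main__":
    dirs = ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-20"]
    for cmd in rclone_commands("src:app-data", "dst:backups", dirs, date(2024, 1, 20)):
        print(" ".join(cmd))
```

The sketch also shows where the space problem comes from: nothing is shared between the per-day directory trees, so the window holds up to `keep_days` full copies of the bucket.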
So, what can I do to bring Restic into play?
Naive approach using Restic #1: intermediary copy to block storage, then use Restic
This again refers to the application data scenario.
For each backup run, I can transfer all data from the source bucket to block storage and run Restic from there. This takes too long and causes worst-case transfer-out traffic from the source bucket.
Alternatively, I can keep a permanent mirror of the application data on block storage and periodically sync the source bucket to it (for example using rclone).
Basically, both alternatives are a no-go because of the block storage space requirements alone.
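For reference, the permanent-mirror variant could be scripted roughly as follows. This is a minimal sketch in Python, not a tested setup: the remote name, the mirror path, the repository location, and the `--keep-daily 14` retention are placeholders I made up. The commands themselves (`rclone sync`, `restic backup`, `restic forget --keep-daily N --prune`) exist, and Restic would take the repository password and S3 credentials from the environment.

```python
import subprocess


def mirror_and_backup_commands(src_remote: str, mirror_dir: str, repo: str,
                               keep_daily: int = 14) -> list[list[str]]:
    """Build the three steps of one backup run as argument lists."""
    return [
        # 1. Bring the block-storage mirror up to date with the source bucket.
        ["rclone", "sync", src_remote, mirror_dir],
        # 2. Create a deduplicated, encrypted snapshot of the mirror.
        ["restic", "-r", repo, "backup", mirror_dir],
        # 3. Apply the retention policy and drop unreferenced data.
        ["restic", "-r", repo, "forget", "--keep-daily", str(keep_daily), "--prune"],
    ]


def run_backup(src_remote: str, mirror_dir: str, repo: str, dry_run: bool = True) -> None:
    """Print the commands, and execute them only if dry_run is False."""
    for cmd in mirror_and_backup_commands(src_remote, mirror_dir, repo):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # All three values are hypothetical placeholders.
    run_backup("src:app-data", "/mnt/mirror/app-data",
               "s3:s3.amazonaws.com/my-backup-bucket")
```

With this variant only the rclone step touches the source bucket, but the mirror permanently occupies block storage of the same size as the bucket, which is exactly the cost I want to avoid.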
Naive approach using Restic #2: expose the source bucket through a FUSE filesystem and run Restic on top
To me this is only a theoretical possibility, because I lose control over what is happening, in particular with regard to random seeks. Does the driver still download the whole object?
Is there any way that Restic could perform S3-to-S3 backups in the future? Or is that ruled out by first principles (such as the need for random seeks in the input data)?