Handling of large immutable storages, best practice

Hi,

I manage a large immutable blobstore that I back up to an S3 bucket every few hours. While Restic has been effective thus far, I’ve noticed performance issues as the blobstore continues to grow.

Currently, each backup involves Restic scanning millions of files to detect changes. Although only new files are added, this scanning process takes several hours.

Given the immutability of my storage, I believe Restic could bypass the scanning of previously backed-up files. However, using the --exclude option isn’t feasible, as it would remove these files from the snapshot, complicating future restores. My goal is for every snapshot to include all files in the storage, with the already backed-up files carried over from the previous snapshot instead of being rescanned, thereby reducing backup time.

Is there a way to achieve this? For context, my storage structure is organized as follows:

storage/<year>/<day_number>/<filename>

Ideally, I would check which directories are already present in the latest snapshot (restic ls -l) and carry those paths over unchanged into the new one. I did not find any option for this.
As a workaround, I could imagine splitting the storage snapshot into smaller storage/<year> snapshots and grouping them with a tag to form a complete restore bundle. This would complicate the backup script, pollute the list of backed-up paths, and still require a scan of the complete year in December.
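
To make the per-year workaround concrete, here is a rough sketch of how a backup script might decide which year directories still need a run. It only builds restic command lines and does not execute anything. The function name, its parameters, and the bundle tag are hypothetical; the only restic features it relies on are `restic backup` with `--tag`. It assumes the caller knows which years already have a snapshot of their final state.

```python
def plan_yearly_backups(years_on_disk, years_in_repo, current_year,
                        bundle_tag, root="storage"):
    """Build one `restic backup` command per year directory that still needs it.

    years_on_disk: year numbers found under `root` (assumption: storage/<year>/...).
    years_in_repo: closed years whose FINAL state is already snapshotted
                   (assumption: the caller tracks this, e.g. via tags).
    bundle_tag:    hypothetical tag used to group snapshots into one bundle.
    """
    commands = []
    for year in sorted(years_on_disk):
        closed = year < current_year and year in years_in_repo
        if closed:
            # Immutable and already backed up: no scan needed for this year.
            continue
        commands.append(
            ["restic", "backup", "--tag", bundle_tag, f"{root}/{year}"]
        )
    return commands


if __name__ == "__main__":
    for cmd in plan_yearly_backups([2022, 2023, 2024], {2022, 2023},
                                   2024, "bundle-example"):
        print(" ".join(cmd))
```

Even in this form, the current year is still scanned in full on every run, which is the December problem described above, and a restore has to collect the matching snapshot for each year.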

I wanted to check first whether I’m missing something simpler and more straightforward.

Best regards

Question: if you don’t delete files that you want to have in an earlier snapshot, doesn’t this sound more like a job for rsync/rclone than restic? What’s the advantage you gain from using restic?

The blobstore snapshot, along with three other mutable snapshots, constitutes a complete system backup. These components must be consistent with each other and treated as a single, versioned backup with an associated timestamp. I should be able to restore specific past versions of the entire backup, not just the most recent one.

Furthermore, I need the backup to be both encrypted and deduplicated.

Okay, understood. restic does make sense in that case :innocent: I’m pretty sure restic itself can’t solve that problem, but maybe someone else has a clever idea?

The --no-scan option might help a bit with performance. If the file metadata hasn’t changed, then restic is only busy with listing the directory contents. To avoid that, you’d need some other way to compile a list of new/changed files, plus some future functionality like that suggested in "Option to create cumulative snapshots" (restic/restic issue #4804 on GitHub).
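
As an illustration of "some other way to compile a list of new/changed files", the sketch below exploits the storage/<year>/<day_number>/<filename> layout from the original post: it only lists day directories at or after the last day that was backed up, instead of walking the whole tree. The function name and its parameters are hypothetical, and it assumes that year and day directory names are plain numbers. Note that restic cannot currently turn such a list into a complete snapshot; that is what the cumulative-snapshot proposal is about.

```python
import os


def candidate_files(root, last_year, last_day):
    """List files in day directories at or after (last_year, last_day).

    Assumes the layout root/<year>/<day_number>/<filename>, with numeric
    year and day directory names. Older day directories are never opened,
    so their files are not listed or stat'ed.
    """
    found = []
    for year in os.listdir(root):
        year_dir = os.path.join(root, year)
        if not year.isdigit() or not os.path.isdir(year_dir):
            continue
        if int(year) < last_year:
            continue  # whole year is older than the last backup
        for day in os.listdir(year_dir):
            day_dir = os.path.join(year_dir, day)
            if not day.isdigit() or not os.path.isdir(day_dir):
                continue
            if (int(year), int(day)) < (last_year, last_day):
                continue  # day directory was complete at the last backup
            with os.scandir(day_dir) as entries:
                found.extend(e.path for e in entries if e.is_file())
    return sorted(found)
```

The last backed-up day itself is included on purpose, since files may have been added to it after the previous run.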

It’s on the roadmap for the next restic version, but progress in that direction is currently much slower than I’d like.

Thanks for the hint. The --no-scan option reduces the backup time from 60 minutes to 42:

using parent snapshot 917f4a51

Files:           1 new,     0 changed, 1504856 unmodified
Dirs:            0 new,     5 changed,   250 unmodified
Added to the repository: 495.521 KiB (131.352 KiB stored)

processed 1504857 files, 1005.872 GiB in 42:02
snapshot 3d81d800 saved

The linked issue looks like what we need. Upvoted.

Best regards
