Finding the last snapshot, fast

gurkan · June 21, 2024, 1:51pm

Hi

I have a big repository on an S3 endpoint (~10k snapshots) that serves multiple hosts. When I want to test the latest backup for a specific host, finding the correct snapshot ID seems to be the slowest part of the restore.

Currently I find it with restic snapshots -H desiredhostname --latest 1 and pass the resulting ID to the restore command.

As an alternative, I am planning to place a last_snapshot_ids file in the bucket and refresh it from the clients after each successful backup. Since (thankfully) restic does not wipe extra files/folders in a repository while pruning, this would let me look up the latest snapshot ID in O(1).
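A minimal sketch of how such a pointer file could be refreshed, assuming it is a small JSON document mapping hostname to snapshot ID. The function name and the JSON layout are made up for illustration; fetching and uploading the file is left to whatever S3 client the backup wrapper already uses.

```python
import json


def update_pointer(existing: str, hostname: str, snapshot_id: str) -> str:
    """Return new contents for the (hypothetical) last_snapshot_ids file."""
    # An empty or not-yet-created file is treated as an empty mapping.
    mapping = json.loads(existing) if existing.strip() else {}
    mapping[hostname] = snapshot_id
    # Sorted keys keep the file stable between uploads.
    return json.dumps(mapping, indent=2, sort_keys=True)
```

With many clients finishing backups at the same time, a single shared file is a read-modify-write race. One small object per host would avoid lost updates at the cost of more objects in the bucket.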

(I am aware this will leak the hostnames of clients, but that’s a non-risk for my case)

Is there a planned feature, or an existing parameter I am missing, that I could use within restic for this?

Thanks!

rawtaz · June 21, 2024, 2:09pm

It’s somewhat unclear, are you saying that you think the restore command’s way of getting to the latest snapshot, when you run e.g. restore -H desiredhostname latest is slow?

Other than the command above I’m not aware of a quicker way to do it with restic. I mean, it does have to load information about the snapshots in order to be able to find which is the most recent one, in particular when you filter it on hostname.

Other than that, outside of restic, you could look at the timestamps of the snapshot files to find the most recent one and take its filename as the ID, unless you save the snapshot ID at the time of backup like you suggested.

gurkan · June 21, 2024, 2:23pm

Thanks for the quick reply!

Restore itself is fast enough, but it “needs” a snapshot ID to work; afaik it is not optional. I’d like to restore the latest snapshot for a host, and the only way I know to get its ID is to list snapshots with the --latest 1 argument. This is the slow part, since I have a lot of snapshots.

I also thought of checking the snapshot file timestamps, but there is no way of knowing which snapshots come from the host I need to restore, since it is a multi-host repo.

rawtaz · June 21, 2024, 2:27pm

What makes you think that you have to give restore a snapshot ID? You can just give it latest as per my suggestion, and it will restore the latest snapshot.

If you filter on a hostname by also adding -H desiredhostname then the latest snapshot for that hostname will be restored, which seems to be what you want.
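For anyone scripting this, a minimal sketch of the call from Python. The function names are illustrative, and the repository location and credentials are assumed to come from the environment (RESTIC_REPOSITORY, RESTIC_PASSWORD and the backend's access keys).

```python
import subprocess


def restore_command(hostname: str, target: str) -> list[str]:
    """Build the argv for restoring the newest snapshot of one host."""
    # "latest" stands in for an explicit snapshot ID;
    # --host (short form: -H) narrows the search to a single host.
    return ["restic", "restore", "latest", "--host", hostname, "--target", target]


def restore_latest(hostname: str, target: str) -> None:
    # check=True raises CalledProcessError if restic exits non-zero.
    subprocess.run(restore_command(hostname, target), check=True)
```

Calling restore_latest("desiredhostname", "/tmp/restore") would then run the restore in one step, with no separate snapshots listing beforehand. The target path here is just an example.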

gurkan · June 21, 2024, 2:58pm

What makes you think that you have to give restore a snapshot ID? You can just give it latest as per my suggestion, and it will restore the latest snapshot.

That… is a very good point. I missed that feature:

The special snapshotID "latest" can be used to restore the latest snapshot in the repository.

I will test the speed difference. Thanks!

rawtaz · June 21, 2024, 3:16pm

Very nice, let us know the difference, considering you have a lot of snapshots there

alexweiss · June 22, 2024, 4:39am

If reading snapshot files is the limiting factor, restic snapshots will be as slow as restic restore latest, since in both cases all snapshot files have to be read, as @rawtaz already pointed out.

However, the client only needs the list of snapshot IDs from the storage backend. The actual snapshot files are read from the cache, if you use one and they are present there. @gurkan Are you using a cache in your case?

alexweiss · June 22, 2024, 4:54am

If you really want to keep a duplicate data store for this kind of information, I suggest using a simple relational database containing the snapshot ID and any other information you might want to query.
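As a rough sketch of what I mean, using SQLite from Python. The table layout, column names and function names are only an assumption for illustration; a real setup would likely track more fields (paths, tags, backup status).

```python
import sqlite3
from typing import Optional


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small side index of snapshots."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " snapshot_id TEXT PRIMARY KEY,"
        " hostname TEXT NOT NULL,"
        " finished_at TEXT NOT NULL)"  # ISO 8601 timestamp, always UTC
    )
    # Index so the per-host "newest first" lookup does not scan the table.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS by_host ON snapshots (hostname, finished_at)"
    )
    return conn


def record_snapshot(conn: sqlite3.Connection, snapshot_id: str,
                    hostname: str, finished_at: str) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO snapshots VALUES (?, ?, ?)",
            (snapshot_id, hostname, finished_at),
        )


def latest_snapshot(conn: sqlite3.Connection, hostname: str) -> Optional[str]:
    row = conn.execute(
        "SELECT snapshot_id FROM snapshots WHERE hostname = ?"
        " ORDER BY finished_at DESC LIMIT 1",
        (hostname,),
    ).fetchone()
    return row[0] if row else None
```

Sorting on the text column only works if every timestamp uses the same format and time zone, which is why the sketch assumes UTC throughout.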

Actually, I think a use case as large as yours involves many more things to consider than “only” a dedicated store for querying existing snapshots. You might want central scheduling and some additional status information, such as a list of failed backups with reasons. You might also want more statistics, such as a time series of backup information for all hosts/paths.

I started such a project, rustic_scheduler (GitHub: rustic-rs/rustic_scheduler, “Schedule rustic backups for many clients to a common repository”), but it is in a very early phase and I currently lack the time to work on it…

gurkan · June 22, 2024, 5:57pm

Ouch. Alright, then it feels like I did something right. My backup client wrapper now also records the latest snapshot ID (returned from restic) in the repo for each host.
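For reference, a minimal sketch of one way to grab that ID, assuming the wrapper runs restic backup --json and captures its stdout. This is not my actual wrapper code, and the function name is made up. With --json, restic prints one JSON message per line and ends with a summary message that carries the snapshot ID.

```python
import json
from typing import Optional


def snapshot_id_from_backup_output(output: str) -> Optional[str]:
    """Extract the snapshot ID from the output of `restic backup --json`."""
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a JSON object.
        if not line.startswith("{"):
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue
        # Status messages are ignored; only the summary has the ID.
        if message.get("message_type") == "summary":
            return message.get("snapshot_id")
    return None
```

A None result means no summary was seen, for example because the backup failed, so the stored ID for that host should be left untouched in that case.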

I cannot use a cache, since this is for automated disaster recovery tests which spawn a fresh machine to restore from the repository. I can’t assume anything is ready other than the credentials to a Backblaze endpoint.

Luckily (or sadly) this was all done on my side long ago. I have a handmade multi-threaded queue in Python, which can even handle restic’s lack of live-prune. But I am also trying to separate disaster recovery from my “single source of truth” a bit.
I saw the scheduler recently though, neat (and necessary) idea!