Anomaly detection in backup snapshots

meise · October 17, 2023, 8:55am

I’ve been thinking about anomaly detection in snapshots for a long time, at first just as an idea and now also as a concrete implementation that checks for specific patterns. I would like to hear your ideas and feedback.

In our setups we run daily prune jobs. Backups are created on append-only backends. The restic documentation mentions, for example, empty-snapshot attacks against append-only backends. Before pruning we run an anomaly detection script. If an anomaly is detected, the prune job is not executed.
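A minimal sketch of such a gate, assuming the detection script hands over a list of anomaly messages. The function name `guarded_prune` and the injectable `run` parameter are hypothetical; only the `restic prune` command itself is real.

```python
import subprocess


def guarded_prune(anomalies, run=subprocess.run):
    """Run `restic prune` only when no anomaly was reported.

    `anomalies` is a list of human-readable messages (hypothetical format).
    `run` is injectable so the gate can be exercised without restic installed.
    Returns True if prune was started, False if it was skipped.
    """
    if anomalies:
        for message in anomalies:
            print(f"anomaly detected, skipping prune: {message}")
        return False
    # Repository and credentials are expected in the environment
    # (for example RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE).
    run(["restic", "prune"], check=True)
    return True
```

Skipping the prune and returning `False` leaves the decision to a human, which matches the "better safe than sorry" approach.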

What I do for now is run restic stats --json --mode raw-data latest latest^1 and compare two metrics derived from the output, total_blob_count_ratio and total_uncompressed_size_ratio. My current interpretation is that total_uncompressed_size represents the uncompressed restore size of a complete snapshot, and total_blob_count represents the unique data chunks. We calculate the ratio between the latest and the previous snapshot and interpret it as follows:

100 : same size
>100: latest snapshot has more data than the previous one
<100: latest snapshot has less data than the previous one
50 : latest snapshot is half the size of the previous one

So when the backup data drops drastically, I interpret this as an anomaly. Even though that could of course be normal behaviour, better safe than sorry.
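The ratio check can be sketched roughly like this. It assumes two dictionaries parsed from the JSON output of restic stats, one per snapshot, each containing the fields total_blob_count and total_uncompressed_size. The function names and the threshold of 50 are hypothetical and only for illustration.

```python
def percent_ratio(latest, previous):
    """Return `latest` as a percentage of `previous` (100 means same size)."""
    if previous <= 0:
        # No usable baseline: equal if both are empty, otherwise "infinitely" larger.
        return 100.0 if latest == 0 else float("inf")
    return 100.0 * latest / previous


def find_anomalies(latest_stats, previous_stats, min_ratio=50.0):
    """Return a message for every metric whose ratio falls below `min_ratio`."""
    anomalies = []
    for field in ("total_blob_count", "total_uncompressed_size"):
        ratio = percent_ratio(latest_stats[field], previous_stats[field])
        if ratio < min_ratio:
            anomalies.append(f"{field}_ratio is {ratio:.1f}, below {min_ratio:.1f}")
    return anomalies
```

An empty result means neither metric dropped below the threshold. Where the threshold belongs depends on how much the data normally fluctuates.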

The age of the last snapshot is also checked. If the latest backup is older than 3 days, for example, that could also be an anomaly.
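A possible sketch of the age check, assuming the snapshot timestamp comes from the "time" field of restic snapshots --json, which is an RFC 3339 string that can carry up to nine fractional digits. The helper names and the 3-day default are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_snapshot_time(value):
    """Parse an RFC 3339 timestamp, normalising the fraction to microseconds."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    # datetime.fromisoformat on older Pythons accepts only 3 or 6 fractional digits.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    return datetime.fromisoformat(value)


def is_stale(snapshot_time, now=None, max_age=timedelta(days=3)):
    """Return True if the snapshot is older than `max_age`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - parse_snapshot_time(snapshot_time) > max_age
```

Passing `now` explicitly keeps the check deterministic, which is convenient when testing it.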

I’m very interested in your opinion, and secretly I wish there were an anomaly detection command directly in restic for typical abnormalities.

ProactiveServices · October 17, 2023, 9:16am

Some discussion on this relevant issue.

gurkan · October 17, 2023, 11:33am

Good idea. We normally send the metrics you mentioned (plus others) to Grafana and keep them on record so we can see each host’s backup size history.

We also create alerts on applicable hosts without waiting for prune time (e.g. keeping the average backup size of the last ~10 snapshots and alerting if the size of a new snapshot drops by more than 40%).
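A rough sketch of that kind of alert rule, assuming the sizes of earlier snapshots are already on record as a list of numbers, oldest first. The function name and defaults are hypothetical; the window of 10 and the 40% drop come from the example above.

```python
from statistics import mean


def dropped_vs_recent_average(previous_sizes, new_size, window=10, max_drop=0.40):
    """Return True if `new_size` is more than `max_drop` below the recent average.

    `previous_sizes` holds the sizes of earlier snapshots, oldest first.
    Only the last `window` entries are used as the baseline.
    """
    recent = list(previous_sizes)[-window:]
    if not recent:
        return False  # nothing to compare against yet
    baseline = mean(recent)
    if baseline <= 0:
        return False  # empty baseline, a relative drop is undefined
    drop = (baseline - new_size) / baseline
    return drop > max_drop
```

Compared with looking only at the previous snapshot, averaging over a window makes the check less sensitive to one unusually large or small backup.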

I suspect detecting this kind of issue in restic itself would not be optimal, since it’s valid to have “empty” snapshots in some environments.

wrohdewald · April 22, 2024, 5:13pm

I get

Ignoring "latest^1": no matching ID found for prefix "latest^1"

So - how did you do that?

meise · April 22, 2024, 5:34pm

It’s kind of a typo and more misleading than I thought. You have to use a real snapshot ID. Git-like syntax such as latest^1 is not yet implemented as far as I know:

restic stats --json --mode raw-data e7a19300
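One way to avoid looking up IDs by hand is to take them from restic snapshots --json. This is a sketch only: it assumes each snapshot object has "id" and "time" fields and that the repository and password are configured through the environment. The helper names are hypothetical.

```python
import json
import re
import subprocess
from datetime import datetime


def _parse_time(value):
    """Parse an RFC 3339 timestamp, normalising the fraction to microseconds."""
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    return datetime.fromisoformat(value)


def latest_two_ids(snapshots):
    """Return (latest_id, previous_id) from a list of snapshot objects."""
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots to compare")
    ordered = sorted(snapshots, key=lambda s: _parse_time(s["time"]))
    return ordered[-1]["id"], ordered[-2]["id"]


def restic_json(*args):
    """Run restic with --json and return the parsed output."""
    result = subprocess.run(
        ["restic", *args, "--json"], check=True, capture_output=True, text=True
    )
    return json.loads(result.stdout)


def stats_for_latest_two():
    """Fetch raw-data stats for the latest and the previous snapshot."""
    latest_id, previous_id = latest_two_ids(restic_json("snapshots"))
    return (
        restic_json("stats", "--mode", "raw-data", latest_id),
        restic_json("stats", "--mode", "raw-data", previous_id),
    )
```

If several hosts or paths share one repository, the snapshot list would need to be filtered first so that only comparable snapshots are paired.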

wrohdewald · April 22, 2024, 6:38pm

I see, thanks