Anomaly detection in backup snapshots

I’ve been thinking about anomaly detection in snapshots for a long time. At first it was just an idea, but I now also have a concrete implementation for a few specific patterns. I would like to hear your ideas and feedback.

In our setups we run daily prune jobs. Backups are created on append-only backends. The restic documentation mentions, for example, empty-snapshot attacks in connection with append-only backends. Before pruning we run an anomaly detection script. If an anomaly is detected, the prune job is not executed.
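The gating step itself can be sketched roughly like this. It is only an illustration: `prune_if_clean`, `checks` and `prune` are hypothetical names, and the actual restic invocation is left to whatever callable you pass in.

```python
from typing import Callable, Iterable


def prune_if_clean(checks: Iterable[Callable[[], bool]],
                   prune: Callable[[], None]) -> bool:
    """Run prune only if no check reports an anomaly.

    Each check returns True when it sees an anomaly.
    Returns True if prune was executed, False if it was skipped.
    """
    # any() stops at the first check that reports an anomaly.
    if any(check() for check in checks):
        return False
    prune()
    return True
```

In a cron job, `prune` would wrap the real prune call and each check would wrap one detection rule, so adding a new rule does not touch the prune logic.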

What I do for now is run restic stats --json --mode raw-data latest latest^1 and compare the total_blob_count_ratio and total_uncompressed_size_ratio metrics. My current interpretation is that total_uncompressed_size_ratio reflects the uncompressed restore size of a complete snapshot, while total_blob_count_ratio reflects the number of unique data chunks. We calculate the ratio between the latest and the previous snapshot, expressed as a percentage, and interpret it as follows:

100 : same size
>100: latest snapshot has more data than the previous one
<100: latest snapshot has less data than the previous one
50 : latest snapshot is half the size of the previous one

So when the amount of backup data drops drastically, I treat this as an anomaly. Such a drop could of course be perfectly normal behaviour, but better safe than sorry.
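For illustration, the ratio check could look something like the sketch below. It assumes the two stats outputs have already been parsed into dicts and that they carry total_blob_count and total_uncompressed_size fields. The function names and the 50% threshold are hypothetical, not what my script necessarily uses.

```python
# Fields assumed to be present in the parsed stats output of each snapshot.
METRICS = ("total_blob_count", "total_uncompressed_size")


def ratio_percent(latest: float, previous: float) -> float:
    """Return latest as a percentage of previous (100 means same size)."""
    if previous <= 0:
        # No usable baseline: "unchanged" if both are empty, otherwise growth.
        return 100.0 if latest <= 0 else float("inf")
    return 100.0 * latest / previous


def is_anomalous(latest_stats: dict, previous_stats: dict,
                 min_ratio: float = 50.0) -> bool:
    """Flag an anomaly if any metric falls below min_ratio percent.

    min_ratio=50.0 is an illustrative threshold, not a recommendation.
    """
    return any(
        ratio_percent(latest_stats[m], previous_stats[m]) < min_ratio
        for m in METRICS
    )
```

Only drops are flagged here; a snapshot that grows a lot passes, which matches the "better safe than sorry" reasoning for pruning but may not be what you want for alerting.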

The age of the latest snapshot is also checked. A latest snapshot that is older than 3 days, for example, could also be an anomaly.
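The age check is simple enough to sketch in a few lines. This assumes the snapshot timestamp has already been parsed into a timezone-aware datetime. `snapshot_too_old` is a hypothetical helper name and 3 days is just the example value from above.

```python
from datetime import datetime, timedelta, timezone


def snapshot_too_old(snapshot_time: datetime, now: datetime,
                     max_age_days: int = 3) -> bool:
    """Return True if the snapshot is older than max_age_days."""
    # Both datetimes must be timezone-aware (or both naive) to be subtracted.
    return now - snapshot_time > timedelta(days=max_age_days)


# Example: a snapshot taken four days before "now" counts as too old.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
four_days_ago = now - timedelta(days=4)
stale = snapshot_too_old(four_days_ago, now)
```

Passing `now` in explicitly keeps the function easy to test; in a real job it would be `datetime.now(timezone.utc)`.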

I’m very interested in your opinion, and secretly I wish there were an anomaly detection command directly in restic for typical abnormalities.

Some discussion on this relevant issue.

Good idea. We normally send the metrics you mentioned (plus others) to Grafana and keep them on record so we can see each host’s backup size history.

We also create alerts for applicable hosts without waiting for prune time (e.g. keeping the average backup size of the last ~10 snapshots and alerting if the size of a new snapshot drops by more than 40%).
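As a rough sketch of that kind of rule, assuming the size history is available as a plain list of numbers (oldest first): the function name and defaults below are made up for illustration, and in practice this logic would usually live in the alerting system rather than in a script.

```python
from statistics import mean
from typing import Sequence


def dropped_below_average(history: Sequence[float], new_size: float,
                          window: int = 10, max_drop: float = 0.40) -> bool:
    """Return True if new_size is more than max_drop below the recent average.

    history holds sizes of earlier snapshots, oldest first; only the last
    `window` entries are used as the baseline.
    """
    recent = list(history)[-window:]
    if not recent:
        return False  # nothing to compare against yet
    baseline = mean(recent)
    if baseline <= 0:
        return False  # an empty baseline cannot drop any further
    drop = 1.0 - new_size / baseline
    return drop > max_drop
```

Comparing against an average instead of only the previous snapshot makes the check less sensitive to a single unusually large or small backup.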

I suspect detection of this kind of issue by restic itself would not be optimal, since it’s valid to have “empty” snapshots in some environments.