I’ve been thinking about anomaly detection in snapshots for a long time: at first just as an idea, but by now also as a concrete implementation of specific patterns. I would like to hear your ideas and feedback.
In our setups we run daily prune jobs, and backups are created on append-only backends. The restic documentation mentions, for example, empty-snapshot attacks against append-only backends. Before pruning we run an anomaly detection script; if an anomaly is detected, the prune job is not executed.
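The gating logic can be sketched as a small wrapper that only runs the prune command when the check succeeds. The command names below are placeholders, not our actual setup; in the demo, harmless stand-ins replace the real anomaly check and "restic forget --prune":

```python
import subprocess
from typing import Sequence

def prune_if_healthy(check_cmd: Sequence[str], prune_cmd: Sequence[str]) -> bool:
    """Run the prune command only when the anomaly check exits with 0.

    Returns True when pruning actually ran.
    """
    if subprocess.run(check_cmd).returncode != 0:
        return False  # anomaly detected: skip the prune job
    subprocess.run(prune_cmd, check=True)
    return True

# Demo with stand-in commands (in real use: your anomaly script
# and something like ["restic", "forget", "--prune", ...]):
print(prune_if_healthy(["true"], ["echo", "pruning"]))   # True
print(prune_if_healthy(["false"], ["echo", "pruning"]))  # False
```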
What I do for now is to run

restic stats --json --mode raw-data latest latest^1

and compare the total_uncompressed_size and total_blob_count metrics. My current interpretation is that total_uncompressed_size represents the uncompressed restore size of a complete snapshot, and total_blob_count represents the number of unique data chunks. We calculate the ratio of each metric between the latest and the previous snapshot and interpret it as follows:
100: same size
>100: latest snapshot has more data than the previous one
<100: latest snapshot has less data than the previous one
50: latest snapshot has half the size of the previous one
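As a sketch, the comparison could look like this. The field names follow the JSON output of restic stats --mode raw-data; the 50% threshold is only an example cut-off, not a recommendation:

```python
import json

def ratio_percent(latest: dict, previous: dict, key: str) -> float:
    """Latest snapshot's metric relative to the previous one, in percent."""
    return 100.0 * latest[key] / previous[key]

def is_anomaly(latest: dict, previous: dict, min_percent: float = 50.0) -> bool:
    """Flag a drastic drop in either metric (threshold is a placeholder)."""
    return any(
        ratio_percent(latest, previous, key) < min_percent
        for key in ("total_uncompressed_size", "total_blob_count")
    )

# Hand-made stats objects in the shape of
# `restic stats --json --mode raw-data <snapshot>` output:
latest = json.loads('{"total_uncompressed_size": 40, "total_blob_count": 10}')
previous = json.loads('{"total_uncompressed_size": 100, "total_blob_count": 12}')
print(is_anomaly(latest, previous))  # True: a drop to 40% is flagged
```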
So when the backup data drops drastically, I interpret this as an anomaly. Even though that could of course be normal behaviour, better safe than sorry.
The age of the latest snapshot is also checked: a backup older than, for example, 3 days could also be an anomaly.
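The age check can work off the time field that restic snapshots --json reports for each snapshot. A rough sketch (the 3-day limit is the example threshold from above; real restic timestamps carry fractional seconds that may need trimming before parsing on older Python versions):

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=3)  # example threshold, adjust to your schedule

def snapshot_too_old(snapshot_time: str, now: datetime) -> bool:
    """True when the latest snapshot is older than the allowed age.

    snapshot_time is the ISO timestamp from the snapshot's "time" field.
    """
    taken = datetime.fromisoformat(snapshot_time)
    return now - taken > MAX_AGE

# Fixed "now" so the example is reproducible:
now = datetime(2024, 6, 10, tzinfo=timezone.utc)
print(snapshot_too_old("2024-06-05T00:00:00+00:00", now))  # True: 5 days old
print(snapshot_too_old("2024-06-09T12:00:00+00:00", now))  # False
```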
I’m very interested in your opinions, and secretly I wish there were an anomaly detection command directly in restic for typical possible abnormalities.