Advanced usage/churn statistics

Hi there,

I’m using restic to backup a few machines into a shared repository. In order to keep disk usage low, and avoid uploading useless data, I’d like to understand what kind of data is in my repository.

  • Is there some file/directory that changes a lot? If so, I might want to exclude it from backups, if the content is not that important. Example output: some stupid browser cache that I forgot to exclude.
  • Which directories hog a lot of disk space in my repository? I envision something like du -hx --maxdepth 2, aggregating data from multiple snapshots and, possibly, multiple hosts. Example output: some old movie files that shouldn’t be in my backup.

I’d be very happy to learn about tools and scripts that can help me extract this kind of information out of the repository metadata.

Thanks
Carsten

Using restic diff with two snapshots will output a list of files that were added, removed, and modified from the first snapshot to the second. This is a good way to identify these kinds of directories.
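To turn that list into something you can skim, here is a minimal sketch that counts changed entries per directory. It assumes the plain-text output of restic diff, where each change line starts with a modifier (+, - or M) followed by the path; the function name churn_by_directory and the depth parameter are made up for this example.

```python
from collections import Counter
from pathlib import PurePosixPath


def churn_by_directory(diff_lines, depth=2):
    """Count added/removed/modified entries per directory prefix.

    Assumes change lines look like "M    /home/user/file"; any other
    line (header, blank lines, statistics summary) is skipped.
    """
    counts = Counter()
    for line in diff_lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] not in {"+", "-", "M"}:
            continue
        path = parts[1].strip()
        if not path.startswith("/"):
            continue
        # Attribute the change to the first `depth` components of the
        # directory that contains the entry.
        parent = PurePosixPath(path).parent
        prefix = "/" + "/".join(parent.parts[1:depth + 1])
        counts[prefix] += 1
    return counts
```

Feed it the lines printed by restic diff for two snapshot IDs (for example captured with subprocess or saved to a file) and look at counts.most_common(10). Running it over several consecutive snapshot pairs and adding up the counters should make a forgotten cache directory stand out.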

You could indeed use du in combination with restic mount. Just be aware that du may over-report how much space is used, since it does not know about restic’s deduplication. (du has the same issue with e.g. btrfs when processing files with compressed and/or shared extents.)
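If you would rather script it than call du, a rough sketch along these lines walks a directory tree (for example one snapshot directory under a restic mount point) and sums apparent file sizes per directory prefix. The name du_by_depth is hypothetical, and the same caveat applies: these are logical sizes, not deduplicated repository usage. Unlike du, each file is counted once, under its ancestor at max_depth, rather than cumulatively in every parent.

```python
import os
from collections import defaultdict


def du_by_depth(root, max_depth=2):
    """Sum apparent file sizes under root, grouped by directory prefix.

    Files deeper than max_depth are attributed to their ancestor at
    max_depth; shallower files are attributed to their own directory.
    """
    totals = defaultdict(int)
    root = os.path.abspath(root)
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        key = os.sep.join(parts[:max_depth]) or "."
        for name in filenames:
            try:
                # lstat so that symlinks are not followed.
                totals[key] += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # entry vanished or is unreadable; skip it
    return dict(totals)
```

Sorting the result by value gives a quick "biggest directories" list for one snapshot.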

Thanks, I’ll give restic diff a try. I’d love to see some aggregation logic around du so that I don’t have to do it myself, though. But it’s a start, and working on fusefs seems to be easier than parsing the index files.

The btrfs folks have been trying to make btrfs fi du and btrfs fi df make sense for years, and it turns out it’s really hard to summarize usage data when arbitrary deduplication is possible. I’m not saying it can’t happen, but I wouldn’t expect it soon (unless you want to work on it :slight_smile: ).

On a Windows machine, I used a junky Python program to get the two most recent snapshots, run a diff between them, run an ls on the more recent one, and then summarize the file sizes with Pandas. This helped a bit to understand how much Microsoft puts on my personal computer to promote things like Xbox, Phone, etc. A potential problem is that a log file might have grown by just a little bit, but ls shows its whole size, not the amount added by the incremental backup.
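For anyone who wants to try the same approach, here is a minimal sketch of the summarizing step, using only the standard library in place of Pandas. It assumes that restic ls --json prints one JSON object per line and that file entries carry "type", "path" and "size" fields; please check that against your restic version. The function name and the changed_paths argument (the paths reported as changed by a diff) are made up for this example.

```python
import json
from collections import defaultdict
from pathlib import PurePosixPath


def changed_size_by_directory(ls_json_lines, changed_paths, depth=2):
    """Sum the full size of changed files per directory prefix.

    This is an upper bound on what a backup added: a file that grew by
    a few bytes still counts with its whole size.
    """
    changed = set(changed_paths)
    totals = defaultdict(int)
    for line in ls_json_lines:
        line = line.strip()
        if not line:
            continue
        node = json.loads(line)
        # Skip the snapshot header, directories, and unchanged files.
        if node.get("type") != "file" or node.get("path") not in changed:
            continue
        parent = PurePosixPath(node["path"]).parent
        prefix = "/" + "/".join(parent.parts[1:depth + 1])
        totals[prefix] += node.get("size", 0)
    return dict(totals)
```

The log file caveat still holds here: the totals say how large the changed files are, not how many new bytes ended up in the repository.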