One of my backups took longer than I expected and I wanted to know why, so I created a small Python script that has no 3rd party dependencies: find-restic-anchor. It runs a few Restic commands and shows which files added or modified in the latest backup are the largest. Maybe others will find this helpful too. Feedback is welcome.
“Also, find-restic-anchor doesn’t list files that don’t exist locally anymore, and it shows the current local size of the files, not necessarily the size they were when they were backed up.”
This is actually quite a strong restriction as you never know what you miss here… I suggest you combine it with an ls run and capture the sizes from the snapshot.
Also, instead of showing the on-disk size of the files, it would be more informative to show the size added to the repository, as this is the interesting number (though in many cases the two will correlate).
The problem is that restic has no functionality to show or extract that information on a per-file basis, AFAIK.
FWIW, as a coincidence I started to implement exactly this functionality in rustic recently, see this rustic Discussion. To try it out, get the latest nightly build and use the interactive mode:
- run rustic snapshots -i
- select the snapshot
- type D to get to the diff view (diff with parent snapshot is default)
- either directly navigate through the changes or type s to compute summaries over all sub-dirs
- note that d toggles view of identical entries and m toggles ignoring only metadata changes
This is actually quite a strong restriction as you never know what you miss here
Thought about it and I agree. I updated the script so that it now gets the file sizes from the repository instead of the disk. I was expecting that to have a larger performance impact, but it didn’t change much and is still fast enough for me. Find-restic-anchor runs in about 7.9 seconds on my 18.2 GB total (about 100 snapshots) stored in Backblaze B2. I bet Rustic’s version of the feature runs several times faster though, and a Restic version probably would too.
man restic-ls
restic ls --long --human-readable --sort=size [latest or SNAPSHOT_ID] | tail
Yes, find-restic-anchor uses restic ls and a few other Restic commands to figure out why the last backup took longer or was larger than normal. The output of restic ls includes files that didn’t change, so other commands are needed to filter those out.
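As a rough illustration of that filtering step, here is a minimal sketch that combines the JSON output of restic diff and restic ls. It is not the actual find-restic-anchor code. The function names are hypothetical, and the JSON field names (message_type, modifier, path, type, size) are assumptions based on restic's JSON output that should be checked against your restic version.

```python
import json
import subprocess


def run_json_lines(args):
    """Run a command and yield one parsed JSON object per non-empty output line."""
    out = subprocess.run(args, check=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.strip():
            yield json.loads(line)


def largest_changed(ls_records, diff_records, top=10):
    """Return (size, path) pairs for added or modified files, largest first."""
    # Assumption: diff lines look like
    # {"message_type": "change", "path": "...", "modifier": "+" or "M" ...}
    changed = {
        r["path"]
        for r in diff_records
        if r.get("message_type") == "change"
        and any(flag in r.get("modifier", "") for flag in "+M")
    }
    # Assumption: ls lines for files carry "type": "file", "path" and "size".
    sizes = [
        (r.get("size", 0), r["path"])
        for r in ls_records
        if r.get("type") == "file" and r.get("path") in changed
    ]
    return sorted(sizes, reverse=True)[:top]


def main(parent, latest):
    """Print the largest changed files between two snapshot IDs."""
    diff = run_json_lines(["restic", "diff", "--json", parent, latest])
    ls = run_json_lines(["restic", "ls", "--json", latest])
    for size, path in largest_changed(ls, diff):
        print(f"{size:>15,d}  {path}")
```

Calling main with the parent and latest snapshot IDs runs restic twice, which matches the two-pass approach described in this thread. The sizes come from the snapshot, not from the local disk.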
Maybe I should add a --human-readable option to find-restic-anchor? The script currently shows byte counts without converting to GiB, MiB, etc.
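Such an option could be backed by a small formatting helper. The following is only a sketch of one way to do it. The name human_readable is hypothetical and not part of find-restic-anchor, and the choice of binary (1024-based) units is an assumption.

```python
def human_readable(num_bytes):
    """Format a byte count using binary (IEC) units such as KiB, MiB, GiB."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if abs(value) < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(value)} B"
            return f"{value:.1f} {unit}"
        value /= 1024
```

For example, human_readable(1536) gives "1.5 KiB", while plain byte counts below 1024 are printed without a decimal point.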
I think we must separate 3 use cases:
1. find the largest (by restore-size) files in the latest snapshot
2. find the largest (by restore-size) files in the latest snapshot which were added or modified compared to its ancestor
3. find the files which contributed most to the size that the latest snapshot added to the repository
For 1), use the solution proposed by @Ilya; for 2), use find-restic-anchor. (And rustic’s interactive mode is able to answer all of the above questions.)
In fact not that much. The main problem of find-restic-anchor is that it has to call restic twice, i.e. the index is loaded twice and the trees are traversed twice. And the third question requires comparing the lists of blob IDs used in files (and querying the index for those), which AFAIK are not available from restic commands unless you use only low-level commands like restic list and restic cat and do everything yourself.
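To make the blob comparison for the third question concrete, here is a minimal sketch of the bookkeeping only. It assumes you have already collected, by whatever means, the blob IDs per file, the blob sizes from the index, and the set of blob IDs that existed before the backup. All names and input structures are hypothetical, and extracting these inputs from a repository is not shown.

```python
def added_bytes_per_file(file_blobs, blob_sizes, existing_blobs):
    """Attribute newly stored bytes to the files of a new snapshot.

    file_blobs:     {path: [blob_id, ...]} for files in the new snapshot
    blob_sizes:     {blob_id: size_in_bytes}, e.g. looked up from the index
    existing_blobs: set of blob IDs already stored before this backup
    """
    seen = set(existing_blobs)
    added = {}
    # Walk files in a fixed order so a blob shared by several new files
    # is counted once, for the first file that uses it.
    for path in sorted(file_blobs):
        total = 0
        for blob_id in file_blobs[path]:
            if blob_id not in seen:
                seen.add(blob_id)
                total += blob_sizes[blob_id]
        if total:
            added[path] = total
    return sorted(added.items(), key=lambda kv: kv[1], reverse=True)
```

Because of deduplication, the attribution of a shared blob to one file rather than another is a design choice of this sketch, so the per-file numbers are only one reasonable way to split the total.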
I had a feeling something like that was happening. I’ll have to learn more about how Restic works sometime; it sounds interesting.
Thanks for sharing your script. I also wanted to know what files were taking the most room. I log every backup, so I worked from that angle. I found that during the backup operation --verbose did not show any of the individual file changes and that --verbose=2 would list every file in the repo. I tried a run with --verbose=2 but ended up with a log file of maybe a GB, so it was not too helpful. Just recently I finally figured out and tested combining the --verbose=2 option with grep, and I like the results so far. I think this option would work well as long as the number of files changing on each run is less than a few hundred or maybe a thousand.
The backup command in my script is something like this:
restic backup [options] --verbose=2 | grep -Ev '^unchanged /|/, saved in [0-9]*\.[0-9]+s \(0 B added, '
The first part of the regex is to remove unchanged files and the last part is to remove directories.
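For a wrapper script written in Python, the same filter could be expressed roughly as follows. This is a sketch that reuses the regex above unchanged. The names NOISE, keep_line and filter_stdin are hypothetical.

```python
import re
import sys

# Same pattern as the grep -Ev call: drop unchanged files and
# directory entries that added 0 B.
NOISE = re.compile(r"^unchanged /|/, saved in [0-9]*\.[0-9]+s \(0 B added, ")


def keep_line(line):
    """Return True if the log line should be kept (grep -v semantics)."""
    return NOISE.search(line) is None


def filter_stdin():
    """Copy stdin to stdout, skipping the noisy lines."""
    for line in sys.stdin:
        if keep_line(line):
            sys.stdout.write(line)
```

re.search is used rather than re.match because grep matches anywhere in the line, and only the first alternative is anchored with ^.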
Don’t forget that Restic natively supports JSON output for various commands, which is designed to be ingested by other tools like jq for parsing and extracting the kind of data you are looking for.
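If jq is not at hand, the JSON lines can also be read with Python's standard library. Below is a minimal sketch that picks the final summary message out of the output of restic backup --json. The field names message_type and data_added are assumptions based on restic's documented JSON output and should be verified for your version. The function name is hypothetical.

```python
import json


def backup_summary(json_lines):
    """Return the last message with message_type "summary", or None."""
    summary = None
    for line in json_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or mixed output
        if msg.get("message_type") == "summary":
            summary = msg
    return summary
```

Feeding it the lines of a saved log, for example via open("backup.log"), would give a dict from which values such as data_added can be read, assuming that field is present.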