Restic grafana dashboard

UhtredTheBold · May 2, 2019, 11:04am

With the 0.9.5 release we can now stream JSON output during backup operations. I had a quick play with my ‘TIG’ stack at home and was able to display some interesting metrics very easily.

I’d be curious to know if anyone has done anything similar/better as I’m little more than a beginner with such things.

Thanks for this new feature

bbigras · May 2, 2019, 2:39pm

In my case I only export stats from restic stats, duration and exit codes to prometheus (with node-exporter’s textfile collector). So I can have a dashboard and alerts (if the backup size drops more than some percent, if it didn’t run last night…).

I use jq to format the restic stats data:
restic stats --json latest | jq -r '"restic_stats_total_size_bytes \(.total_size)\nrestic_stats_total_file_count \(.total_file_count)"' > restic.prom
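
A minimal sketch of the duration and exit code part, for anyone who wants a starting point. The function name run_and_record, the output file name and the two metric names restic_last_run_duration_seconds and restic_last_run_exit_code are assumptions made up for this example. Writing to a temporary file and renaming it keeps the textfile collector from reading a half-written file.

```shell
#!/bin/sh
# Sketch: run a command and record its duration and exit code as Prometheus
# metrics for node-exporter's textfile collector.
# The duration and exit code metric names are hypothetical.

run_and_record() {
    out_dir=$1   # textfile collector directory (assumed to exist)
    shift        # remaining arguments: the command to run

    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)

    # Write to a temp file first, then rename, so the collector never
    # sees a partially written file (it only picks up *.prom).
    tmp="$out_dir/restic_run.prom.$$"
    {
        printf 'restic_last_run_timestamp %s\n' "$end"
        printf 'restic_last_run_duration_seconds %s\n' "$((end - start))"
        printf 'restic_last_run_exit_code %s\n' "$rc"
    } > "$tmp"
    mv "$tmp" "$out_dir/restic_run.prom"
    return "$rc"
}

# Example (paths are placeholders):
# run_and_record /var/lib/node_exporter/textfile restic backup /home
```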

UhtredTheBold · May 2, 2019, 3:19pm

I like the idea of having the most recent snapshots displayed and the time they were taken. I think I will add that information too.

bdillahu · May 3, 2019, 3:54am

Very nice… as an absolute Grafana newbie (i.e. I just got it running a few minutes ago)… any pointers you can give as to how you configured the queries, and what source you are using? Are you pulling with telegraf, or direct to Grafana somehow?

Thanks!

UhtredTheBold · May 3, 2019, 8:08am

Yeah you bet. I use the tail input plugin for telegraf which looks at the log files that restic generates and feeds them into my influxdb database.

The Grafana query to InfluxDB looks something like this: SELECT last("percent_done") FROM "tail" WHERE ("path" = '/your/path/backup.log')

Let me know if you need any more detail

bdillahu · May 7, 2019, 2:38am

Sorry I failed to get back, but appreciate the pointer!

zoispag · March 6, 2020, 3:19pm

That worked great for me.

restic snapshots latest --json | jq -r 'max_by(.time) | .time | sub("[.][0-9]+"; "") | sub("Z"; "+00:00") | def parseDate(date): date | capture("(?<no_tz>.*)(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):(?<tz_min>\\d{2})$") | (.no_tz + "Z" | fromdateiso8601) - (.tz_sgn + "60" | tonumber) * ((.tz_hr | tonumber) * 60 + (.tz_min | tonumber)); parseDate(.) | "restic_last_snapshot_ts \(.)"' > restic.prom.$$

and

restic stats latest --json | jq -r '"restic_stats_total_size_bytes \(.total_size)\nrestic_stats_total_file_count \(.total_file_count)"' >> restic.prom.$$

Thanks a lot for that!

I had to make a small change to get the timestamp to work across servers with different TZ settings, so I did some regex replacements.
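
A possible alternative to the regex handling, sketched under the assumption that GNU date is available (the -d option is a GNU extension and will not work with BSD or busybox date). GNU date parses the UTC offset itself and prints epoch seconds, which do not depend on the server's TZ setting. The function name iso_to_epoch is made up for this sketch.

```shell
#!/bin/sh
# Sketch: convert a restic snapshot time (ISO 8601 with a UTC offset)
# to epoch seconds. Requires GNU date; -d is a GNU extension.

iso_to_epoch() {
    date -u -d "$1" +%s
}

# Usage with the metric name from above:
# ts=$(restic snapshots latest --json | jq -r 'max_by(.time) | .time')
# printf 'restic_last_snapshot_ts %s\n' "$(iso_to_epoch "$ts")"
```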

griffon · March 11, 2020, 11:15pm

I’ve also created a dashboard based on parsing the backup --json output after every job. With the newest version the JSON summary output of restic backup --stdin is also fixed. I generally take the last summary result of the backup job (sometimes I get multiple) and simply transform the JSON object into the Influx line protocol.

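path=/
# Run the backup and capture the streamed JSON status lines.
json=$(/usr/local/bin/restic.sh backup --json --exclude-caches --exclude-file /etc/restic/excludes --tag dir "$path")
rc=$?
# Snapshot ID from the last "summary" message (there can be more than one).
sid=$(echo "$json" | jq -s -r '. | map(select(.message_type | contains("summary"))) | .[length-1] .snapshot_id')
# jq -r prints "null" when no summary message exists, so check for that too.
if [ -n "$sid" ] && [ "$sid" != "null" ] ; then
    tags="host=$(hostname),type=backup,tag=dir,path=$path,snapshot_id=$sid"
    # Flatten the remaining summary fields into key=value lines.
    stats=$(echo "rc=$rc" ; echo "$json" | jq -s '. | map(select(.message_type | contains("summary"))) | .[length-1] | del(.message_type) | del(.snapshot_id)' | sed -e 's/[\{\}", ]//g' -e 's/:/=/g' | grep -v ^$)
    # Join the fields with commas and post a single line-protocol record.
    curl -s -XPOST \
        "http://${INFLUX_HOST}:${INFLUX_PORT}/write?db=${INFLUX_DB}" \
        -u "${INFLUX_USER}:${INFLUX_PASS}" \
        --data-binary "jobs,$tags $(echo $stats | tr " " ",")"
fi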
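
One thing the script above does not handle: in the Influx line protocol, commas, spaces and equals signs inside tag values have to be backslash-escaped, so a path such as "/mnt/my data" would break the write. A small sketch of an escaping helper follows; the name escape_tag is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Sketch: escape a tag value for the Influx line protocol.
# Commas, equals signs and spaces get a leading backslash.

escape_tag() {
    printf '%s' "$1" | sed -e 's/[,= ]/\\&/g'
}

# Example: build the tag set with escaped values.
# tags="host=$(escape_tag "$(hostname)"),type=backup,tag=dir,path=$(escape_tag "$path")"
```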

Combine this with the rest_server Prometheus output and filesystem stats (total backup size), and you can end up with something like this:

Sure there’s always room for improvements (especially when I look at processing the stream during backups runs).

schubter · April 9, 2020, 7:11am

This looks awesome. Could you share your dashboard? I am very interested in the Influx queries from Grafana.

griffon · April 13, 2020, 12:23am

These are just very basic influx queries and very easy to create using the graphical query builder.

But here are some examples:

# Backup Growth Per Day
SELECT sum("data_added") FROM "jobs" WHERE ("host" =~ /^$backuphost$/ AND "tag" =~ /^$backuptags$/) AND $timeFilter GROUP BY time(1d), "host" fill(null)

# Backup Size Per Day
SELECT sum("total_bytes_processed") FROM "jobs" WHERE ("host" =~ /^$backuphost$/ AND "tag" =~ /^$backuptags$/) AND $timeFilter GROUP BY time(1d), "host" fill(null)

# Backup Jobs Table
SELECT "data_added" AS "added", "total_bytes_processed" AS "total_size" FROM "jobs" WHERE ("tag" =~ /^$backuptags$/ AND "host" =~ /^$backuphost$/) AND $timeFilter GROUP BY "host", "path", "tag", "snapshot_id" ORDER BY time DESC

# Backup Size Table
SELECT sum("total_bytes_processed") AS "total_size" FROM "jobs" WHERE $timeFilter GROUP BY "host" ORDER BY time DESC

Very simple queries. Not a single query needed to be done in raw sql mode. Hope that helps

bbigras · September 17, 2020, 3:26pm

Here are my alert rules, inspired by GitLab’s postmortem of the 2017 data loss incident. (TLDR: some of their backups were failing, and they didn’t know about it.)

# ensure there was 1 snapshot in the last 24 hours
absent(restic_stats_last_snapshot_timestamp) or (time() - restic_stats_last_snapshot_timestamp) / 3600 > 24

# ensure there was 1 backup run in the last 24 hours
absent(restic_last_run_timestamp) or (time() - restic_last_run_timestamp) / 3600 > 24

# alert if total size drops by more than 10% since yesterday
absent(restic_stats_total_size_bytes) or restic_stats_total_size_bytes OFFSET 1d - restic_stats_total_size_bytes > restic_stats_total_size_bytes OFFSET 1d * 0.1

# alert if file count or total size doesn't change
absent(restic_stats_total_size_bytes) or rate(restic_stats_total_size_bytes[1d]) == 0 and rate(restic_stats_total_file_count[1d]) == 0
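
For readers less used to PromQL, the arithmetic behind the 24-hour and 10% rules above can be written out in plain shell. This is only an illustration of the thresholds in integer math; the function names are invented for this sketch and nothing here replaces the Prometheus rules.

```shell
#!/bin/sh
# Sketch: the threshold arithmetic of the alert rules, in integer shell math.

# Succeeds if the timestamp `ts` is more than max_hours older than `now`.
# Mirrors: (time() - ts) / 3600 > 24
is_stale() {
    ts=$1 now=$2 max_hours=$3
    [ $(( now - ts )) -gt $(( max_hours * 3600 )) ]
}

# Succeeds if `today` is more than pct percent smaller than `yesterday`.
# Mirrors: yesterday - today > yesterday * 0.1 (scaled by 100 to stay integer)
size_dropped() {
    yesterday=$1 today=$2 pct=$3
    [ $(( (yesterday - today) * 100 )) -gt $(( yesterday * pct )) ]
}

# Example:
# is_stale "$last_snapshot_ts" "$(date +%s)" 24 && echo "no snapshot in 24h"
```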

_hn · April 30, 2021, 10:54am

I’ve created restic2influx which feeds the restic status output into influxdb and allows you to visualize statistics from previous backup runs as well as the live status of currently running backups. Feel free to give it a try:

kreutpet · May 2, 2021, 9:46am

These dashboards are really great.
My setup is a bit different, as I run a rest-server with restic and have different machines back up to the rest-server.
Is there a way to fill InfluxDB centrally from the rest-server?

My service file looks like this:

[Unit]
Description=Rest Server
After=syslog.target
After=network.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/bin/rest-server --prometheus --path /mnt/backup/restic
Restart=always
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=multi-user.target

Can the rest-server also provide the statistics of the different repositories?

I also run the pruning on the server, so the repository passwords are known to the pruning scripts. That way I could also add some new scripts to collect statistics for the individual repositories.
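
A minimal sketch of what such a script could look like, limited to the part that needs no repository passwords: it prints the on-disk size of each repository directory in Prometheus text format. It assumes every direct subdirectory of the rest-server data path is one repository, which may not hold for every layout. The function name repo_sizes_prom and the metric name restic_repo_disk_bytes are made up for this example.

```shell
#!/bin/sh
# Sketch: print on-disk size per repository directory as Prometheus metrics.
# Assumes each direct subdirectory of base_dir is one restic repository;
# the metric name restic_repo_disk_bytes is hypothetical.

repo_sizes_prom() {
    base_dir=$1
    for repo in "$base_dir"/*/; do
        [ -d "$repo" ] || continue
        name=$(basename "$repo")
        # du -sk reports kibibytes; convert to bytes.
        kb=$(du -sk "$repo" | awk '{print $1}')
        printf 'restic_repo_disk_bytes{repo="%s"} %s\n' "$name" "$((kb * 1024))"
    done
}

# Example, using the data path from the unit file above
# (the output path is a placeholder):
# repo_sizes_prom /mnt/backup/restic > /var/lib/node_exporter/textfile/restic_repos.prom
```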

Does someone have a similar setup and already have some scripts at hand?

thx

Alexandr-Falcon99 · February 3, 2025, 6:53am

A similar question as kreutpet. Can someone share his recipe for monitoring a rest-server with several repositories?

creativeprojects · February 3, 2025, 7:27pm

Maybe it’s a bit dated now (in 4 years these tools could have changed a lot), but there was a recipe on how to do it (using resticprofile is not a requirement in this setup): resticprofile/contrib/grafana at master · creativeprojects/resticprofile · GitHub

(I can’t tell if it’s still working as I’m not using it)

GuitarBilly · February 4, 2025, 11:29am

@Alexandr-Falcon99 monitoring is a very wide topic. what monitoring are you looking for? I think that will determine the best solution for you.

FYI - for my personal backups (1 server, 1 workstation and four laptops, phones) I use a mix of:

- restic crontab jobs with healthchecks.io status monitoring: a “pass/fail” alarm.
- rest-server with the example “performance” monitoring set up from the GitHub example:
  GitHub - restic/rest-server: Rest Server is a high performance HTTP server that implements restic's REST backend API.
  I did this as a learning exercise. That dashboard does not add a lot for me since I already have performance monitoring on the server via LibreNMS.
- npbackup dashboard. Reference here:
  GitHub - netinvent/npbackup: A secure and efficient file backup solution that fits both system administrators (CLI) and end users (GUI)
  and some discussion here:
  prometheus Monitoring

Alexandr-Falcon99 · February 6, 2025, 1:28pm

Thanks for the advice.

Are you using librenms to monitor the server completely, or are you somehow monitoring the restic repository itself?

GuitarBilly · February 7, 2025, 6:45pm

@Alexandr-Falcon99 yes I use librenms for the entire server, where I run most things in docker.

rest-server with additional
prometheus
node_exporter
pushgateway
grafana
but also:
minecraft x3
photoprism x2
immich
backrest
rocketchat
jellyfin
syncthing
librenms

Librenms monitors the physical server overall cpu/network/memory/disk usage and things like temperature, voltages, fanspeed. I also have it monitor my NAS which i use as dumb storage.
The rest-server example dashboard: https://github.com/restic/rest-server/raw/master/examples/compose-with-grafana/screenshot.png
gives blob/data/index/keys/locks/snapshots read/write/delete throughput or operations. Most of these manifest themselves into regular cpu/network/disk usage.

For me the most added value comes from having alerts i.e. via LibreNMS (disk runs out of space, machine is down) or via “healthchecks.io” which alerts me if a machine does not respond to an hourly ping or backup job has not run or backup job has run with an error status.
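
A minimal sketch of the healthchecks.io part, for reference. Appending the job's exit status to the ping URL marks the check as up (0) or down (non-zero). The function names and the check URL are placeholders invented for this example, not my actual setup.

```shell
#!/bin/sh
# Sketch: report a job's exit status to healthchecks.io.
# The check URL used in the example is a placeholder.

# Build the ping URL for a given exit status (a trailing slash is tolerated).
hc_ping_url() {
    base=$1 rc=$2
    printf '%s/%s\n' "${base%/}" "$rc"
}

# Run a command, then ping the check with the command's exit status.
run_with_healthcheck() {
    base=$1
    shift
    "$@"
    rc=$?
    # -m 10: give up after 10 seconds; --retry 3: retry transient failures.
    curl -fsS -m 10 --retry 3 -o /dev/null "$(hc_ping_url "$base" "$rc")"
    return "$rc"
}

# Example:
# run_with_healthcheck "https://hc-ping.com/your-uuid-here" restic backup /home
```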

Automated backup is golden, but useless if you run it without proper monitoring or without knowledge that you can successfully restore the backup.

hth.

deajan · February 28, 2025, 7:15pm

@GuitarBilly Lol… LibreNMS, I’ve developed a lot for this one a couple of years ago (the exec library is mine, and the wrapper is a full rewrite by me). I’m still running LibreNMS since it’s the best open-source solution I’ve found so far for deeper network diagnostics.

Anyway, back to the subject, I’ve written a restic_metrics library that takes restic text or json output and parses it into prometheus metrics. The library is part of NPBackup, but can run in standalone mode, see npbackup/npbackup/restic_metrics/__init__.py at main · netinvent/npbackup · GitHub