Restic Grafana dashboard

With the 0.9.5 release, restic can now stream JSON output during backup operations. I had a quick play with my TIG stack (Telegraf, InfluxDB, Grafana) at home and was able to display some interesting metrics very easily.

I’d be curious to know if anyone has done anything similar/better as I’m little more than a beginner with such things.

Thanks for this new feature :+1:


In my case I only export the output of restic stats, plus the duration and exit code of each run, to Prometheus (via node-exporter’s textfile collector). That gives me a dashboard and alerts (e.g. if the backup size drops by more than some percentage, or if the backup didn’t run last night).

I use jq to format the output of restic stats:
restic stats --json latest | jq -r '"restic_stats_total_size_bytes \(.total_size)\nrestic_stats_total_file_count \(.total_file_count)"' > restic.prom

and for the snapshots:
restic snapshots --json | jq -r 'max_by(.time) | .time | sub("[.][0-9]+Z$"; "Z") | fromdate | "restic_stats_last_snapshot_timestamp \(.)"' >> restic.prom
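The post mentions exporting the duration and exit code as well, but only shows the stats commands. Below is a minimal sketch of that part, assuming a POSIX shell. The function name record_restic_run and the metrics restic_last_exit_code and restic_last_duration_seconds are hypothetical; restic_last_run_timestamp is the name the alert rules later in this thread expect. It writes its own .prom file, because the textfile collector reads every *.prom file in its directory, and it renames a temporary file into place so node-exporter never scrapes a half-written file.

```shell
# Hypothetical helper: run a command (e.g. restic backup ...) and record its
# exit code, duration and completion time for node-exporter's textfile collector.
# Usage: record_restic_run OUT_FILE COMMAND [ARGS...]
record_restic_run() {
    out=$1
    shift
    start=$(date +%s)
    rc=0
    "$@" || rc=$?            # capture the exit code without aborting under `set -e`
    end=$(date +%s)
    tmp="$out.$$"            # same directory as OUT_FILE, so mv is an atomic rename;
                             # the suffix keeps the collector from reading it early
    {
        printf 'restic_last_run_timestamp %s\n' "$end"
        printf 'restic_last_exit_code %s\n' "$rc"
        printf 'restic_last_duration_seconds %s\n' "$((end - start))"
    } > "$tmp" && mv "$tmp" "$out"
    return "$rc"
}
```

For example, record_restic_run /var/lib/node_exporter/textfile/restic_run.prom restic backup /home, where the directory is an assumption and must match node-exporter’s --collector.textfile.directory.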


I like the idea of having the most recent snapshots displayed and the time they were taken. I think I will add that information too.

Very nice… as an absolute Grafana newbie (i.e. I just got it running a few minutes ago)… any pointers you can give as to how you configured the queries, and what source you are using? Are you pulling with telegraf, or direct to Grafana somehow?

Thanks!

Yeah, you bet. I use Telegraf’s tail input plugin, which watches the log files I write restic’s --json output to and feeds them into my InfluxDB database.

The Grafana query to InfluxDB looks something like this:
SELECT last("percent_done") FROM "tail" WHERE ("path" = '/your/path/backup.log')

Let me know if you need any more detail

Sorry I didn’t get back sooner, but I appreciate the pointer!

That worked great for me.

restic snapshots latest --json | jq -r 'max_by(.time) | .time | sub("[.][0-9]+"; "") | sub("Z"; "+00:00") | def parseDate(date): date | capture("(?<no_tz>.*)(?<tz_sgn>[-+])(?<tz_hr>\\d{2}):(?<tz_min>\\d{2})$") | (.no_tz + "Z" | fromdateiso8601) - (.tz_sgn + "60" | tonumber) * ((.tz_hr | tonumber) * 60 + (.tz_min | tonumber)); parseDate(.) | "restic_last_snapshot_ts \(.)"' > restic.prom.$$

and

restic stats latest --json | jq -r '"restic_stats_total_size_bytes \(.total_size)\nrestic_stats_total_file_count \(.total_file_count)"' >> restic.prom.$$

Thanks a lot for that!

I had to make a small change to get the timestamp working across servers with different TZ settings, so I did some regex replacements :slight_smile:
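The jq parseDate function above turns a local time with a UTC offset into a Unix timestamp: epoch = (local time read as if it were UTC) - sign * (hh*3600 + mm*60). If jq’s date functions give trouble on some host, the same arithmetic can be done in awk. A minimal sketch, assuming restic’s time field has the form YYYY-MM-DDTHH:MM:SS[.fraction] followed by Z or ±HH:MM; the function name rfc3339_to_epoch is hypothetical, and the day count uses Howard Hinnant’s days-from-civil algorithm rather than anything from restic or jq.

```shell
# Hypothetical helper: convert an RFC 3339 timestamp (as printed by
# restic snapshots --json) to seconds since the Unix epoch, in UTC.
# Usage: rfc3339_to_epoch 2019-07-15T10:23:45.123456789+02:00
rfc3339_to_epoch() {
    printf '%s\n' "$1" | awk '
        # Days since 1970-01-01 for a proleptic Gregorian date (Hinnant).
        function days_from_civil(y, m, d,    era, yoe, doy, doe) {
            y -= (m <= 2)
            era = int((y >= 0 ? y : y - 399) / 400)
            yoe = y - era * 400
            doy = int((153 * (m + (m > 2 ? -3 : 9)) + 2) / 5) + d - 1
            doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
            return era * 146097 + doe - 719468
        }
        {
            y  = substr($0, 1, 4) + 0
            mo = substr($0, 6, 2) + 0
            d  = substr($0, 9, 2) + 0
            h  = substr($0, 12, 2) + 0
            mi = substr($0, 15, 2) + 0
            s  = substr($0, 18, 2) + 0
            tz = substr($0, 20)
            sub(/^\.[0-9]+/, "", tz)          # drop fractional seconds
            off = 0
            if (tz != "Z") {                  # "+HH:MM" or "-HH:MM"
                sign = (substr(tz, 1, 1) == "-") ? -1 : 1
                off = sign * (substr(tz, 2, 2) * 3600 + substr(tz, 5, 2) * 60)
            }
            # local wall-clock time minus the UTC offset gives UTC
            printf "%d\n", days_from_civil(y, mo, d) * 86400 + h * 3600 + mi * 60 + s - off
        }'
}
```

The result could then be written out with something like printf 'restic_last_snapshot_ts %s\n' "$(rfc3339_to_epoch "$t")", where $t holds the .time value extracted with jq -r.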


I’ve also created a dashboard based on parsing the restic backup --json output after every job. With the newest version, the JSON summary output of restic backup --stdin is fixed as well. I take the last summary message of the backup job (sometimes I get more than one) and transform that JSON object into InfluxDB line protocol.

path=/
json=$(/usr/local/bin/restic.sh backup --json --exclude-caches --exclude-file /etc/restic/excludes --tag dir "$path")
rc=$?
# keep only the last "summary" message of the job (sometimes there is more than one)
summary=$(printf '%s\n' "$json" | jq -s -c 'map(select(.message_type == "summary")) | last // empty')
sid=$(printf '%s\n' "$summary" | jq -r '.snapshot_id // empty')
if [ -n "$sid" ] ; then
    tags="host=$(hostname),type=backup,tag=dir,path=$path,snapshot_id=$sid"
    # numeric summary fields become line-protocol fields: key=value,key=value,...
    fields=$(printf '%s\n' "$summary" | jq -r 'to_entries | map(select(.value | type == "number") | "\(.key)=\(.value)") | join(",")')
    curl -s -XPOST \
        "http://${INFLUX_HOST}:${INFLUX_PORT}/write?db=${INFLUX_DB}" \
        -u "${INFLUX_USER}:${INFLUX_PASS}" \
        --data-binary "jobs,$tags rc=$rc,$fields"
fi

Combine this with the rest-server Prometheus output and filesystem stats (total backup size), and you can end up with something like this:

Of course there’s always room for improvement (especially when it comes to processing the stream during backup runs).
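The script puts path=$path straight into the tag set. In InfluxDB line protocol, commas, equals signs and spaces inside tag keys and tag values must be escaped with a backslash, so a path such as /srv/my data would break the write. A minimal sketch of an escaping helper; the names lp_escape_tag and lp_tags are hypothetical and not part of the original script.

```shell
# Hypothetical helper: escape a tag key or value for InfluxDB line protocol
# by putting a backslash in front of every comma, equals sign and space.
lp_escape_tag() {
    printf '%s' "$1" | sed 's/[,= ]/\\&/g'
}

# Hypothetical helper: build "k1=v1,k2=v2,..." from alternating key/value
# arguments, escaping every key and value.
lp_tags() {
    out=""
    while [ "$#" -ge 2 ]; do
        pair="$(lp_escape_tag "$1")=$(lp_escape_tag "$2")"
        out="${out:+$out,}$pair"     # add a comma only between pairs
        shift 2
    done
    printf '%s' "$out"
}
```

In the backup script the tag line could then read tags=$(lp_tags host "$(hostname)" type backup tag dir path "$path" snapshot_id "$sid").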


This looks awesome! Could you share your dashboard? I am very interested in the InfluxDB queries you use in Grafana.

These are just very basic influx queries and very easy to create using the graphical query builder.

But here are some examples:

# Backup Growth Per Day
SELECT sum("data_added") FROM "jobs" WHERE ("host" =~ /^$backuphost$/ AND "tag" =~ /^$backuptags$/) AND $timeFilter GROUP BY time(1d), "host" fill(null)

# Backup Size Per Day
SELECT sum("total_bytes_processed") FROM "jobs" WHERE ("host" =~ /^$backuphost$/ AND "tag" =~ /^$backuptags$/) AND $timeFilter GROUP BY time(1d), "host" fill(null)

# Backup Jobs Table
SELECT "data_added" AS "added", "total_bytes_processed" AS "total_size" FROM "jobs" WHERE ("tag" =~ /^$backuptags$/ AND "host" =~ /^$backuphost$/) AND $timeFilter GROUP BY "host", "path", "tag", "snapshot_id" ORDER BY time DESC

# Backup Size Table
SELECT sum("total_bytes_processed") AS "total_size" FROM "jobs" WHERE $timeFilter GROUP BY "host" ORDER BY time DESC
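To try these queries outside Grafana, the Grafana template variables ($timeFilter, $backuphost, $backuptags) have to be replaced first. A rough sketch, assuming InfluxDB 1.x and its /query HTTP endpoint, reusing the INFLUX_* variables from the backup script earlier in the thread. The helper names are hypothetical, and replacing the host and tag variables with .* simply matches everything.

```shell
# Hypothetical helper: substitute the Grafana variables used in the queries
# above so they can be run directly. $1 = query, $2 = time range such as 7d.
influxql_fill_vars() {
    printf '%s\n' "$1" | sed \
        -e 's/\$timeFilter/time > now() - '"$2"'/g' \
        -e 's/\$backuphost/.*/g' \
        -e 's/\$backuptags/.*/g'
}

# Hypothetical helper: send a query to the InfluxDB 1.x /query endpoint;
# the JSON response is printed on stdout.
influx_query() {
    curl -sG "http://${INFLUX_HOST}:${INFLUX_PORT}/query" \
        -u "${INFLUX_USER}:${INFLUX_PASS}" \
        --data-urlencode "db=${INFLUX_DB}" \
        --data-urlencode "q=$1"
}
```

For example: influx_query "$(influxql_fill_vars 'SELECT sum("data_added") FROM "jobs" WHERE $timeFilter GROUP BY time(1d), "host" fill(null)' 30d)".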

Very simple queries; not a single one needed Grafana’s raw query mode. Hope that helps!

Here are my alert rules, inspired by GitLab’s postmortem of its 2017 data loss incident (TL;DR: some of their backups were failing and they didn’t know about it).

# ensure there was a snapshot in the last 24 hours
absent(restic_stats_last_snapshot_timestamp) or (time() - restic_stats_last_snapshot_timestamp) / 3600 > 24

# ensure the backup job ran in the last 24 hours
absent(restic_last_run_timestamp) or (time() - restic_last_run_timestamp) / 3600 > 24

# alert if the total size dropped by more than 10% since yesterday
absent(restic_stats_total_size_bytes) or restic_stats_total_size_bytes OFFSET 1d - restic_stats_total_size_bytes > restic_stats_total_size_bytes OFFSET 1d * 0.1

# alert if neither the total size nor the file count changed during the last day
absent(restic_stats_total_size_bytes) or changes(restic_stats_total_size_bytes[1d]) == 0 and changes(restic_stats_total_file_count[1d]) == 0
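The first rule can also be checked locally, for example from cron on a host that is not scraped by Prometheus. A minimal sketch that reads the restic.prom file produced by the jq commands earlier in the thread; the function name snapshot_is_fresh is hypothetical, and, as with absent(), a missing metric or unreadable file counts as a failure.

```shell
# Hypothetical local check mirroring the first alert rule: succeed (exit 0)
# only if FILE contains restic_stats_last_snapshot_timestamp and that value is
# at most MAX_AGE seconds older than NOW.
# Usage: snapshot_is_fresh FILE [MAX_AGE_SECONDS] [NOW_EPOCH]
snapshot_is_fresh() {
    prom_file=$1
    max_age=${2:-86400}          # 24 hours, as in the alert rule
    now=${3:-$(date +%s)}
    awk -v now="$now" -v max_age="$max_age" '
        $1 == "restic_stats_last_snapshot_timestamp" { ts = $2; found = 1 }
        END { exit !(found && now - ts <= max_age) }
    ' "$prom_file"
}
```

A cron job could then run something like snapshot_is_fresh /path/to/restic.prom || echo "no recent restic snapshot" and mail the output.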

I’ve created restic2influx, which feeds restic’s status output into InfluxDB and lets you visualize statistics from previous backup runs as well as the live status of currently running backups. Feel free to give it a try:


These dashboards are really great.
My setup is a bit different: I run restic’s rest-server and have several machines back up to it.
Is there a way to fill InfluxDB centrally from the rest-server?

My service file looks like this:

[Unit]
Description=Rest Server
After=syslog.target
After=network.target

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/local/bin/rest-server --prometheus --path /mnt/backup/restic
Restart=always
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=multi-user.target

Can the rest-server also provide statistics for the individual repositories?

I also run the pruning on the server, so the pruning scripts know the repository passwords. That means I could also add scripts to collect statistics from the individual repositories.

Does anyone have a similar setup and already have some scripts at hand?

thx
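Since rest-server only stores client-side encrypted data and never sees the repository passwords, its --prometheus metrics can only describe the requests it serves, not what is inside the repositories; per-repository statistics have to come from restic itself. A sketch of the approach described in the post, under assumptions: each repository is a directory directly under /mnt/backup/restic (the --path from the service file above), and each has a password file at /etc/restic/<name>.pass (a hypothetical layout). The function names are hypothetical; -r, --password-file and stats latest --json are regular restic options, and total_size/total_file_count are the fields used earlier in the thread.

```shell
# Hypothetical: print one repository's stats in Prometheus text format, with a
# "repo" label so a single .prom file can cover every repository on the server.
# Usage: prom_repo_stats REPO_NAME TOTAL_SIZE_BYTES TOTAL_FILE_COUNT
prom_repo_stats() {
    # label values must have backslashes and double quotes escaped
    repo=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    printf 'restic_stats_total_size_bytes{repo="%s"} %s\n' "$repo" "$2"
    printf 'restic_stats_total_file_count{repo="%s"} %s\n' "$repo" "$3"
}

# Hypothetical collector loop (needs restic and jq); repository root and
# password-file layout are assumptions, adjust to the real setup.
# Usage: collect_repo_stats OUT_FILE
collect_repo_stats() {
    out=$1
    tmp="$out.$$"
    for dir in /mnt/backup/restic/*/; do
        name=$(basename "$dir")
        json=$(restic -r "$dir" --password-file "/etc/restic/$name.pass" stats latest --json) || continue
        prom_repo_stats "$name" \
            "$(printf '%s' "$json" | jq '.total_size')" \
            "$(printf '%s' "$json" | jq '.total_file_count')"
    done > "$tmp" && mv "$tmp" "$out"   # atomic replace for the textfile collector
}
```

Run from the same cron job as the pruning, the output file would go into node-exporter’s textfile directory; Telegraf can also scrape that node-exporter endpoint if the data should end up in InfluxDB.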