Corrupted file system and failing backup user story

Hi all,

Over the weekend I had an interesting file system encounter that interacted somewhat with my restic backup. Restic worked as intended and was never at fault, but I learned a bit about my backup approach/setup, so I would like to retell it.

I have been running backups from my laptop to a remote bucket with restic for some time. For that, I wrapped restic in a small bash script that gets started regularly by systemd service/timer units.

However, I only noticed yesterday that no backup run had succeeded for a few months - i.e., unnoticed, the unit started but never finished successfully.

restic backup had been throwing the error message

Fatal: unable to save snapshot: node "prefs.js" already present

and exited. But since my wrapping bash script was not running in a “safe mode”, the failing exit code was followed by a few more commands (mostly echos) that finished successfully, so the whole unit returned 0 to systemd and everybody was (superficially) happy.
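The failure mode can be reproduced in a few lines (a minimal sketch, not my actual script; `false` stands in for a failing restic invocation):

```shell
#!/bin/bash
# Without strict mode, a failing command does not stop the script,
# and the exit status systemd sees is that of the LAST command.
run_backup() { false; }   # pretend restic exited with an error

run_backup                # non-zero exit code is silently discarded
echo "backup finished"    # this succeeds...
# ...so the script as a whole exits 0 and systemd reports success
```

With `set -e` at the top, the script would instead abort right at `run_backup` and the unit would be marked as failed.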

It turned out to be a file system corruption. My volume is on btrfs (now I see why RHEL dropped btrfs for good some time ago…). While another unit regularly scrubs and rebalances the volume, it did not catch the error: a file prefs.js had multiple entries in its parent directory, i.e., it appeared three times in the directory while seeming to be the same(??) file (only one inode, for what it’s worth in btrfs).
While I could read the file and cat it into a new one, I could only delete one entry and was not able to delete the dangling other two entries (no obvious white spaces in the names or so). I “fixed” it in the end by tar’ing the whole directory tree, removing the whole dir/tree, and unpacking the tar again.
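Roughly, the workaround looked like this (the paths are illustrative, not the actual ones):

```shell
# Pack the affected subtree into an archive, which only sees one
# entry per file and thus produces a clean copy.
tar -cf /tmp/parentdir.tar -C /path/to parentdir

# Drop the corrupted directory, dangling entries and all.
rm -rf /path/to/parentdir

# Unpack the clean copy back into place.
tar -xf /tmp/parentdir.tar -C /path/to
```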

Now my scripts use the “unofficial bash strict mode”, which should exit immediately and report a proper error to systemd when a problem occurs.
http://redsymbol.net/articles/unofficial-bash-strict-mode/

My main lessons are that even with scrubs (which apparently check blocks but not trees) a btrfs volume can be in an unhealthy state, and that a wrapping backup script should fail for good :wink:

Cheers,
Thomas


Heh, thanks for reporting your experience!

I discovered bash strict mode a few years ago, and since then all my bash scripts (when I cannot avoid writing bash at all) start with:

#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

The next restic version will downgrade that error to a warning: archiver: Improve handling of "file xxx already present" error by MichaelEischer · Pull Request #3880 · restic/restic · GitHub .

A common trick to be on the safe side is to have a second cronjob which just checks that new snapshots have been created in the last few days. That way, if the first script fails or isn’t run for some reason, there’s a second one to warn you early on.


Strict mode sounds very interesting, thanks! I searched for it and also came across this interesting trap that shows you which command on which line failed. Nice!

Silent failure is a huge problem. I’m using the excellent healthchecks.io; it will send you notifications via email or other services when it does not receive a positive ping for your UUID within the specified amount of time, OR when it receives a failure submission (in this case, a non-zero exit code):

#!/bin/sh

HEALTHCHECKIO_UUID="UUID"

healthcheckio () {
  curl -fsS -m 10 --retry 5 -o /dev/null \
    "https://hc-ping.com/${HEALTHCHECKIO_UUID}/$1" \
    --data-raw "$2"
}

# signal backup is starting
healthcheckio start

# backing up, capture stderr and stdout
RESTICOUTPUT=$(restic YOURRESTICOMMANDS 2>&1)
RESTICEXITCODE=$?

# signal exit code, stderr/stdout to healthcheckio
healthcheckio "$RESTICEXITCODE" "$RESTICOUTPUT"

exit "$RESTICEXITCODE"

Read their excellent documentation.