I’m trying to set up a backup solution using restic for personal use. In the process of backing up a folder I found that the source hard drive had developed several bad blocks. Restic failed the backup with a message about a cyclic redundancy check (CRC) error. Is it expected for restic to quit the backup in such a situation, or is it possible to make it finish backing up what it can, skipping non-readable files in the process? It failed after encountering the second corrupted file out of more than a million.
It appears to me this situation would be fairly common, as backups become most valuable precisely when such problems occur. In that case, letting it run to the end and back up the changes since the last snapshot would probably be more useful than losing those changes altogether.
restic 0.17.1 compiled with go1.23.1 on windows/amd64
You did not provide the information you were asked for when you opened the editor for your post. Please complete it by adding the requested information (to see what it is, you can open the edit form for a new topic in the Getting Help category). Such technical information is needed to even know what the symptoms you are talking about look like; people cannot make an assessment without seeing what you are seeing.
I think I didn’t ask the question correctly: I don’t need help with backup troubleshooting (yet); the question was about restic’s overall approach to backing up data located on a drive that is starting to fail.
It appears to me restic would fail fast in this case and the corresponding snapshot would not be created; this is at least what I observed. There’s no ‘partial’ snapshot concept in restic: it either creates a full one or none. Consequently, the files which are still readable on the source disk will not be backed up, and their changes since the last successful snapshot will be lost when the drive eventually fails completely. Is my understanding correct?
Would this mean that, to minimize the loss, one should create snapshots more often and have some notification in place to alert when restic fails to create the next snapshot?
Technically, a drive that starts to fail can create conditions that prevent restic from operating properly (outside of restic’s control) as well as conditions where restic is able to skip a file that cannot be read and inform the user about it. Generally speaking, restic tries to back up what it can.
How and when restic fails/bails/cancels due to <some issue> depends on the specific issue. Snapshots are created at the very end of the backup process, but data that restic was able to read and upload to the repository will not have to be uploaded again on the next backup.
If you never managed to create a snapshot, you can see if the recover command can help you get back what has been stored in the repository (see restic help recover).
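For reference, a minimal sketch of that workflow (the repository URL and credentials below are placeholders, not the poster’s actual values):

```
# Sketch: build a snapshot from data already uploaded to the repository,
# then list snapshots to confirm it appeared. Repository details are placeholders.
restic -r rest:https://user:pass@rest-host:8000/photorepo recover
restic -r rest:https://user:pass@rest-host:8000/photorepo snapshots
```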
Well, if you have a disk that starts misbehaving to the extent that restic fails to complete its backup, then yes, the changed/new/updated data for the files that it had not yet processed and uploaded to the repository will not be saved to the repository, and hence not be restorable either.
For the files that it did process before being interrupted, there will be no corresponding snapshot, so you’d have to use the recover command to get their data back. However, in a situation like this, the normal course of action is to act immediately if your backups don’t run, and consider the last successful snapshot to be your restore point for the data at hand.
In certain cases you would be able to re-run the backup but with problematic files/folders excluded, to make restic back up the other ones, in case it has problems continuing the backup process when encountering the problematic files (again, all of this depends on what the actual symptoms are).
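In your setup that could look something like the sketch below (the excluded path and list file are hypothetical examples, not taken from your actual run; the `--cacert` argument is the one from your batch file):

```
:: Hypothetical sketch: exclude files known to hit CRC errors so the
:: rest of the folder can still be backed up.
restic backup E:\Archive\Photos --cacert e:\Archive\Linux\Media\backup\public_key ^
    --exclude "E:\Archive\Photos\2019\IMG_0123.JPG" ^
    --exclude-file E:\bad-files.txt
```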
You have to work off the assumption that your infrastructure is working as it should, and that when it doesn’t, you have the last backup you ran be the most recent restore point. Any data between that last restore point and “now” is just up to you to either back up/salvage, or just throw away. There are various ways to deal with that, e.g. by excluding or backing up specific files with restic, or just salvaging what you can off the failing disk if it is in bad shape. It all boils down to the actual symptoms in that specific situation.
Personally, on the machines where I don’t just do manual backup runs after changing relevant data, I have hourly backup runs.
Regarding notifications, yes, you should of course keep some kind of eye on, or monitoring of, the backup processes, to make sure that you detect when they don’t complete successfully.
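As a rough sketch of such monitoring (POSIX shell for illustration; restic’s backup command documents exit code 0 for success and 3 when some source files could not be read but a snapshot was still created, so anything else here triggers an alert — the actual restic invocation and the notification mechanism are left as placeholders):

```shell
#!/bin/sh
# Sketch: run a backup command and raise an alert on unexpected exit codes.
# The backup command is stubbed with `sh -c "exit N"` so the example is
# self-contained; substitute your real `restic backup ...` invocation.
run_and_alert() {
    "$@"
    status=$?
    # 0 = success; 3 = some source files unreadable, snapshot still created
    if [ "$status" -ne 0 ] && [ "$status" -ne 3 ]; then
        # replace echo with mail/whatever notification mechanism you use
        echo "ALERT: backup failed with exit code $status"
    fi
}

run_and_alert sh -c "exit 3"   # read errors only: no alert
run_and_alert sh -c "exit 1"   # fatal error: prints an alert
```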
I don’t understand why you don’t want to include the complete commands you ran, any relevant environment variables for them, and their complete output including error messages (deduplicated in a sensible way, if needed). Those CRC errors would be interesting to see.
Thanks a lot for the very detailed answer, it helps a lot.
I’ve been doing robocopy-based backups for years, and they were sufficient for my needs until a recent event where one of my computers was hacked over the RDP protocol and its drives were encrypted. It seems the attackers didn’t guess the credentials but rather exploited the protocol itself. Luckily, the damage was limited, but it highlighted the issue: if the hackers had encrypted valuable data, my robocopy-based backup would have happily overwritten the previous night’s backups with it, hence my sudden interest in a more advanced backup solution.
To that end, I set up a local rest-server on a dedicated Linux machine (not a VM) in append-only mode and am trying to run restic off my ‘source’ computers on a periodic basis. These tests brought to light a problem with one of my source HDDs: when I was trying to back up one of its folders, restic failed to read one or two files with a cyclic redundancy check error. Since I had robocopied its content beforehand, I ran chkdsk /F /R; while it took more than 10 hours to complete, it seems to have fixed the issue for now.
I’m moving data off that drive and trying to replace it ASAP.
As you can see, it’s not that I don’t want to include the commands’ output; I simply failed to capture it in all that excitement.
If there’s interest in capturing the details, I can try to make a copy of the data on that drive and then back it up to a different repo to possibly trigger the issue again.
The backup command was in a batch file called backup.bat, manually executed from a command window under the local Administrator account on x64 Windows Server 2022 Standard. The folder was an archive of family photos spanning 10+ years, total size about 200GB on a 6TB drive. Things in square brackets had actual values; E: is the local failing source HDD:
SET RESTIC_PASSWORD=[PASSWORD]
SET RESTIC_REPOSITORY=rest:https://[http_user]:[http_password]@[REST-HOST]:8000/photorepo
restic backup E:\Archive\Photos --cacert e:\Archive\Linux\Media\backup\public_key
restic forget --keep-within 90d --prune --verbose --cacert e:\Archive\Linux\Media\backup\public_key
restic check --cacert e:\Archive\Linux\Media\backup\public_key
SET RESTIC_PASSWORD=
SET RESTIC_REPOSITORY=
A CRC error while reading a file shouldn’t interrupt the backup. The backup command continues even if some of the source files cannot be read. The only reason for a backup to fail is when the upload to the repository fails. This can be either the upload itself or a problem while creating the next pack files to upload. Those pack files are written as temporary files to disk, which might have been the problem.
But without the actual output it’s unfortunately not possible to say more.
Are you saying the pack files are written as temp files on the client before they get sent to the rest-server? Then yes, the failing source HDD might present a challenge. In that case I wonder if it would be possible to use some other drive for the temp files location, to increase the chances of backing up files which are still readable but haven’t been processed yet during the current backup run?
Well, this is the temp folder on the client to store cached data, not the actual pack files. If I’m not mistaken, restic streams data directly to the backend, so it shouldn’t have a hard dependency on the failing source HDD. Monitoring snapshots on the backend machine proved to be quite an effective strategy; it already caught a failing patch cable to the backend machine, for example.
Yes and no. Pack files are cached temporarily, and depending on how quickly they get pushed to the repository, they may or may not get written to disk. The pack size contributes to this, as larger packs are more likely to get flushed to disk.
As long as the failing drive is not where the $TMPDIR is located, it shouldn’t make things worse. But if it IS where the $TMPDIR is located, you risk writes being corrupted by the failing disk and also taxing the drive more than is necessary. You might consider --read-concurrency=1 to not tax the drive as much, as well.
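In your batch file that could be sketched as follows (D: stands in for any healthy drive — an assumption, adjust to your setup; per the manual, restic on Windows honours $TMP/$TEMP for its temporary files):

```
:: Sketch: point restic's temporary pack files at a healthy drive and
:: read one file at a time to go easy on the failing disk.
SET TMP=D:\restic-tmp
SET TEMP=D:\restic-tmp
restic backup E:\Archive\Photos --read-concurrency=1 --cacert e:\Archive\Linux\Media\backup\public_key
```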
Taken from the manual (namely the 2nd and 3rd paragraphs):
In certain instances, such as very large repositories (in the TiB range) or very fast upload connections, it is desirable to use larger pack sizes to reduce the number of files in the repository and improve upload performance. Notable examples are OpenStack Swift and some Google Drive Team accounts, where there are hard limits on the total number of files. Larger pack sizes can also improve the backup speed for a repository stored on a local HDD. This can be achieved by either using the --pack-size option or defining the $RESTIC_PACK_SIZE environment variable. Restic currently defaults to a 16 MiB pack size.
The side effect of increasing the pack size is requiring more disk space for temporary pack files created before uploading. The space must be available in the system default temp directory, unless overwritten by setting the $TMPDIR (except Windows) environment variable (on Windows use $TMP or $TEMP). In addition, depending on the backend the memory usage can also increase by a similar amount. Restic requires temporary space according to the pack size, multiplied by the number of backend connections plus one. For example, if the backend uses 5 connections (the default for most backends), with a target pack size of 64 MiB, you’ll need a minimum of 384 MiB of space in the temp directory. A bit of tuning may be required to strike a balance between resource usage at the backup client and the number of pack files in the repository.
Note that larger pack files increase the chance that the temporary pack files are written to disk. An operating system usually caches file write operations in memory and writes them to disk after a short delay. As larger pack files take longer to upload, this increases the chance of these files being written to disk. This can increase disk wear for SSDs.
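The temp-space rule from that second paragraph works out like this (a quick sanity check, not restic output):

```shell
# Temp space needed ≈ pack size × (backend connections + 1).
# With the default 5 connections and a 64 MiB target pack size:
connections=5
pack_size_mib=64
echo "$(( pack_size_mib * (connections + 1) )) MiB"   # 384 MiB
```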
TL;DR If it’s going to increase wear for SSDs, it’s going to increase wear on a failing HDD as well.
PS: For actual drive recovery, I’d consider using ddrescue, not Restic. But to quickly grab just the data you need, without having enough space for a full clone, Restic is as good as anything else, really.
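For completeness, a typical ddrescue invocation looks roughly like this (the device name and output paths are hypothetical placeholders; the map file records progress so later passes can retry only the bad areas):

```
# Sketch: clone the failing disk, skipping bad areas first, then retry them.
# /dev/sdb and the output paths are placeholders for your actual setup.
ddrescue -n /dev/sdb /mnt/rescue/disk.img /mnt/rescue/disk.map   # first pass, no scraping
ddrescue /dev/sdb /mnt/rescue/disk.img /mnt/rescue/disk.map      # retry the bad areas
```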
I see, thank you for clarification, it’s good to understand how things work.
By the way, I am not trying to use restic as a data recovery tool. The story started with me trying to implement a regular backup process and suddenly discovering that the source HDD was already failing, and it unfolded from there. The disk didn’t fail completely, which allowed me to migrate the data to a replacement drive, so the situation wasn’t that bad in the end. Since then I have also discovered that one of the patch cables on the way to the backend machine developed a strange problem, leading to a drop in speed from 1 Gbit/s to just 10 Mbit/s. I wonder if that, and not the CRC errors the HDD produced, was the actual reason for the original backup failure. At least this is how I found out: restic failed to create the next snapshot and transfer speeds were really low. If nothing else, restic served as a good ‘stress’ tool for my environment, bringing down components which were on their last legs anyway: I replaced two HDDs which were quietly failing and one patch cable.
Restic has absolutely saved my bacon on a failing drive in the past - but it’s best used with a prior snapshot, instead of from scratch. If you already have a snapshot, then all it has to do is quickly read what’s changed - which is much less likely to push the disk over the edge than reading the whole drive. For that, it’s great. The other thing it excels at is computer stress tests, as you’ve now figured out hahaha. Restic is great at shining a light on disks, cables, CPUs, and memory that may have gone bad. Consider it the canary in the digital coal mine lol
Glad you caught it all in time and didn’t lose anything!
Yeah, on the surface it might have looked disastrous and frustrating, as one of the failed disks was the system SSD, so I had to not only save the data but also reinstall pretty much everything. On the other hand, I think I got really lucky: both drives were still working, and if not for this restic implementation they would have continued dying quietly until that not-so-sunny future day when the system would fail to boot. At least I was able to repair everything in an orderly fashion and without much stress. The SSD manufacturer (Samsung) even replaced it under warranty; it turns out they had a firmware bug leading to the same areas wearing out, and otherwise it still had >98% life left in it. The patch cable was also starting to drive me nuts, as it was on the path to the file server and the shares had started to ‘misbehave’. At first I suspected other components, as Ethernet cables are usually way down the list of candidates, until the email about a missing snapshot made it clear I had a network issue. I hope these fixes will last for some years to come; I’ve never had so many things fail so close together in time.