Backing up a failing disk

Hello,

I’m trying to set up a backup solution using restic for personal use. While backing up a folder, I found that the source hard drive had developed several bad blocks. Restic aborted the backup with a message about a cyclic redundancy check (CRC) error. Is it expected for restic to quit the backup in this situation, or is it possible to make it finish backing up whatever it can, skipping unreadable files along the way? It failed after encountering the second corrupted file out of more than a million.

It appears to me this situation would be fairly common, as backups become most valuable precisely when such problems occur. In that case, letting the backup run to the end and capture whatever has changed since the last snapshot would probably be more useful than losing those changes altogether.

restic 0.17.1 compiled with go1.23.1 on windows/amd64

You did not provide the information you were asked for when you opened the editor for your post. Please add the requested information (to see what it is, you can open the edit form for a new topic in the Getting Help category). Such technical information is needed to even know what the symptoms you are talking about look like; people cannot make an assessment without seeing what you are seeing.

I think I didn’t ask the question correctly: I don’t need help with backup troubleshooting (yet). The question was about restic’s overall approach to backing up data located on a drive that is starting to fail.

It appears to me restic fails fast in this case and the corresponding snapshot is not created; this is at least what I observed. There’s no ‘partial’ snapshot concept in restic: it either creates a full one or none. Consequently, the files which are still readable on the source disk will not be backed up, and their changes since the last successful snapshot will be lost when the drive eventually fails completely. Is my understanding correct?

Does this mean that, to minimize the loss, one should create snapshots more often and have some notification in place to alert when restic fails to create the next snapshot?

Technically, a drive that starts to fail can create conditions that prevent restic from operating properly (outside of restic’s control), as well as conditions where restic is able to skip a file that cannot be read and inform the user about it. Generally speaking, restic tries to back up what it can.

How and when restic fails/bails/cancels due to <some issue> depends on the specific issue. Snapshots are created at the very end of the backup process, but data that restic was able to read and upload to the repository will not have to be uploaded again on the next backup.
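As a rough illustration of the "depends on the specific issue" point: the restic documentation lists exit codes for the backup command, where 0 means a complete snapshot, 1 a fatal error with no snapshot, and 3 a snapshot that was created although some source files could not be read. Below is a minimal Python sketch of how a wrapper script might interpret them. The function names are made up for this example, and the sketch cannot tell you which code a particular disk error will produce.

```python
# Exit codes of `restic backup` as listed in the restic documentation.
# The descriptions are paraphrased, not quoted.
BACKUP_EXIT_CODES = {
    0: "success: snapshot created with all source files",
    1: "fatal error: no snapshot created",
    3: "incomplete: snapshot created, but some source files could not be read",
}


def describe_backup_exit(code: int) -> str:
    """Return a short description of a `restic backup` exit code."""
    # Codes not in the table (for example repository or lock problems)
    # are reported generically rather than guessed at.
    return BACKUP_EXIT_CODES.get(
        code, f"other exit code {code}: see the restic documentation"
    )


def snapshot_was_created(code: int) -> bool:
    """True if the exit code indicates that a snapshot exists, complete or not."""
    return code in (0, 3)
```

A script could log `describe_backup_exit(code)` after each run and treat anything other than 0 as worth a closer look.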

If you never managed to create a snapshot, you can see if the recover command can help you get back what has been stored in the repository (see restic help recover).

Well, if you have a disk that starts misbehaving to the extent that restic fails to complete its backup, then yes, the changed/new/updated data for the files that it had not yet processed and uploaded will not be saved to the repository, and hence will not be restorable either.

The files that it did process before being interrupted will not have a corresponding snapshot, so you’d have to use the recover command to get their data back. However, in a situation like this, the normal course is to act immediately if your backups don’t run, and to consider the last successful snapshot to be your restore point for the data at hand.

In certain cases you would be able to re-run the backup but with problematic files/folders excluded, to make restic back up the other ones, in case it has problems continuing the backup process when encountering the problematic files (again, all of this depends on what the actual symptoms are).
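To make the exclusion idea concrete, here is a small Python sketch that assembles a `restic backup` command line with one `--exclude` option per problematic file. The helper name and the file name in the demo are hypothetical. How patterns are matched (for example case sensitivity on Windows) is governed by restic itself, so check the documentation before relying on a particular pattern.

```python
from typing import List


def build_backup_command(source: str, excludes: List[str]) -> List[str]:
    """Build an argv list for `restic backup` that skips the given patterns."""
    cmd = ["restic", "backup", source]
    for pattern in excludes:
        # One --exclude option per pattern.
        cmd += ["--exclude", pattern]
    return cmd


if __name__ == "__main__":
    # Hypothetical name; replace with the files restic reported as unreadable.
    # A bare file name like this would match that name in any directory.
    bad_files = ["IMG_0001.jpg"]
    print(build_backup_command(r"E:\Archive\Photos", bad_files))
```

Building the command as a list rather than a single string avoids quoting problems with paths that contain spaces.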

You have to work off the assumption that your infrastructure is working as it should, and that when it doesn’t, the last backup you ran is your most recent restore point. Any data between that last restore point and “now” is up to you to either back up/salvage or throw away. There are various ways to deal with that, e.g. by excluding or backing up specific files with restic, or just salvaging what you can off the failing disk if it is in bad shape. It all boils down to the actual symptoms in that specific situation.

Personally on the machines where I don’t just do manual backup runs after I changed some relevant data, I have hourly backup runs.

Regarding notifications, yes, you should of course have some kind of eye on or monitoring of the backup processes, to make sure that you detect when they don’t complete successfully :slight_smile:
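One possible shape for such monitoring, sketched in Python: run the backup command, keep its output, and call an alert function whenever the exit code is not zero. The name `run_and_alert` and the `alert` callback are assumptions for this example. What the alert actually does (mail, chat message, log entry) is left open.

```python
import subprocess
import sys
from typing import Callable, List


def run_and_alert(cmd: List[str], alert: Callable[[str], None]) -> int:
    """Run cmd, call alert(message) if it exits non-zero, return the exit code."""
    # capture_output keeps stdout/stderr so the error text is not lost.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        alert(f"{' '.join(cmd)} exited with {result.returncode}\n{result.stderr}")
    return result.returncode
```

For instance, `run_and_alert(["restic", "backup", r"E:\Archive\Photos"], print)` would print the error output of a failed run. Swapping `print` for a function that sends a mail turns it into a basic notification.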


I don’t understand why you don’t want to include the complete commands you ran, any relevant env vars for them, and their complete output including error messages (deduplicated where that makes sense). Those CRC errors would be interesting to see.

Thanks a lot for the very detailed answer, it helps a lot.

I’ve been doing robocopy-based backups for years, and they were sufficient for my needs until a recent incident in which one of my computers was hacked over RDP and its drives were encrypted. It seems the attackers didn’t guess the credentials but rather exploited the protocol itself. Luckily, the damage was limited, but it highlighted the issue: had the attackers encrypted valuable data, my robocopy-based backup would have happily overwritten the previous night’s backups with it. Hence my sudden interest in a more advanced backup solution.

To that end, I set up a local rest-server in append-only mode on a dedicated Linux machine (not a VM) and am trying to run restic from my ‘source’ computers on a periodic basis. These tests brought to light a problem with one of my source HDDs: when I tried to back up one of its folders, restic failed to read one or two files with a cyclic redundancy code error. Since I had already robocopied its contents, I ran chkdsk /F /R, and while it took more than 10 hours to complete, it seems to have fixed the issue for now.
I’m moving data off that drive and trying to replace it ASAP.

As you can see, it’s not that I don’t want to include the command output; I simply failed to capture it in all the excitement.
If there is interest in the details, I can try to make a copy of the data on that drive and then back it up to a different repo to possibly trigger the issue.

The backup command was in a batch file called backup.bat, executed manually from a command window under the local Administrator account on x64 Windows Server 2022 Standard. The folder was an archive of family photos spanning 10+ years, about 200 GB in total on a 6 TB drive. Values in square brackets stand in for the actual values; E: is the drive letter of the local failing source HDD:

SET RESTIC_PASSWORD=[PASSWORD]
SET RESTIC_REPOSITORY=rest:https://[http_user]:[http_password]@[REST-HOST]:8000/photorepo
restic backup E:\Archive\Photos --cacert e:\Archive\Linux\Media\backup\public_key
restic forget --keep-within 90d --prune --verbose --cacert e:\Archive\Linux\Media\backup\public_key
restic check --cacert e:\Archive\Linux\Media\backup\public_key
SET RESTIC_PASSWORD=
SET RESTIC_REPOSITORY=