I did all the steps (apart from step 2), but restic check still reports an error:
$ restic check --read-data
using temporary cache in /tmp/restic-check-cache-753204985
create exclusive lock for repository
repository d0b94d05 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-753204985
load indexes
[0:06] 100.00% 3 / 3 index files loaded
check all packs
7 additional files were found in the repo, which likely contain duplicate data.
This is non-critical, you can run `restic prune` to correct this.
check snapshots, trees and blobs
error: failed to load snapshot cce5e7a4: LoadRaw(<snapshot/cce5e7a420>): invalid data returned
[0:16] 100.00% 246 / 246 snapshots
read all data
[3:37:46] 100.00% 42806 / 42806 packs
The repository is damaged and must be repaired. Please follow the troubleshooting guide at https://restic.readthedocs.io/en/stable/077_troubleshooting.html .
Fatal: repository contains errors
Since a full backup was already done (optional step 4), I am fine with just getting rid of the broken snapshot (assuming that is what the problem is). How do I fix this repo?
Start by running restic forget cce5e7a4, or if that doesn’t work because restic is unable to load that snapshot, you can also just delete that snapshot’s file in the snapshots/ folder in the repository.
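A minimal sketch of both options, assuming the repository is reachable from the current host (the repository path is a placeholder and the full snapshot file name is longer than the short ID shown here):
$ restic forget cce5e7a4 -r /path/to/repo
# or, if restic cannot load the snapshot at all, remove its file from the repository by hand:
$ rm /path/to/repo/snapshots/cce5e7a420[...]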
Any idea why the data in that snapshot is invalid? Something caused it, so you might want to treat this as something potentially bigger and debug it a bit before removing the snapshot.
Please provide some information on your setup, that is, the restic version, storage backend, etc.
There should be a file at snapshots/cc/cce5e7a420[...] in the repository. What does shasum -a256 snapshots/cc/cce5e7a420[...] return? (The file name should match the SHA-256 of the file’s content.) Is there anything suspicious about that file?
I already deleted the snapshot via rm. (And now everything works again.)
The broken repository is stored on my NAS, on a RAID 1 with three drives: two WD drives (different models) and one drive whose manufacturer I don’t know right now. The underlying fs is btrfs.
I push new snapshots using sftp.
But this repository is just a secondary one. Each (external) system pushes to a secondary repository, and there is a primary repository (also on the NAS) pulling snapshots from them (via the copy command). This is done to mitigate ransomware attacks.
I recently had (and actually still have) problems with the restic process being killed due to heavy memory usage. Maybe that had an impact on the corrupted secondary repository? The process performs copy, forget and prune on the secondary repositories. I have currently paused this process.
I am currently running restic check --read-data on the primary repository to verify that it is not broken. Afterwards I am going to investigate the OOM problem.
You’ve again dodged the question about which restic version you’re using. Recent versions mostly ensure that files are added to the repository atomically, so a file should either be fully uploaded or not present at all. An OOM kill therefore shouldn’t result in damaged snapshots (unless maybe the NAS crashes).
OOM shouldn’t be able to damage the repository, but either way it’s still important to investigate.
Sorry, I was on the road. I am using restic 0.17.3 (on all systems).
The restic check --read-data on my primary repository did not find any errors when run from my laptop (using the sftp backend).
Then I tried to execute restic check on the NAS (local backend), but before I started the command I restarted the NAS to free memory:
$ export GOMEMLIMIT=1000MiB # NAS only has 1.6 GB RAM.
$ ./restic check --verbose=2 -r /volume1/Restic-Main/restic-repository/
using temporary cache in /tmp/restic-check-cache-2925758990
create exclusive lock for repository
enter password for repository:
repository 4708ce08 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-2925758990
load indexes
signal interrupt received, cleaning up
[17:49] 50.00% 1 / 2 index files loaded
$ free -m
total used free shared buff/cache available
Mem: 1644 1391 90 1 162 74
Swap: 2047 1975 72
I cancelled the command after 13 minutes; after 17 minutes it still said “cleaning up”, and then the process was killed.
created new cache in /tmp/restic-check-cache-2925758990
load indexes
signal interrupt received, cleaning up
Killed] 50.00% 1 / 2 index files loaded
What is wrong here?
Previously I used GOMEMLIMIT=800MiB for my backup script, which backs up the files on the NAS, copies from the secondary repositories into the main repository, and runs forget --prune on the secondary and primary repositories. You can see my script here: [link to Pastebin]. I redacted paths, hosts (restic forget) and passwords, and removed restic-unrelated stuff (https://healthchecks.io/ pings etc.).
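For readers without access to the Pastebin link, here is a minimal sketch of the structure described above. This is not the redacted original: the backed-up paths, retention flags and secondary repository locations are placeholders, and password handling is omitted entirely.
#!/bin/sh
# keep restic's heap within the NAS's limited RAM
export GOMEMLIMIT=800MiB
export RESTIC_CACHE_DIR=/volume1/Restic-Main/.cache/restic

MAIN_REPO="/volume1/Restic-Main/restic-repository/"
# RESTIC_PASSWORD / per-repository password handling omitted here

# 1. back up the NAS's own files into the main repository (path is a placeholder)
./restic backup /volume1/shares -r "${MAIN_REPO}"

# 2. pull the snapshots pushed by the external systems into the main repository
#    and prune each secondary repository (location is a placeholder)
for SECONDARY_REPO in /volume1/Restic-Secondary/*; do
    ./restic copy --from-repo "${SECONDARY_REPO}" -r "${MAIN_REPO}"
    ./restic forget --prune --keep-daily 7 --keep-weekly 5 -r "${SECONDARY_REPO}"
done

# 3. apply the retention policy to the main repository
./restic forget --prune --keep-daily 7 --keep-weekly 5 -r "${MAIN_REPO}"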
And I was wondering… Why are there only two index files? The much smaller secondary repository has 164 index files.
That NAS is absolutely overloaded. With that much swapped out memory it’s no surprise that simple commands can take ages.
With 2 index files, restic shouldn’t require much more than 200MB RAM. So I’m not exactly sure what’s going on. When exactly did you run free -m? Before, during or after check?
How large is the repository? Did you maybe run prune on one but not the other? forgetMainRepo in your script doesn’t include --prune.
P.S. There’s no need for all those export statements in the script. For example, RESTIC_REPOSITORY="${MAIN_REPO}" RESTIC_PASSWORD="${MAIN_PASSWORD}" ./restic would only set those environment variables for that specific call.
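To illustrate the difference (a minimal sketch; restic snapshots just stands in for any restic invocation):
# exported variables stay set for every later command in the script:
$ export RESTIC_REPOSITORY="${MAIN_REPO}"
$ export RESTIC_PASSWORD="${MAIN_PASSWORD}"
$ ./restic snapshots
# prefixing the assignments sets them only for that single invocation:
$ RESTIC_REPOSITORY="${MAIN_REPO}" RESTIC_PASSWORD="${MAIN_PASSWORD}" ./restic snapshots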
The repo has a size of 2.053 TiB and contains 8,373,840 blobs, 127,647 packs and 1062 snapshots. I added the prune to my monthly script (which checks the whole repo and now prunes as well). Thanks for the hint. Btw, I just did a prune:
# different host, accessing NAS via sftp backend.
$ restic prune --repack-small
repository 4708ce08 opened (version 2, compression level auto)
loading indexes...
[0:21] 100.00% 2 / 2 index files loaded
loading all snapshots...
finding data that is still in use for 1062 snapshots
[1:35] 100.00% 1062 / 1062 snapshots
searching used packs...
collecting packs for deletion and repacking
[1:20] 100.00% 127647 / 127647 packs processed
to repack: 0 blobs / 0 B
this removes: 0 blobs / 0 B
to delete: 0 blobs / 0 B
total prune: 0 blobs / 0 B
remaining: 8373840 blobs / 2.053 TiB
unused size after prune: 104.981 GiB (4.99% of remaining size)
done
I restarted the NAS again and executed the following command:
$ export RESTIC_CACHE_DIR=/volume1/Restic-Main/.cache/restic # to reduce CPU usage (Synology is notorious for burning lots of CPU indexing files, so keep the cache out of the home folder)
$ export GOMEMLIMIT=800MiB
$ ./restic check --verbose=2 -r /volume1/Restic-Main/restic-repository/
using temporary cache in /volume1/Restic-Main/.cache/restic/restic-check-cache-610377298
create exclusive lock for repository
enter password for repository:
repository 4708ce08 opened (version 2, compression level auto)
created new cache in /volume1/Restic-Main/.cache/restic/restic-check-cache-610377298
load indexes
signal interrupt received, cleaning up
[6:56] 50.00% 1 / 2 index files loaded
The RESTIC_CACHE_DIR and GOMEMLIMIT env vars are those I use in my backup script as well.
Here is the result of free:
$ free # before start
total used free shared buff/cache available
Mem: 1683792 585632 183252 138916 914908 720404
Swap: 2097084 15180 2081904
$ free -m # after 15s
total used free shared buff/cache available
Mem: 1644 1226 88 67 328 139
Swap: 2047 301 1746
$ free -m # after 45s
total used free shared buff/cache available
Mem: 1644 1381 80 37 182 55
Swap: 2047 654 1393
$ free -m # after 60s
total used free shared buff/cache available
Mem: 1644 1345 144 1 154 124
Swap: 2047 916 1131
$ free -m # after 120s
total used free shared buff/cache available
Mem: 1644 1339 97 1 207 103
Swap: 2047 1130 917
$ free -m # after 4min
total used free shared buff/cache available
Mem: 1644 1345 89 1 208 96
Swap: 2047 1103 944
$ free -m # after 6min
total used free shared buff/cache available
Mem: 1644 1356 89 1 198 90
Swap: 2047 1083 964
I know, the NAS is overloaded.
When I cancel the check command, does the cleanup also need memory? Because the cleanup has now been running for 10 minutes again (and is still not finished).
This combination is completely unexpected. Restic aims for index files with 50k or 150k blobs each (depending on the restic version), and a pack file can contain at most 420k blobs (a hard limit that cannot be bypassed). With 8.37 million blobs that should translate into dozens of index files (8.37M / 150k ≈ 56), so ending up with only 2 should be impossible.
The current rule of thumb is 1GB RAM per 7 million blobs. With a very aggressive GC setting this can be reduced to roughly 1GB RAM per 10 million blobs. However, this does not account for oversized index files. Those at least double the absolute minimum required memory to > 1.5GB.
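Applied to the numbers above, assuming the rule of thumb holds for this repository: 8,373,840 blobs / 7,000,000 blobs per GB ≈ 1.2 GB with default GC settings, or ≈ 0.84 GB at 10 million blobs per GB with aggressive GC, and with the oversized index files the practical minimum lands above 1.5 GB, which is essentially all the RAM the NAS has.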
Please take a look at when those index files in the repository were created. Do you know which command was running at that point? How large are the index files?
Yes, restic is likely stuck while trying to decompress the index, which cannot be interrupted. When the memory limit set by GOMEMLIMIT is exceeded, this results in a massive slowdown.
You can run restic repair index --read-all-packs from a host with enough memory to get the index back into a reasonable shape. It will, however, take quite a while. Please create a backup copy of the old index first so that it remains possible to investigate what has happened.
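A minimal sketch of those two steps, assuming the repository path from earlier in the thread; the copy is made on the NAS itself, while the repair is run from the stronger machine over sftp (the host name nas and the backup location are placeholders):
$ cp -a /volume1/Restic-Main/restic-repository/index /volume1/Restic-Main/index-backup
$ restic repair index --read-all-packs -r sftp:nas:/volume1/Restic-Main/restic-repository/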
[Edit] I don’t see how this case could even occur in recent restic versions. All code paths that save an index are guarded by index size checks. [/Edit]
[Edit 2] Out of curiosity: could you run restic stats --mode debug to get information about the size distribution of packs and blobs, to see whether there’s something unusual there? [/Edit]