Fix broken snapshot

Hello,

restic check reports that there is an error and that I should follow Troubleshooting — restic 0.17.3 documentation.

I followed all the steps (apart from step 2), but restic check still reports an error.

$ restic check --read-data
using temporary cache in /tmp/restic-check-cache-753204985
create exclusive lock for repository
repository d0b94d05 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-753204985
load indexes
[0:06] 100.00%  3 / 3 index files loaded
check all packs
7 additional files were found in the repo, which likely contain duplicate data.
This is non-critical, you can run `restic prune` to correct this.
check snapshots, trees and blobs
error: failed to load snapshot cce5e7a4: LoadRaw(<snapshot/cce5e7a420>): invalid data returned
[0:16] 100.00%  246 / 246 snapshots
read all data
[3:37:46] 100.00%  42806 / 42806 packs

The repository is damaged and must be repaired. Please follow the troubleshooting guide at https://restic.readthedocs.io/en/stable/077_troubleshooting.html .

Fatal: repository contains errors

Since a full backup was already done (optional step 4), I am fine with just getting rid of the broken snapshot (assuming that's what it is). How do I fix this repo?

Kind regards

Start by running restic forget cce5e7a4, or if that doesn’t work because restic is unable to load that snapshot, you can also just delete that snapshot’s file in the snapshots/ folder in the repository.
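
For example, assuming shell access to the repository (the repository path below is only a placeholder):

$ restic -r /path/to/repo forget cce5e7a4
# if restic cannot load the snapshot, locate its file and remove it manually:
$ find /path/to/repo/snapshots -name 'cce5e7a420*'
$ rm <file printed by find>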

Any idea why the data in that snapshot is invalid? Something caused it, so you might want to treat this as something potentially bigger and debug it a bit before removing the snapshot.

Please provide some information about your setup, i.e. restic version, storage backend, etc.

There should be a file at snapshots/cc/cce5e7a420[...] in the repository. What does shasum -a256 snapshots/cc/cce5e7a420[...] return? Is there anything suspicious about that file?
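
For instance, with direct access to the repository (the path is an example), something like this checks the file; restic names repository files after the SHA-256 hash of their content, so the digest should match the file name:

$ cd /path/to/restic-repository
$ ls -l snapshots/cc/
$ shasum -a 256 snapshots/cc/cce5e7a420*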

I already deleted the snapshot via rm. (And now everything works again.)

The broken repository is stored on my NAS in a RAID 1 with three drives: two WD drives (different models) and one drive whose manufacturer I don’t know right now. The underlying filesystem is btrfs.

I push new snapshots using sftp.

But this repository is just a secondary one. Each (external) system pushes to its own secondary repository, and a primary repository (also on the NAS) pulls the snapshots from them via the copy command. This is done to mitigate ransomware attacks.
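
Roughly, the copy step looks like this (the secondary repository path and the password file below are placeholders, not my real configuration):

$ restic -r /volume1/Restic-Main/restic-repository \
    copy --from-repo /volume1/Restic-Secondary/host-a \
         --from-password-file /path/to/secondary-password-file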

I recently had (and actually still have) problems with the restic process being killed due to heavy memory usage. Maybe that had an impact on the corrupted secondary repository? The process performs copy, forget and prune on the secondary repositories. I have currently paused this process.

I am currently running restic check --read-data on the primary repository to verify that it is not broken. Afterwards I am going to investigate the OOM problem.

You’ve again dodged the question regarding the restic version you are using. Recent versions mostly ensure that files are added atomically to the repository, so a file should either be fully uploaded or not present at all. An OOM kill therefore shouldn’t result in damaged snapshots (unless maybe the NAS crashes).

OOM shouldn’t :tm: be able to damage the repository, but either way it’s still important to investigate.

Sorry, I was out. I am using restic 0.17.3 (on all systems).

The restic check --read-data on my primary repository did not find any errors when executed from my laptop (using the sftp backend).

Then I tried to execute restic check on the NAS (local backend), but before I started the command I restarted the NAS to free memory:

$ export GOMEMLIMIT=1000MiB  # NAS only has 1.6 GB RAM.
$ ./restic check --verbose=2 -r /volume1/Restic-Main/restic-repository/ 
using temporary cache in /tmp/restic-check-cache-2925758990
create exclusive lock for repository
enter password for repository: 
repository 4708ce08 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-2925758990
load indexes
signal interrupt received, cleaning up
[17:49] 50.00%  1 / 2 index files loaded
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1644        1391          90           1         162          74
Swap:          2047        1975          72

After 13 minutes I cancelled the command; after 17 minutes it still said “cleaning up”, and then the process was killed:

created new cache in /tmp/restic-check-cache-2925758990
load indexes
signal interrupt received, cleaning up
Killed] 50.00%  1 / 2 index files loaded

What is wrong here?

Previously I used GOMEMLIMIT=800MiB for my backup script, which backs up the files on the NAS, runs copy from the secondary repositories into the main repository, and runs forget --prune on the secondary and primary repositories. You can see my script here: #!/bin/shMAIN_REPO="/volume1/Restic-Main/restic-repository/"MAIN_PASSWORD= - Pastebin.com. I redacted paths, hosts (restic forget) and passwords. I also removed restic-unrelated stuff (https://healthchecks.io/ pings etc.).

And I was wondering… Why are there only two index files? The much smaller secondary repository has 164 index files.

That NAS is absolutely overloaded. With that much swapped out memory it’s no surprise that simple commands can take ages.

With 2 index files, restic shouldn’t require much more than 200MB RAM. So I’m not exactly sure what’s going on. When exactly did you run free -m? Before, during or after check?

How large is the repository? Did you maybe run prune on one but not the other? forgetMainRepo in your script doesn’t include --prune.

P.S. There’s no need for all those export statements in the script. For example, RESTIC_REPOSITORY="${MAIN_REPO}" RESTIC_PASSWORD="${MAIN_PASSWORD}" ./restic would set those environment variables only for that specific call.
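
In the script that would look like this (the forget arguments below are only an illustration, not a recommendation):

# instead of
$ export RESTIC_REPOSITORY="${MAIN_REPO}"
$ export RESTIC_PASSWORD="${MAIN_PASSWORD}"
$ ./restic forget --keep-daily 7

# the variables can be set for a single invocation only
$ RESTIC_REPOSITORY="${MAIN_REPO}" RESTIC_PASSWORD="${MAIN_PASSWORD}" ./restic forget --keep-daily 7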

The repo has a size of 2.053 TiB and contains 8,373,840 blobs, 127,647 packs and 1062 snapshots. I added prune to my monthly script (which checks the whole repo and now prunes as well :slight_smile: ). Thanks for the hint. Btw, I just ran a prune:

# different host, accessing NAS via sftp backend.
$ restic prune --repack-small
repository 4708ce08 opened (version 2, compression level auto)
loading indexes...
[0:21] 100.00%  2 / 2 index files loaded
loading all snapshots...
finding data that is still in use for 1062 snapshots
[1:35] 100.00%  1062 / 1062 snapshots
searching used packs...
collecting packs for deletion and repacking
[1:20] 100.00%  127647 / 127647 packs processed

to repack:             0 blobs / 0 B
this removes:          0 blobs / 0 B
to delete:             0 blobs / 0 B
total prune:           0 blobs / 0 B
remaining:       8373840 blobs / 2.053 TiB
unused size after prune: 104.981 GiB (4.99% of remaining size)

done

I restarted the NAS again and executed the following command:

$ export RESTIC_CACHE_DIR=/volume1/Restic-Main/.cache/restic  # to reduce CPU usage (synology is notorious for burning lots of cpu for indexing files -> get the cache out of the home folder)
$ export GOMEMLIMIT=800MiB
$ ./restic check --verbose=2 -r /volume1/Restic-Main/restic-repository/
using temporary cache in /volume1/Restic-Main/.cache/restic/restic-check-cache-610377298
create exclusive lock for repository
enter password for repository: 
repository 4708ce08 opened (version 2, compression level auto)
created new cache in /volume1/Restic-Main/.cache/restic/restic-check-cache-610377298
load indexes
signal interrupt received, cleaning up
[6:56] 50.00%  1 / 2 index files loaded

The RESTIC_CACHE_DIR and GOMEMLIMIT env vars are those I use in my backup script as well.

Here is the result of free:

$ free   # before start
              total        used        free      shared  buff/cache   available
Mem:        1683792      585632      183252      138916      914908      720404
Swap:       2097084       15180     2081904
$ free -m  # after 15s
              total        used        free      shared  buff/cache   available
Mem:           1644        1226          88          67         328         139
Swap:          2047         301        1746
$ free -m  # after 45s
              total        used        free      shared  buff/cache   available
Mem:           1644        1381          80          37         182          55
Swap:          2047         654        1393
$ free -m  # after 60s
              total        used        free      shared  buff/cache   available
Mem:           1644        1345         144           1         154         124
Swap:          2047         916        1131
$ free -m  # after 120s
              total        used        free      shared  buff/cache   available
Mem:           1644        1339          97           1         207         103
Swap:          2047        1130         917
$ free -m  # after 4min
              total        used        free      shared  buff/cache   available
Mem:           1644        1345          89           1         208          96
Swap:          2047        1103         944
$ free -m  # after 6min
              total        used        free      shared  buff/cache   available
Mem:           1644        1356          89           1         198          90
Swap:          2047        1083         964

I know :slight_smile:

When I cancel the check command, does the cleanup also take memory? The cleanup has now been running for 10 minutes again (and is still not finished).

Any idea what I could try?

Thank you for your help :slight_smile:

This combination is completely unexpected. Restic aims for index files with 50k or 150k blobs in them (depending on the restic version). An index file can also contain at most 420k blobs (this is a hard limit that cannot be bypassed). So, with more than 8 million blobs in the repository, ending up with only 2 index files should be impossible.

The current rule of thumb is 1GB RAM per 7 million blobs. With a very aggressive GC setting this can be reduced to roughly 1GB RAM per 10 million blobs. However, this does not account for oversized index files. Those at least double the absolute minimum required memory to > 1.5GB.
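
If you want to try the aggressive GC route, it would look roughly like this (GOGC=20 is just an example value; GOGC and GOMEMLIMIT are generic Go runtime variables, not restic options):

$ export GOGC=20            # run the garbage collector more often, trading CPU time for memory
$ export GOMEMLIMIT=1000MiB
$ ./restic check --verbose=2 -r /volume1/Restic-Main/restic-repository/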

Please take a look at when those index files in the repository were created. Do you know which command was running at that point? How large are the index files?

Yes, restic is likely stuck while trying to decompress the index, which cannot be interrupted. When the memory limit set by GOMEMLIMIT is exceeded, this results in a massive slowdown.
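
If you want to confirm that the process is thrashing in the garbage collector, the Go runtime can print a trace line for every GC cycle (again a generic Go runtime setting, not a restic flag):

$ GODEBUG=gctrace=1 GOMEMLIMIT=1000MiB ./restic check -r /volume1/Restic-Main/restic-repository/
# lines like "gc 123 @...s ...%: ..." appear on stderr; near-continuous GC cycles indicate thrashing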

You can run restic repair index --read-all-packs from a host with enough memory to get the index back into a reasonable shape. This will, however, take quite a while. But please create a backup copy of the old index first, so that we keep the chance to investigate what has happened.
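
For example, with direct access to the repository (paths are examples):

$ cp -a /volume1/Restic-Main/restic-repository/index /volume1/Restic-Main/index-backup   # keep the old index for analysis
$ restic -r /volume1/Restic-Main/restic-repository repair index --read-all-packs         # rebuild the index from the pack files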

[Edit]I don’t see how this case could even occur in recent restic versions. All code paths that save an index are guarded by index size checks.[/Edit]

[Edit 2]Out of curiosity: could you run restic stats --mode debug to get information about the size distribution of packs and blobs, to see whether there’s something unusual there?[/Edit]

-r--r-----  1 localuser        users 1.4M Feb  1 21:38 27ab1c2de61e66fc995fbc1d49b75203498ac2b3029473711b7237765616144f
-r--r-----  1 localuser        users 2.0M Feb  1 21:23 44a99183b19d3cc037ef0df190e9c570c624526695ea3dfacaa977d7a4d5dca6
-rwxrwxrwx+ 1 backupremoteuser users 388M Jan 21 09:55 5d8bb8041da7cc4696da38f478db899d0f63b83c9f9e3bd4a5b5cde78fa40d25
-rwxrwxrwx+ 1 backupremoteuser users  87K Jan 28 19:11 61cc3b4ca0ef9d6055cfc58c7e0e55dacd3d0ef9ca75edca312b04aae8618533
-r--r-----  1 localuser        users  61K Feb  1 21:38 c68f478a5e9186f0ebe5a4bf90edd98f92ef94207c1feda0247b505516244657

No, I cannot tell what I executed at that point. I know that I already tried to repair the repository at some point.

localuser is the user which runs the backup script on the NAS. backupremoteuser is the one that is used to access the repository manually (e.g. for the restic repair index --read-all-packs you asked me to execute). What’s causing the permissions to be different and so lax?

I now moved the index folder to a new location and (afterwards) executed restic repair index --read-all-packs. ls -lh of the newly created index:

$ ls -lh
total 392M
-rwxrwxrwx+ 1 backupremoteuser users 392M Feb  7 19:05 f21ed3d5f2ac7b4c0b98a6b7c198038ec2d09cfd4aee2243588216ddb55a56f9

The result of restic stats --mode debug after repairing the index:

$ restic stats --mode debug
...
[0:21] 100.00%  1 / 1 index files loaded
Collecting size statistics

File Type: key
Count: 2
Total Size: 907 B
Size            Count
---------------------
100 - 999 Byte  2
---------------------

File Type: lock
Count: 1
Total Size: 157 B
Size            Count
---------------------
100 - 999 Byte  1
---------------------

File Type: index
Count: 1
Total Size: 391.507 MiB
Size  Count
-----------
-----------
Oversized: [410524351]

File Type: data
Count: 127800
Total Size: 2.055 TiB
Size                      Count
--------------------------------
      10000 - 99999 Byte  1
    100000 - 999999 Byte  1
  1000000 - 9999999 Byte  1
10000000 - 99999999 Byte  127797
--------------------------------

Blob Type: data
Count: 6315633
Total Size: 2.052 TiB
Size                    Count
-------------------------------
          10 - 99 Byte  59370
        100 - 999 Byte  1241944
      1000 - 9999 Byte  1965925
    10000 - 99999 Byte  849867
  100000 - 999999 Byte  1403521
1000000 - 9999999 Byte  795006
-------------------------------


Blob Type: tree
Count: 2134060
Total Size: 2.815 GiB
Size                    Count
-------------------------------
          10 - 99 Byte  1
        100 - 999 Byte  1788100
      1000 - 9999 Byte  313207
    10000 - 99999 Byte  29984
  100000 - 999999 Byte  2679
1000000 - 9999999 Byte  89
-------------------------------

I hope you have an idea what’s wrong here… Thank you for looking at this @MichaelEischer

I guess that’s caused by whatever protocol you used to mount the repository on the remote host. The local, sftp and rest backends are rather aggressive at restricting file permissions. If you’re using SFTP, then something on the NAS is silently swallowing the chmod call issued by restic.
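
One way to test this is to chmod a repository file over SFTP and see whether the mode actually changes (the host name below is a placeholder):

$ sftp backupremoteuser@nas
sftp> cd /volume1/Restic-Main/restic-repository/index
sftp> chmod 600 f21ed3d5f2ac7b4c0b98a6b7c198038ec2d09cfd4aee2243588216ddb55a56f9
sftp> ls -l f21ed3d5f2ac7b4c0b98a6b7c198038ec2d09cfd4aee2243588216ddb55a56f9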

:scream_cat: That wasn’t supposed to happen. The expected result would have been lots of 4-8MB files. What restic build, version, OS and architecture did you use exactly (at least the full output of restic version)? The code of restic repair index --read-all-packs is compact enough to allow finding the problem. But I need to know exactly which code version to look at.

P.S.: The output of the debug stats looks perfectly normal except for the oversized index.

I just took another look at the code and still don’t have the slightest clue how such an oversized index can be created. Please try the following: download a fresh, prebuilt copy of a recent restic version (such as 0.17.3) from GitHub. Then rebuild the index as follows: navigate into the data/00 folder of your repository and move one file into the repository root folder next to the config (you must KEEP that file!). Then run restic repair index; it should complain about the missing file and rewrite the index. Then move the file back and run restic repair index again. Now the data file should be added back to the index.
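
A sketch of those steps, assuming local access to the repository and with <pack-file> standing in for whichever file you pick from data/00:

$ cd /volume1/Restic-Main/restic-repository
$ mv data/00/<pack-file> .        # keep this file, do not delete it!
$ restic -r . repair index        # complains about the missing pack file and rewrites the index
$ mv <pack-file> data/00/
$ restic -r . repair index        # adds the pack file back to the index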

[Edit]The oversized index is probably related to using restic repair index --read-all-packs. I’ll dig into it further in the next few days. The above steps should nevertheless work to get the index back into shape.[/Edit]

restic 0.17.3 compiled with go1.23.3 on linux/amd64

Is the index-bak of any value to you, or can I delete it? Is there anything else you need to track down this bug?

That helped. My backup script now works again :partying_face: Thank you very much :+1: :+1: :+1:

I think I have all the information I need. Knowing that the workaround did work confirms my suspicion. Thanks! So you can delete the index backup.

Will be fixed by Prevent creation of oversized indexes and automatically rewrite them. by MichaelEischer · Pull Request #5249 · restic/restic · GitHub.

Thank you very much for your help and your involvement in restic :heart: