Error "tree ... is not known"

During a backup run, Restic reported a “tree … is not known” error and then created a snapshot that apparently did not deduplicate against previous data, i.e. it added all the data to the repository again.

I’m wondering why this happened, and how I can prevent this in the future. If this happens over a weekend, I might run out of space on the backup storage before I notice the problem on Monday. 🙂

This was the backup command:

docker run \
    --rm \
    --volume "$SCRIPT_LOCATION:/var/lib/restic_config_volume:ro" \
    --volume "$BACKUP_REPOSITORY_PATH:/var/lib/restic" \
    --volume my-volume:/var/lib/postgresql/data:ro \
    -e RESTIC_REPOSITORY=/var/lib/restic \
    -e RESTIC_PASSWORD_FILE=/var/lib/restic_config_volume/restic_backup_password \
    -e RESTIC_PROGRESS_FPS=0.0016666666666666668 \
    mazzolino/restic:1.7.1 \
    backup \
    --no-cache \
    --host "${HOSTNAME}" \
    --one-file-system \
    --exclude-caches \
    --compression=max \
    --tag "machine_config_id_${MACHINE_CONFIG_ID}" \
    /var/lib/postgresql/data/

This was the error message:

Feb 06 02:34:28 RBC-SRV script_executor.sh[3855649]: using parent snapshot 6bb05132
Feb 06 02:34:28 RBC-SRV script_executor.sh[3855649]: error: tree 188e5321a07f3a0fba5c8f2e5ec915c1ab86753ac2bac3c84e1b72e1922de75b is not known; the repository could be damaged, run `repair index` to try to repair it
Feb 06 02:44:27 RBC-SRV script_executor.sh[3855649]: [10:00] 4.27%  691 files 47.929 GiB, total 2211 files 1.097 TiB, 1 errors ETA 3:56:02

and this was the summary after backup was finished:

Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: Files:        2211 new,     0 changed,     0 unmodified
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: Dirs:           30 new,     0 changed,     0 unmodified
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: Added to the repository: 1.097 TiB (831.236 GiB stored)
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: 
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: processed 2211 files, 1.097 TiB in 3:29:03
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: snapshot cdb8f495 saved
Feb 06 06:03:30 RBC-SRV script_executor.sh[3855649]: Warning: at least one source file could not be read

I then ran restic check:

# restic --no-lock --no-cache check
repository 7a39ad76 opened (version 2, compression level auto)
load indexes
[0:04] 100.00%  49 / 49 index files loaded
check all packs
pack 7aed4355fedaf782bd3fb055403a5a12b6b5cb346faba4e948bce6f7e50cafcf: not referenced in any index
[...]
744 additional files were found in the repo, which likely contain duplicate data.
This is non-critical, you can run `restic prune` to correct this.
check snapshots, trees and blobs
[0:37] 100.00%  9 / 9 snapshots
no errors were found

Based on the suggestions in “Tree could not be loaded - #2 by injerto”, I searched for the tree object:

09858093d70b:/# restic --no-lock --no-cache find --show-pack-id --tree 188e5321a07f3a0fba5c8f2e5ec915c1ab86753ac2bac3c84e1b72e1922de75b
repository 7a39ad76 opened (version 2, compression level auto)
[0:03] 100.00%  54 / 54 index files loaded
Object belongs to pack 9899022a53e02206dd8cb0aa6dece433056a38b354e4de5ddbd07f9336643464
 ... Pack 9899022a: <Blob (tree) 188e5321, offset 19286, length 261, uncompressed length 330>

09858093d70b:/# restic --no-lock --no-cache cat pack 9899022a53e02206dd8cb0aa6dece433056a38b354e4de5ddbd07f9336643464 | wc
       76       423     20198

But I don’t know what this tells me.

The backups since then have worked fine. I have not run prune or repair index yet; those would be my next steps.

This happened with restic 0.16.2 (compiled with go1.21.3 on linux/amd64) from the “mazzolino/restic” Docker image, version 1.7.1, on Ubuntu 20.04.

The Restic repository is located on a Samba share. I’m not aware of any problems with this particular share, but I could imagine that network drives have temporary hiccups.

There should be no cache involved (I use --no-cache when doing the backup).

So, my questions are:

  • is there anything I can do to prevent this problem in the future?
  • do you want to debug this problem further, in case it’s an unknown bug?
  • do you have suggestions for automatically repairing this? I suppose that if I run prune automatically after the backup, it would throw away the duplicate data (i.e. no space would be wasted, only CPU time)? See the sketch below.
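To illustrate, the nightly script would then end with something like this (just a sketch; the volume and environment flags are the same as in the backup command above, elided here as [...]):

# run the backup as above, then prune to drop duplicate data
docker run --rm [...] mazzolino/restic:1.7.1 backup [...] &&
docker run --rm [...] mazzolino/restic:1.7.1 prune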

Btw. thanks a lot for creating Restic! I use it at home and at work, for several TBs of data by now, and it’s so nice that it just solves the backup problem.

Welcome here, @oliver!

The only side effect of an unknown parent tree is that restic must read all your data locally. However, it only uploads data that is not already in the repository.

As check did not report an error, there’s no need to run repair index. That said, I’d recommend running restic check --read-data to verify the whole repository (it looks like a local backup, so reading everything costs nothing extra).
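If reading the entire repository on every run takes too long, check can also verify only part of the pack files per invocation; e.g. this reads group 1 of 5 (cycle the group number across runs to eventually cover everything):

restic check --read-data-subset=1/5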

Does that back up the raw postgres data folder? If yes, postgres MUST be stopped during the backup or the backup will be corrupt. Alternatives are explicitly creating a filesystem snapshot or backing up a database dump.
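For the dump variant, restic can read the backup data directly from stdin, so the dump never has to touch the disk. A minimal sketch (the container name my-postgres and the DB user are made up):

# dump all databases and pipe the dump straight into the repository
docker exec my-postgres pg_dumpall -U postgres \
    | restic backup --stdin --stdin-filename postgres_dumpall.sql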

It looks like the backup produced more errors / relevant log messages in addition to the one you’ve already posted. In particular, there should be a line stating whether a parent snapshot was found and which one. But, as already mentioned, deduplication works independently of that.

DBs tend to make lots of small modifications across a large part of their data storage. Restic uses an average chunk size of about 1 MB, so there’s a high probability of such changes affecting most chunks. That might explain why a lot of new data was added during the backup.
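A rough back-of-envelope (the numbers are made up, just to illustrate): 1 TiB split into ~1 MB chunks gives on the order of 10^6 chunks. If the DB touches n randomly placed pages, the expected fraction of untouched chunks is roughly (1 - 1/10^6)^n ≈ e^(-n/10^6), so around a million scattered small writes already leave only ~37% of the chunks unmodified.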

The tree error might be explained by restic not seeing one file in the index folder of the repository. Is the restic repository only used by that host? Can multiple processes run in parallel?

If some indexes were not visible to restic, that could also cause duplicate data to be uploaded. That said, I don’t know your usual change rate, so it’s hard to estimate whether one or multiple index files would have to be affected.

Which OS / kernel version do you run? Ubuntu 20.04 can mean pretty much anything in terms of kernel version. We had problems with data corruption over Samba a few years ago (see “restic fails on repo mounted via CIFS/samba on Linux using go 1.14 build”, Issue #2659 · restic/restic · GitHub). My assumption is that kernel versions from 2022 onwards should be fine.

If a tree blob were actually missing, then prune would refuse to modify the repository. prune will remove the duplicate data. To see how much duplicate data there is, pass the -v option to prune.
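If you first want to see what prune would do without modifying anything, it also supports a dry run:

restic prune -v --dry-run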

Well, that essentially only tells you that the tree does exist. So the question becomes “why did the backup not find that tree” rather than “why is the tree missing”.


Good idea, I started restic --no-lock --no-cache check --read-data now, but it’s probably going to take about 6 hours to finish. I use --no-lock here because the repository is mounted read-only, but I’m certain that there are no other users of this repository at the moment.

Yes, the Postgres Docker container is shut down and removed before the backup is started. I’m not happy with this entire approach (due to the downtime, and due to the size of incremental backups); but so far I haven’t encountered a really good solution.

There was a log message about the parent snapshot right before the “tree … is not known” error message:

Feb 06 02:34:28 RBC-SRV script_executor.sh[3855649]: using parent snapshot 6bb05132

And this sounds sensible – 6bb05132 is the id of the snapshot that had been created the night before.

I think my initial post included all Restic log messages, apart from lots of progress messages like these:

Feb 06 02:44:27 RBC-SRV script_executor.sh[3855649]: [10:00] 4.27%  691 files 47.929 GiB, total 2211 files 1.097 TiB, 1 errors ETA 3:56:02
Feb 06 02:44:27 RBC-SRV script_executor.sh[3855649]: /var/lib/postgresql/data/base/13445/16457.104
Feb 06 02:44:27 RBC-SRV script_executor.sh[3855649]: /var/lib/postgresql/data/base/13445/16457.1040
Feb 06 02:54:27 RBC-SRV script_executor.sh[3855649]: [20:00] 9.17%  746 files 102.920 GiB, total 2211 files 1.097 TiB, 1 errors ETA 3:00:07
Feb 06 02:54:27 RBC-SRV script_executor.sh[3855649]: /var/lib/postgresql/data/base/13445/16457.109
Feb 06 02:54:27 RBC-SRV script_executor.sh[3855649]: /var/lib/postgresql/data/base/13445/16457.1090

Well, there haven’t been any modifications to the data in the DB in the past weeks, and the nightly backup always logged exactly this summary: “Added to the repository: 63.103 MiB (32.743 MiB stored)”.
Only on this one night did it jump to “Added to the repository: 1.097 TiB (831.236 GiB stored)”, and the night after it went back down to “Added to the repository: 1.016 KiB (891 B stored)”.

So to me it does sound a bit like Restic did a full backup again, and did not deduplicate in this one case.

This repository should only be accessed by this one Restic backup process.
There are other repos on the same network share, though, which are accessed at the same time. And a monitoring tool periodically inspects the number and age of the files in the snapshots/ and locks/ directories (but these checks happen at the file level, not through a Restic process).

Oh, right:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

$ uname -a
Linux RBC-SRV 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

That issue sounds quite interesting! I have now checked syslog (should have thought of that before), and it shows a kernel error right after the Restic error:

Feb 06 02:34:28 RBC-SRV script_executor.sh[3855649]: using parent snapshot 6bb05132
Feb 06 02:34:28 RBC-SRV script_executor.sh[3855649]: error: tree 188e5321a07f3a0fba5c8f2e5ec915c1ab86753ac2bac3c84e1b72e1922de75b is not known; the repository could be damaged, run `repair index` to try to repair it
Feb 06 02:34:30 RBC-SRV kernel: CIFS: VFS: \\192.168.26.8 No task to wake, unknown frame received! NumMids 3

(followed by a packet dump).

I don’t know if that CIFS error message is really related to the Restic error. But it sounds plausible that there was a problem with retries/EINTR when reading from the CIFS share, which could have caused Restic to not see the entire repo contents, leading to the full backup.

I have now also noticed the warning at https://restic.readthedocs.io/en/latest/030_preparing_a_new_repo.html#local regarding CIFS mounts (duh!), and will add -e GODEBUG=asyncpreemptoff=1 to the Docker command line. And I have now searched for “Warning” in the entire Restic documentation 🙂 and checked that none of the ten matches apply to my setup.
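For reference, that is just one additional environment variable in the backup command from my first post (all other flags elided):

docker run \
    --rm \
    [...] \
    -e GODEBUG=asyncpreemptoff=1 \
    mazzolino/restic:1.7.1 \
    backup \
    [...]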

I’ll keep an eye on this, but I hope that with the GODEBUG workaround it is fixed now. Many thanks for your help! Sorry for not noticing the warning about CIFS before.

You’re aware that the server hasn’t seen any security updates or bugfixes in 4 years? In that state the server shouldn’t be connected to a network anymore. The CIFS error might have been fixed in the meantime.

--no-lock is ignored by the check command.

I didn’t look too closely at the timestamps of the “tree … is not known” error before. It was printed right at the start, so it’s possible that the index folder appeared empty to restic. That could be a side effect of the buggy handling of Go’s async preemption in older kernel versions. So if check is happy, your repository should be fine, and with the GODEBUG workaround it will hopefully stay that way.
