Snapshot count mismatch in mirrored repositories

Hello,

I have the following repositories topology:

O (local) - repository containing snapshots with a specific tag
L1 (restic S3) - repository containing snapshots from a specific machine
L2 (restic S3) - repository containing snapshots from another, similar machine
V (local) - repository to which all the snapshots from the repositories above are copied
R1 - R3 (rclone) - repositories in different cloud storage services to which snapshots from V are copied

So here’s a kind of chain of snapshots.

Today I noticed a strange thing:
V contains 9950 snapshots
R1 - 9961
R2 - 9969
R3 - 10160

The results above were gathered with restic --no-cache ... snapshots.

All the repositories are periodically checked with --read-data, and the last check, about a week ago, found no issues.

copy from V to R1, R2, R3 passes fine with no snapshots copied; R1 → V, R2 → V, R3 → V - the same (also with --no-cache).

So how could this even be possible?

One more interesting thing I found: the check subcommand reports that it can’t access the cache directory, while the other subcommands create it:

unable to create temporary directory for cache during check, disabling cache: stat /Volumes/Flash/Temp/Cache/restic/repo: no such file or directory

copy != sync. If the destination repository already contains snapshots that don’t correspond to a source snapshot, then you’ll end up with additional snapshots. Running forget at different times for a repository can result in different sets of snapshots.

That looks like a bug, could you open an issue on GitHub? The check command currently expects that the specified cache directory already exists.
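
A likely direction for the fix would be to create the missing directory instead of failing on stat. Here’s a rough sketch of the idea in Go (the helper name and paths are made up for illustration; this is not restic’s actual code):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureCacheTempDir is a hypothetical helper: create the repository's
// cache directory (and parents) if it is missing, then allocate a
// temporary directory inside it, instead of failing when the path
// doesn't exist yet.
func ensureCacheTempDir(base, repoID string) (string, error) {
	dir := filepath.Join(base, repoID)
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", fmt.Errorf("unable to create cache directory: %w", err)
	}
	return os.MkdirTemp(dir, "check-")
}

func main() {
	dir, err := ensureCacheTempDir("/Volumes/Flash/Temp/Cache/restic", "repo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using temporary cache directory:", dir)
}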

That’s OK, I understand it could be different in some cases, but that doesn’t seem to apply to this particular one… Let me just make sure I’m right in guessing: when you have two repositories A and B, both containing some snapshots, a subsequent copy A → B and copy B → A will effectively sync the two repositories, so the total number of snapshots in both must always be equal (obviously an isolated case: just copy, no forget, no additional snapshots, etc.)?

The thing is, I’m actually not running forget on the remote repositories, just prune to repack the data, plus check, so it looks strange anyway. However, to rule out anything cache-related, I’m currently downloading all three repositories to check them locally; that will take some time before further experiments.

Sure, check should create cache directory if it doesn't exist instead of reporting an error · Issue #4437 · restic/restic · GitHub

When just using copy, then copy A → B and copy B → A should result in an identical number of snapshots.
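
To illustrate why with a toy model (plain string sets, not restic code): copy only ever adds snapshots that are missing from the destination, so copying in both directions converges both repositories to the union of the two sets:

package main

import "fmt"

// copyInto models restic copy as pure set union: every snapshot present
// in src but missing from dst is added; nothing is ever removed.
func copyInto(dst, src map[string]bool) {
	for id := range src {
		dst[id] = true
	}
}

func main() {
	a := map[string]bool{"s1": true, "s2": true}
	b := map[string]bool{"s2": true, "s3": true}

	copyInto(b, a) // copy A → B: B becomes {s1, s2, s3}
	copyInto(a, b) // copy B → A: A becomes {s1, s2, s3}

	fmt.Println(len(a), len(b)) // 3 3 - equal counts afterwards
}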

Things get murky very fast when snapshots are modified, e.g. by changing their tags.

Running prune without forget won’t do much, except cleaning up a few small packfiles.

But you do run forget on the source repository?

The snapshot list is always read directly from the repository. (The snapshot contents may be cached, but as they are immutable, that can’t result in stale data)

That’s also expected behaviour, so I suppose that’s not the case either, but I will check.

That’s exactly why I’m running it - to clean up small packs and compress uncompressed data before copying to the remote repositories (since I have a lot of small snapshots from some systems, and uploading them takes much more time).

Forget runs periodically on the L1 and L2 repositories only (after copying from them I leave only the last 100 snapshots there), so V cannot be affected by forget.

Hm… I could probably try cross-copying between R1, R2 and R3; that should provide some details.
And it’s probably time to compare the snapshot lists. I’ll get back with the results.


Getting back with the results

TL;DR: after comparing the repositories’ snapshot lists, it looks like restic created some duplicated snapshots with different internal IDs.

Details:

// Composite identifier built from snapshot metadata, used to match
// snapshots across repositories independently of their internal IDs:
id := strings.Join([]string{
	snapshot.Time.UTC().Format(time.RFC3339Nano),
	fmt.Sprintf("%d/%d", snapshot.UID, snapshot.Gid),
	snapshot.Username,
	snapshot.Hostname,
	strings.Join(snapshot.Tags, ":"),
	strings.Join(snapshot.Paths, ":"),
}, "|")

^ Here’s the identifier I used for matching; the tree and original fields were also tried, with exactly the same result.
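
For reference, here’s a self-contained sketch of that matching, reading snapshots --json from stdin; the struct fields follow the names in restic’s JSON output (a simplified version of the full main.go linked below):

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"time"
)

// Snapshot mirrors the fields of `restic snapshots --json` output that
// the matching key is built from.
type Snapshot struct {
	Time     time.Time `json:"time"`
	UID      uint32    `json:"uid"`
	Gid      uint32    `json:"gid"`
	Username string    `json:"username"`
	Hostname string    `json:"hostname"`
	Tags     []string  `json:"tags"`
	Paths    []string  `json:"paths"`
}

// key builds the composite identifier quoted above.
func key(snapshot Snapshot) string {
	return strings.Join([]string{
		snapshot.Time.UTC().Format(time.RFC3339Nano),
		fmt.Sprintf("%d/%d", snapshot.UID, snapshot.Gid),
		snapshot.Username,
		snapshot.Hostname,
		strings.Join(snapshot.Tags, ":"),
		strings.Join(snapshot.Paths, ":"),
	}, "|")
}

func main() {
	var snapshots []Snapshot
	if err := json.NewDecoder(os.Stdin).Decode(&snapshots); err != nil {
		fmt.Fprintln(os.Stderr, "decode:", err)
		os.Exit(1)
	}
	for _, s := range snapshots {
		fmt.Println(key(s))
	}
}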

Here’s the full code (main.go · GitHub) I ran against the outputs of snapshots --json gathered from all of the repositories after a successful copy pass (V → {R1, R2, R3}; {R1, R2, R3} → V; R1 → {R2, R3}; R2 → {R1, R3}; R3 → {R1, R2}). So I assume the repositories were consistent at the moment the samples were taken.

And the results: there are no unique snapshots (in terms of the id above, the tree field, or the original field), while the number of snapshots varies from repo to repo:

Set#0 total entries: 11950
Set#1 total entries: 11969
Set#2 total entries: 11961
Set#3 total entries: 12160

So it looks like restic somehow created duplicates during copy - any thoughts?

When copying a snapshot, restic stores its ID in the Original field of the snapshot. The original ID is only set once and never changes afterwards (unless you use rewrite). To prevent duplicate copies, the copy command only copies a snapshot if the destination repository does not contain a snapshot with that original ID (or the original snapshot). This should be enough to prevent duplicate snapshots even across multiple repositories.
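
Roughly, the skip decision looks like this (a simplified sketch of the rule described above, not restic’s actual source):

package main

import "fmt"

// snap is a minimal stand-in for a restic snapshot: its own ID plus the
// Original field that is set when it is first copied from another repo.
type snap struct {
	ID       string
	Original string // empty if the snapshot was never copied
}

// originalID returns the identity used for copy deduplication: the
// Original field if set, otherwise the snapshot's own ID.
func originalID(s snap) string {
	if s.Original != "" {
		return s.Original
	}
	return s.ID
}

// needsCopy reports whether src is missing from dst: the copy is skipped
// if dst already has a snapshot whose ID or Original matches the
// source's original identity.
func needsCopy(src snap, dst []snap) bool {
	want := originalID(src)
	for _, d := range dst {
		if d.ID == want || d.Original == want {
			return false
		}
	}
	return true
}

func main() {
	dst := []snap{{ID: "bbb", Original: "aaa"}}
	fmt.Println(needsCopy(snap{ID: "aaa"}, dst))                  // false: already copied
	fmt.Println(needsCopy(snap{ID: "ccc", Original: "aaa"}, dst)) // false: same original
	fmt.Println(needsCopy(snap{ID: "ddd"}, dst))                  // true: not in dst yet
}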

However, there is one possible corner case that can result in duplicate snapshots: if two copy operations concurrently add the same snapshot to the same destination repository. As these snapshots get different IDs, there’s not much that can be done to prevent that.

To verify that guess: are there multiple snapshots in a single repository with the same Original field (ignoring snapshots with a nil value for that field)?
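
Something like this against the snapshots --json output should answer that (a rough sketch; it groups snapshot IDs by their original field, skipping snapshots without one):

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type snapshot struct {
	ID       string `json:"id"`
	Original string `json:"original"`
}

func main() {
	// Read `restic snapshots --json` output from stdin and group the
	// snapshot IDs by their Original field, ignoring missing ones.
	var snaps []snapshot
	if err := json.NewDecoder(os.Stdin).Decode(&snaps); err != nil {
		fmt.Fprintln(os.Stderr, "decode:", err)
		os.Exit(1)
	}
	byOriginal := make(map[string][]string)
	for _, s := range snaps {
		if s.Original == "" {
			continue
		}
		byOriginal[s.Original] = append(byOriginal[s.Original], s.ID)
	}
	// Every original ID that occurs more than once points at a
	// suspected duplicate copy.
	for orig, ids := range byOriginal {
		if len(ids) > 1 {
			fmt.Printf("original %s has %d copies: %v\n", orig, len(ids), ids)
		}
	}
}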

I actually arrived at a similar guess, but for a slightly different reason: how rclone works with the backend and the restic API. The guess came up because I sometimes receive errors on Save() and Load(), while all the repositories are used exclusively by copy in my case - two at once were never run, for sure.

The same actually happens when I run check --read-data against some services (Mega, for instance) - it can sometimes return an error on Load(), which causes restic to treat the repository as containing errors.

Related to your guess:

22:09:45 ~ → jq .[].original < ~/Downloads/snapshots.json | wc -l
   29990
22:10:01 ~ → jq .[].original < ~/Downloads/snapshots.json | sort -u | wc -l
   29278

Meanwhile the source repository contains 29513 snapshots at the moment (note that snapshots without an original field all show up as null in the counts above), so something is still strange; I’ll keep investigating when I have time.
