Deduplication seemingly not working cross-platform

brilliant · January 10, 2025, 1:18pm

I have a repo that I created and usually back up to on Windows. I plugged the same drives into my Mac and ran the same backup command pointed at the same directory (both v0.15.2), and even though there should only be a few new files, it seems to have backed up many duplicate files, perhaps even all. Here’s what it said when it completed:
[0:01] 100.00% 12 / 12 index files loaded

Files: 258017 new, 0 changed, 0 unmodified
Dirs: 23516 new, 0 changed, 0 unmodified
Added to the repository: 269.204 GiB (264.561 GiB stored)

processed 258017 files, 1.789 TiB in 2:55:21
snapshot dd0b8a93 saved

When I do ‘diff’ to compare the files of the latest snapshot to the prior one, it indeed lists a bunch of files that had already been backed up a long time ago.

Is there a way to check for sure that it has actually backed up a bunch of duplicates? I’m pretty sure it has, since the free space on the drive with the repo seemingly went down by a lot, but I can’t be certain.
Why might this be happening? Shouldn’t deduplication work cross-platform?

Thank you for your help!

gurkan · January 10, 2025, 2:07pm

It probably did not select the existing snapshot as parent since both hostnames and the mounted folder names differ, which also can cause all files to be seen as “new”. But seeing:

Added to the repository: 269.204 GiB

might hint there were some changes between 2 runs.

But I am not sure whether failing to find a parent snapshot would by itself cause more data to be added; someone else might have a better idea about it.

nicnab · January 10, 2025, 3:40pm

Hm, “Added to the repository: 269.204 GiB (264.561 GiB stored)” seems weird. Other than that, I would expect all files to show up as new because the host and path have changed, yet the overall size of the repo should stay the same due to deduplication. If you do the backup on a machine without an old cache, restic has to actually go through all chunks and check if they are there already.

Can you verify whether the repo actually duplicated in size?

brilliant · January 10, 2025, 3:42pm

Can you verify whether the repo actually duplicated in size?

How would I check?

nicnab · January 10, 2025, 3:54pm

On macOS you can use du -ms repo-folder/ and get the size in MBs.

brilliant · January 10, 2025, 4:26pm

Oh, I can check the size it is now, but I didn’t check the size of the repo folder beforehand. Based on a vague memory of what size it was, it seems like it’s grown considerably but not doubled, but I can’t be sure. I guess I could delete the snapshot and do it again, but it took 3 hours.

nicnab · January 10, 2025, 6:10pm

Maybe someone else has a different idea how to check this?

I’ve honestly never had that problem. But from my understanding of how restic works, it shouldn’t even be possible for it to add files/chunks that were already in the repo, because they would have the same checksum and thus the same filename. But maybe a correction of that assumption will quickly follow.
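A minimal sketch of that idea, assuming a heavily simplified content-addressed store. The BlobStore class is hypothetical and is not restic's code: restic does identify blobs by the SHA-256 hash of their content, but it also splits files into chunks, encrypts them and bundles them into pack files, all of which is left out here.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: a blob's ID is the SHA-256 of its bytes."""

    def __init__(self):
        self.blobs = {}

    def put(self, data):
        """Store data if unseen; return (blob_id, was_newly_stored)."""
        blob_id = hashlib.sha256(data).hexdigest()
        is_new = blob_id not in self.blobs
        if is_new:
            self.blobs[blob_id] = data
        return blob_id, is_new


store = BlobStore()
first_id, first_new = store.put(b"same bytes")    # e.g. read on Windows
second_id, second_new = store.put(b"same bytes")  # e.g. read again on macOS
print(first_id == second_id, first_new, second_new, len(store.blobs))
```

Identical bytes always hash to the same ID regardless of platform or path, so the second put stores nothing. Data can only be added again if the bytes that were read actually differ.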

GuitarBilly · January 10, 2025, 6:28pm

@brilliant I thought to use restic stats but a quick test on my repo shows it does not seem to provide the info you are looking for.

However restic diff seems to do it.
https://restic.readthedocs.io/en/stable/040_backup.html#comparing-snapshots

It has an example:

$ restic -r /srv/restic-repo diff 5845b002 2ab627a6
comparing snapshot ea657ce5 to 2ab627a6:

M    /restic/cmd_diff.go
+    /restic/foo
M    /restic/restic

Files:           0 new,     0 removed,     2 changed
Dirs:            1 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:     14 new,    15 removed
Tree Blobs:      2 new,     1 removed
  Added:   16.403 MiB
  Removed: 16.402 MiB

So if the deduplication works then I expect the Data Blobs not to increase. Or a little bit for metadata?
(Maybe a developer can comment on that)

EDIT:
I now read in your first post that you already did a diff. Can you post the results here, especially the bottom summary?

brilliant · January 10, 2025, 8:59pm

Yes, here’s the summary at the bottom of diff (after a bunch of files were added):

Files:       258017 new, 337784 removed,     0 changed
Dirs:        23516 new, 24605 removed
Others:         19 new,     0 removed
Data Blobs:  209254 new, 91957 removed
Tree Blobs:  22162 new, 22395 removed
  Added:   269.224 GiB
  Removed: 35.568 GiB

Many of the files with “+” on the list were files that were already in there, so it seems to be saying they were added, but as @nicnab said maybe it’s expected that files would be added but not data blobs; but it looks like data blobs did increase? A surprising number of removed files too.

And sorry to make things complicated, but just to be upfront: there were some new files on the drive since the last backup, but really only a few; wouldn’t have been more than like ~1000 new files totalling ~15GB, nothing close to 200k files and 270GB.

If it’s necessary to get to the bottom of this I could delete the latest snapshot, do the backup on Windows, then come back and do it again on Mac to be sure that 100% of the files that were “added” were already backed up. That would take a few hours, if anyone thinks that would be helpful I’ll do it though!

Thanks everyone for your help so far! Hopefully we can figure this out.

GuitarBilly · January 12, 2025, 2:41pm

I think developer input is needed here; tagging @fd0 @MichaelEischer @rawtaz

radel · January 12, 2025, 5:14pm

I don’t know what is happening here, but I would point out some (for me) interesting points.

The new backup treated all the old files as removed and added everything as new:

Files:       258017 new, 337784 removed,     0 changed
Dirs:        23516 new, 24605 removed
Others:         19 new,     0 removed
Data Blobs:  209254 new, 91957 removed
Tree Blobs:  22162 new, 22395 removed
  Added:   269.224 GiB
  Removed: 35.568 GiB

This is not strange, but the new backup has about 80k fewer files (258017 new versus 337784 removed), which is unexpected if the source folder wasn’t changed.

Also, the backup added 269 GiB of data, but the drive contains 1.789 TiB, so I would say restic recognized almost 85% of the data as already stored in the repo. Not too bad.
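As a rough check of that estimate, the figures from the backup summary in the first post can be plugged in directly (the variable names are only for illustration):

```python
# Figures copied from the backup summary in the first post.
added_gib = 269.204      # "Added to the repository: 269.204 GiB"
processed_tib = 1.789    # "processed 258017 files, 1.789 TiB"

processed_gib = processed_tib * 1024      # 1 TiB = 1024 GiB
new_fraction = added_gib / processed_gib  # share of the data that was uploaded
dedup_fraction = 1 - new_fraction         # share already present in the repo

print(f"new data:        {new_fraction:.1%}")
print(f"already in repo: {dedup_fraction:.1%}")
```

That comes out at roughly 15% new data and 85% already stored.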

Do you have inclusion/exclusion rules that could match differently between the Windows and macOS runs? I understand you don’t want to publish the whole restic diff output, but I suspect there is something valuable there, since most of the data was correctly matched.

I would look at that diff output and check that every “+” line in the macOS snapshot has a matching “-” line in the Windows one, and focus on the unmatched lines.
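One way to do that matching without reading the whole list by hand is sketched below. It assumes the diff output was saved to a text file and that each entry looks like the example from the docs quoted earlier (a “+” or “-” marker, some spaces, then an absolute path). The function name and the two prefixes (/D and /Volumes/Drive) are hypothetical placeholders; the real ones depend on how each OS mounted the drive.

```python
def unmatched_paths(diff_lines, old_prefix, new_prefix):
    """Return (only_added, only_removed) after stripping the mount prefixes."""
    added, removed = set(), set()
    for line in diff_lines:
        marker, path = line[:1], line[1:].strip()
        # Keep only "+" / "-" entries followed by an absolute path; this skips
        # "M" lines, blank lines and the summary block at the bottom.
        if marker not in ("+", "-") or not path.startswith("/"):
            continue
        prefix = new_prefix if marker == "+" else old_prefix
        if path.startswith(prefix):
            path = path[len(prefix):]
        (added if marker == "+" else removed).add(path)
    return sorted(added - removed), sorted(removed - added)


sample = [
    "-    /D/photos/a.jpg",
    "+    /Volumes/Drive/photos/a.jpg",
    "+    /Volumes/Drive/photos/b.jpg",
    "-    /D/old/c.txt",
    "Files:           2 new,     2 removed,     0 changed",
]
only_added, only_removed = unmatched_paths(sample, "/D", "/Volumes/Drive")
print("only in new snapshot:", only_added)
print("only in old snapshot:", only_removed)
```

Paths that appear on both sides cancel out, so whatever remains is the set of files that genuinely exists in only one of the two snapshots.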

MichaelEischer · January 14, 2025, 6:46pm

The deduplication code is entirely platform independent, so if the source folder read by restic on Windows and macOS were the same, then there wouldn’t be any differences in the data blobs. Only the tree blobs would differ.

That looks like some folder can’t be read due to file permissions. Or maybe some special filetypes or encrypted files exist?

What filesystem does the drive use?

brilliant · January 16, 2025, 4:24am

The drive uses exFAT

MichaelEischer · January 18, 2025, 6:31pm

Hmm, AFAIK exFAT doesn’t support any file system feature that could make files look different when switching between OSes. So, I guess your only option is to try to identify which files only exist in one of the snapshots and then take a closer look at them.

brilliant · January 18, 2025, 6:59pm

Sorry that we might never get the answer. I have Windows on my Mac via Parallels, and I’ll probably just do my backups there. Even if deduplication worked, I want to be able to traverse the directory via fuse, and that doesn’t seem possible in this scenario since it views all the files as being removed and added again under the “new” macOS-style path.

It would be nice if there was a way to have every file defined by its path relative to the drive root, rather than including the drive letter/name in its path so that this wouldn’t be an issue, but maybe that is wishful thinking.

brilliant · January 19, 2025, 8:17am

It would make it easier if there was a way to navigate just the files that were added in a certain snapshot when mounting a repo with fuse, although diff is working too, I just have to export its output to a text file and search for "+ " to see files added.

radel · January 19, 2025, 9:25am

If you often need to know what was added during a backup run, you can run restic backup with -vv added to your options and save its output.

You’ll find a line for each file/directory in the path, prepended with new, modified or unchanged.

It lacks a “deleted” type because, I think, it prints a line for each file/dir found during the disk scan.
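A small filter over such a saved log could look like the sketch below. The exact line layout is an assumption based on the description above: a status word (new, modified or unchanged), then the path, optionally followed by a “, saved in …” suffix. The function name is made up for illustration.

```python
def files_by_status(log_lines, status="new"):
    """Return the paths whose line starts with the given status word."""
    paths = []
    for line in log_lines:
        word, _, rest = line.strip().partition(" ")
        if word != status:
            continue
        rest = rest.strip()
        # Drop the trailing ", saved in ..." details when present; rpartition
        # keeps paths intact even if they contain commas themselves.
        head, sep, _ = rest.rpartition(", saved in ")
        paths.append(head if sep else rest)
    return paths


log = [
    "new       /Volumes/Drive/a.txt, saved in 0.002s (1.2 KiB added)",
    "unchanged /Volumes/Drive/b.txt",
    "modified  /Volumes/Drive/c, d.txt, saved in 0.010s (3 B added)",
]
print(files_by_status(log))
```

Passing status="modified" or status="unchanged" lists the other groups in the same way.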

MichaelEischer · January 19, 2025, 1:43pm

You could also give the following a try (after replacing the placeholders obviously): restic diff windowsSnapshotId:/c:/path/prefix/on/windows macSnapshotId:/Volumes/path/prefix/on/macOS. With that syntax and a somewhat recent restic version, diff will only compare the disk content and ignore the different path prefixes.