I have a repo that I created and usually back up to on Windows. I plugged the same drives into my Mac and ran the same backup command pointed at the same directory (both v0.15.2), and even though there should only be a few new files, it seems to have backed up many duplicate files, perhaps even all. Here’s what it said when it completed:
[0:01] 100.00% 12 / 12 index files loaded
processed 258017 files, 1.789 TiB in 2:55:21
snapshot dd0b8a93 saved
When I do ‘diff’ to compare the files of the latest snapshot to the prior one, it indeed lists a bunch of files that had already been backed up a long time ago.
Is there a way to check for sure that it has actually backed up a bunch of duplicates? I’m pretty sure it has, since the free space on the drive with the repo seemingly went down by a lot, but I can’t be certain.
Why might this be happening? Shouldn’t deduplication work cross-platform?
It probably did not select the existing snapshot as parent since both hostnames and the mounted folder names differ, which also can cause all files to be seen as “new”. But seeing:
Added to the repository: 269.204 GiB
might hint there were some changes between the two runs.
But I am not sure whether failing to find a parent snapshot would by itself cause more data to be added; someone else might have a better idea about that.
Hm, “Added to the repository: 269.204 GiB (264.561 GiB stored)” seems weird. Other than that, I would expect all files to be listed as duplicated, because the host and path have changed, yet the overall size of the repo should stay the same due to deduplication. If you do the backup on a machine without an old cache, restic has to actually go through all chunks and check if they are there already.
Can you verify whether the repo actually duplicated in size?
Oh, I can check the size it is now, but I didn’t check the size of the repo folder beforehand. Based on a vague memory of what size it was, it seems like it’s grown considerably but not doubled, but I can’t be sure. I guess I could delete the snapshot and do it again, but it took 3 hours.
Maybe someone else has a different idea how to check this?
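One low-tech way to avoid guessing next time is to record the size of the repo folder before and after a run. A rough Python sketch follows; the repo path in the usage comment is made up, so substitute your own:

```python
import os

def dir_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3

# Usage (hypothetical path): run once before and once after the backup,
# then compare the two numbers with the "Added to the repository" figure.
#   print(f"{to_gib(dir_size_bytes('/Volumes/BackupDrive/restic-repo')):.3f} GiB")
```

If the difference between the two measurements is close to what the backup reported as added, the growth is real and not just a vague impression.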
I’ve honestly never had that problem. But from my understanding of how restic works, it’s not even possible that it added files/chunks that were already in the repo because they would have the same checksum and thus filename. But maybe a correction of that assumption will quickly follow.
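To illustrate the idea, here is a toy content-addressed store in Python. It is only a sketch of the principle (blobs keyed by the SHA-256 of their content) and not restic's actual code: the BlobStore class is invented for this example, and restic additionally splits files into chunks before hashing, which is left out here.

```python
import hashlib

class BlobStore:
    """Toy store: every blob is keyed by the SHA-256 hash of its bytes."""

    def __init__(self):
        self.blobs = {}        # blob ID -> content
        self.bytes_added = 0   # bytes that actually had to be stored

    def put(self, data: bytes) -> str:
        blob_id = hashlib.sha256(data).hexdigest()
        if blob_id not in self.blobs:
            # Content not seen before: store it and count it as new data.
            self.blobs[blob_id] = data
            self.bytes_added += len(data)
        # Content already present: nothing is written, only the ID is returned.
        return blob_id

store = BlobStore()
id_win = store.put(b"holiday photo bytes")   # backed up from Windows
id_mac = store.put(b"holiday photo bytes")   # same file, read on macOS
```

Both calls return the same ID and bytes_added only grows once, regardless of which path or host the bytes came from. So if the repo really grew, the bytes that restic read must have differed somewhere.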
Many of the files with “+” on the list were files that were already in there, so it seems to be saying they were added, but as @nicnab said maybe it’s expected that files would be added but not data blobs; but it looks like data blobs did increase? A surprising number of removed files too.
And sorry to make things complicated, but just to be upfront: there were some new files on the drive since the last backup, but really only a few; wouldn’t have been more than like ~1000 new files totalling ~15GB, nothing close to 200k files and 270GB.
If it’s necessary to get to the bottom of this I could delete the latest snapshot, do the backup on Windows, then come back and do it again on Mac to be sure that 100% of the files that were “added” were already backed up. That would take a few hours, if anyone thinks that would be helpful I’ll do it though!
Thanks everyone for your help so far! Hopefully we can figure this out.
This is not strange, but we have 100k fewer files in the new backup, which is unexpected if the source folder wasn’t changed.
Also, the backup added 269 GiB, but the disk contains 1.789 TiB of data, so I would say restic recognized almost 85% of the data as already stored in the repo. Not too bad.
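For anyone who wants to check that percentage, the arithmetic is below. It uses only the two figures quoted in this thread and treats "processed" and "added" as directly comparable, which is an approximation, since the added figure also includes tree blobs.

```python
added_gib = 269.204        # "Added to the repository" figure from the backup
processed_tib = 1.789      # "processed ... 1.789 TiB" from the backup summary

processed_gib = processed_tib * 1024          # 1 TiB = 1024 GiB
new_fraction = added_gib / processed_gib      # share that had to be stored
reused_fraction = 1 - new_fraction            # share already in the repo

print(f"new: {new_fraction:.1%}, already in repo: {reused_fraction:.1%}")
# prints: new: 14.7%, already in repo: 85.3%
```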
Do you have inclusion/exclusion rules that could match differently between the Windows and macOS runs? I understand you don’t want to publish the whole restic diff output, but I suspect there is something valuable there, since most of the data was correctly matched.
I would look at that diff output and check that every “+” line in the macOS snapshot has a matching “-” line in the Windows one, and focus on the unmatched lines.
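That matching can be scripted so you don't have to do it by eye. The Python sketch below assumes the diff output was saved to a text file and that each relevant line starts with "+" or "-" followed by whitespace and the path. That line format is an assumption, and so are the function name and the two path prefixes in the usage note; adjust them to your own output.

```python
def unmatched_paths(diff_lines, old_prefix, new_prefix):
    """Return (only_in_new, only_in_old) after stripping the path prefixes.

    A "+" path and a "-" path count as the same file when they are equal
    once old_prefix / new_prefix has been removed from the front.
    """
    removed, added = set(), set()
    for line in diff_lines:
        parts = line.split(None, 1)          # marker, rest of the line
        if len(parts) != 2:
            continue
        marker, path = parts[0], parts[1].rstrip("\n")
        if marker == "-" and path.startswith(old_prefix):
            removed.add(path[len(old_prefix):])
        elif marker == "+" and path.startswith(new_prefix):
            added.add(path[len(new_prefix):])
    # Anything left over has no counterpart in the other snapshot.
    return sorted(added - removed), sorted(removed - added)
```

Usage would be something like calling unmatched_paths(open("diff.txt", encoding="utf-8"), "/D", "/Volumes/Backup"), where "/D" and "/Volumes/Backup" stand in for however the drive shows up in each snapshot. The first list holds files that are really new on the macOS side, the second holds files that are missing from it.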
The deduplication code is entirely platform independent, so if the source folder read by restic on Windows and macOS were the same, then there wouldn’t be any differences in the data blobs. Only the tree blobs would differ.
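A toy model of that distinction, in Python. This is not restic's real tree format: the node fields and the metadata values are invented purely to show why a directory listing can hash differently while the file content it points to does not.

```python
import hashlib
import json

def blob_id(data: bytes) -> str:
    """ID of a blob: SHA-256 of its bytes."""
    return hashlib.sha256(data).hexdigest()

def tree_id(nodes) -> str:
    """ID of a toy tree blob: hash of the serialized list of nodes."""
    return blob_id(json.dumps(nodes, sort_keys=True).encode("utf-8"))

content = b"same bytes on both systems"
data_id = blob_id(content)   # identical no matter which OS read the file

# Hypothetical metadata as each OS might report it for the same file.
win_tree = [{"name": "report.txt", "content": [data_id], "uid": 0, "mode": 0o666}]
mac_tree = [{"name": "report.txt", "content": [data_id], "uid": 501, "mode": 0o777}]

same_data = win_tree[0]["content"] == mac_tree[0]["content"]
same_tree = tree_id(win_tree) == tree_id(mac_tree)
```

Here same_data is True and same_tree is False: the file content is stored once, while the small tree blob describing the directory is stored again. Tree blobs are tiny compared to file data, so on their own they should not account for hundreds of GiB.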
That looks like some folder can’t be read due to file permissions. Or maybe some special filetypes or encrypted files exist?
Hmm, AFAIK exFAT doesn’t support any file system feature that could make files look different when switching between OSes. So, I guess your only option is to try and identify which files only exist in one of the snapshots and then take a closer look at them.
Sorry that we might never get the answer, I have Windows on my Mac via Parallels and I’ll probably just do my backups on that. Even if deduplication worked, I want to be able to traverse the directory via fuse and that doesn’t seem possible in this scenario since it views all the files as being removed and added again in the “new” macOS-style path.
It would be nice if there was a way to have every file defined by its path relative to the drive root, rather than including the drive letter/name in its path so that this wouldn’t be an issue, but maybe that is wishful thinking.
It would make it easier if there was a way to navigate just the files that were added in a certain snapshot when mounting a repo with fuse, although diff is working too, I just have to export its output to a text file and search for "+ " to see files added.
You could also give the following a try (after replacing the placeholders, obviously):
restic diff windowsSnapshotId:/c:/path/prefix/on/windows macSnapshotId:/Volumes/path/prefix/on/macOS
With that syntax and a somewhat recent restic version, diff will only compare the disk content and ignore the different path prefixes.