Identification of near-duplicates

mark999 · March 6, 2024, 3:03am

Hi!

I haven’t used restic yet but it looks pretty awesome.

Has there been any consideration given to using minhash or the like to identify files that are similar but not identical?

I have a bunch of old disks (and images thereof) lying around. Aside from purely duplicate files, I suspect there are a lot of near-duplicate files where I copied a directory over to a new computer and only made small changes. Once I’ve located one file in my backups, I’d like to be able to query for occurrences of similar files, where the similarity is by file name, file content, or maybe even full path to the file.

I searched but it appears restic doesn’t currently have functionality like this.

Since the files aren’t identical, it wouldn’t be possible to deduplicate them except destructively. However, if the near-duplicates are sufficiently large, it could potentially save space to order them for consecutive compression.
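To make the MinHash idea concrete, here is a minimal sketch of what I have in mind. This is not something restic provides: the function names (`shingles`, `minhash_signature`, `estimated_similarity`), the 8-byte shingle size, and the 64 hash functions are all assumptions chosen for illustration.

```python
import hashlib


def shingles(data: bytes, k: int = 8) -> set[bytes]:
    """Return the set of all k-byte substrings ("shingles") of data."""
    if len(data) < k:
        return {data}
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def minhash_signature(data: bytes, num_hashes: int = 64, k: int = 8) -> list[int]:
    """Build a MinHash signature: for each salted hash function,
    keep the smallest hash value seen over all shingles."""
    sh = shingles(data, k)
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a family of hash functions.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in sh
        ))
    return signature


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, which estimates the
    Jaccard similarity of the two underlying shingle sets."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The signatures are small and could be stored in a side index, so a query for "files like this one" would only compare signatures rather than reading file contents again.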

martinleben · March 6, 2024, 9:09am

Welcome!

Deduplication is not performed at the file level, so near-identical files which are “sufficiently large” will still be deduplicated for the parts of the files which are in fact identical.
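To illustrate the point, here is a toy sketch of chunk-level deduplication. It is not restic's actual chunker (restic uses a Rabin-fingerprint-based content-defined chunker with much larger chunks); the names `chunk` and `store`, the 16-byte window, and the roughly 64-byte average chunk size are assumptions picked to keep the example small.

```python
import hashlib

WINDOW = 16   # bytes of local context that decide where a chunk ends
MASK = 0x3F   # cut when the low 6 bits of the window hash are zero


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries. A boundary depends only
    on the preceding WINDOW bytes, so an edit shifts nearby boundaries
    but leaves chunks further away unchanged."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        window = data[end - WINDOW:end]
        h = int.from_bytes(
            hashlib.blake2b(window, digest_size=4).digest(), "big")
        if h & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def store(data: bytes, repo: dict[str, bytes]) -> int:
    """Add data to repo keyed by chunk hash; return the bytes newly stored."""
    added = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in repo:
            repo[key] = c
            added += len(c)
    return added


# Two near-identical "files": the second has a small insertion in the middle.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(625))
modified = original[:10000] + b"small edit" + original[10000:]

repo: dict[str, bytes] = {}
print("first file stored:", store(original, repo), "of", len(original), "bytes")
print("second file stored:", store(modified, repo), "of", len(modified), "bytes")
```

Only the chunks around the edit are new in the second file; everything else is already in the repository and is not stored again.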

nicnab · March 6, 2024, 10:09am

Here’s a little explanation.

mark999 · March 7, 2024, 4:55am

I see. I’m interested in this as a search feature rather than a space-saving feature.

mark999 · March 9, 2024, 5:36am

@martinleben @nicnab Any thoughts on a search feature for near duplicates?

martinleben · March 9, 2024, 4:26pm

Restic is a backup program.

I don’t understand what you want. Can you please describe your use case and your goal?

mark999 · March 9, 2024, 9:12pm

In the old disks/disk images I have lying around, I suspect there are a bunch of near-duplicate files, i.e. I copied a file from an old computer to a new one and changed the file slightly.

When these backups are all in restic, it’d be nice to be able to search for files that only differ slightly.

kapitainsky · March 9, 2024, 9:31pm

restic is backup software: identical or slightly different files will result in significant space savings when backed up to the same repository.

If you need a tool to search content based on your specific criteria, I suggest you look for software designed to do it. restic has clear objectives that it meets rather well, and it does not try to replace other programs. It does not allow editing PDF files or writing essays either.

Your question touches on an interesting subject but does not belong on this forum.

nicnab · March 9, 2024, 10:46pm

winkingSage · March 16, 2024, 9:46am

If you need deduplication, give czkawka (GitHub: qarmin/czkawka, a multi-functional app to find duplicates, empty folders, similar images, etc.) a look. It is able to find similar files with the correct settings too.