Tracking files being moved, renamed, duplicated (and, maybe, modified)

In the last couple of days, it occurred to me that restic might contain a solution to a toy problem I’ve been carrying around for months: tracking files (and possibly their duplicates) as they are moved, renamed, and maybe even edited. Let me ramble a bit about my motivation first.

Over the years, I have accumulated external hard drives onto which I occasionally put data I wanted to “keep” - media file collections, or bulk copies from old computers as I abandoned them. For several reasons, they have accumulated many duplicates over time - be it because one of the disks was deteriorating and I tried to copy everything to another one, or just because I had multiple copies of data on different machines that I eventually archived.

To prevent further bitrot, I bulk-copied all contents from my external drives onto a large new ZFS “mirror” pool in a new computer, with one dataset (or subdirectory) per external disk. Now I want to consolidate, but I thought it would be great if I could take a sort of snapshot of which file is where before consolidating, in case I later want to trace a file’s history in some way, or just see how I used to structure files and directories back then.

Last year I began writing a Python script that recursively walks a given directory and takes note of all files, directories, and symlinks, including checksums, sizes, inodes, device IDs, mount point, and hostname (which, I now discover, is quite similar to what restic does), and saves all of this to a database. The idea was to later use the database to trace the evolution of files and (probable) duplicates by connecting those records by device/inode, checksum, and/or filename over multiple “scans” over time, and thereby virtually “recreate” the old structure should I want to.
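For context, the core of that script looks roughly like this (a minimal sketch; the table layout and the choice of SHA-256 are just illustrative, not what restic itself uses):

```python
import hashlib
import os
import socket
import sqlite3

def sha256_of(path, bufsize=1 << 20):
    """Stream a file through SHA-256 so large files don't fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scan(root, db_path="scan.sqlite"):
    """Walk `root` and record one row per filesystem entry."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS entry (
        scanned_at TEXT DEFAULT CURRENT_TIMESTAMP,
        hostname TEXT, path TEXT, type TEXT,
        size INTEGER, inode INTEGER, device INTEGER,
        mtime REAL, sha256 TEXT)""")
    host = socket.gethostname()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            p = os.path.join(dirpath, name)
            st = os.lstat(p)  # don't follow symlinks
            is_file = os.path.isfile(p) and not os.path.islink(p)
            kind = "link" if os.path.islink(p) else "file" if is_file else "dir"
            con.execute(
                "INSERT INTO entry (hostname, path, type, size, inode,"
                " device, mtime, sha256) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                (host, p, kind, st.st_size, st.st_ino, st.st_dev,
                 st.st_mtime, sha256_of(p) if is_file else None))
    con.commit()
    con.close()
```

Connecting rows across scans by (device, inode) then catches renames within one filesystem, while the checksum catches copies across filesystems.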

Now you can probably guess my question:
Is there already any functionality in restic or in another tool to leverage restic’s deduplication to list equal or almost-equal files (within and) across snapshots?
Also, is it possible to query the metadata not just by file name, but also by, say, device id and inode?
If not, does anybody have suggestions on where to start building such an index separately, from restic’s JSON output? I think I would start by putting restic’s indexes into a relational database like SQLite or Postgres to query them via SQL; a rough sketch of what I mean follows below.
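To make that concrete, the first step could look something like this (a sketch only: I’m assuming `restic ls --json <snapshot>` emits newline-delimited JSON, one header object followed by one object per node, and the exact field names may vary between restic versions):

```python
import json
import sqlite3
import subprocess

def load_snapshot_listing(snapshot="latest", db_path="restic_meta.sqlite"):
    """Pipe `restic ls --json` into a SQLite table for SQL querying."""
    out = subprocess.run(
        ["restic", "ls", "--json", snapshot],
        check=True, capture_output=True, text=True).stdout
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS node (
        snapshot TEXT, path TEXT, type TEXT,
        size INTEGER, mtime TEXT)""")
    for line in out.splitlines():
        obj = json.loads(line)
        # Node objects carry a "type"; the snapshot header line doesn't.
        if obj.get("type") not in ("file", "dir", "symlink"):
            continue
        con.execute("INSERT INTO node VALUES (?, ?, ?, ?, ?)",
                    (snapshot, obj.get("path"), obj.get("type"),
                     obj.get("size"), obj.get("mtime")))
    con.commit()
    con.close()

load_snapshot_listing()
```

But this still wouldn’t give me the content chunk lists, which brings me to the next question.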

Is there any prior art for such things?

How would I most efficiently get at the “contents” chunk list for all files in a snapshot (or in some subtree of it)? restic ls doesn’t include that, and recursively calling restic cat for one directory at a time would probably be quite slow. What I mostly want is to recursively get the “tree” of subdirectories including all file information: the list of content chunks plus the other file/dir metadata.
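For illustration, here is that traversal done the slow way, one `restic cat` call per tree blob (a sketch under the assumption that `restic cat blob <id>` prints a tree blob as JSON with a `nodes` array, where file nodes carry a `content` list of chunk IDs and dir nodes a `subtree` ID - I haven’t verified the exact schema):

```python
import json
import subprocess

def restic_json(*args):
    """Run a restic subcommand and parse its stdout as JSON."""
    out = subprocess.run(["restic", *args],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def walk_tree(tree_id, prefix=""):
    """Recursively yield (path, node) for every node under a tree blob."""
    tree = restic_json("cat", "blob", tree_id)
    for node in tree.get("nodes", []):
        path = f"{prefix}/{node['name']}"
        yield path, node
        if node.get("type") == "dir" and node.get("subtree"):
            yield from walk_tree(node["subtree"], path)

# Start from a snapshot's root tree ID (use a concrete snapshot ID
# if "latest" isn't accepted here).
snap = restic_json("cat", "snapshot", "latest")
for path, node in walk_tree(snap["tree"]):
    if node.get("type") == "file":
        print(path, node.get("content", []))
```

Every directory costs one subprocess and one repository round-trip here, which is exactly why I’d prefer a way to dump the whole metadata tree in one go.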

Using restic and hand-coded Python seems like a duplication of effort.
Would a zfs snapshot accomplish the same thing?
These snapshots can also be sent to another computer as a backup.

Thanks for the suggestion!
I think zfs snapshots won’t serve my purpose. I am looking to have rather fine-grained zfs datasets, i.e. my consolidation will cross many of their boundaries.

I am indeed considering zfs as the underlying filesystem for the restic backups, to simplify sending incremental diffs of the backup to geo-redundant copies, but that’s a different story entirely.

I think restic’s index makes my custom Python script obsolete - if only I could get at the entire metadata tree at once, so I could take hold of it with data-wrangling tools I am fluent in, such as SQL.

What you describe here sounds more like regular version control. You have a data set, want to make changes, and then want those changes tracked. Use Git or similar for that.

There are tools for finding duplicate files, which should solve the other part of what you want to do. I can’t name any off the top of my head, but they’re easy to find on Google.

I am using fdupes already.

As restic already has my structure, as well as the chunk-level duplicate information (which gives me more than just duplicates - also “almost-duplicates”), why can’t I just use that? The sketch below shows the kind of query I have in mind.
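For example, once per-file chunk lists sit in a table - say a hypothetical `file_chunks(path, chunk_id)` table with one row per chunk of each file, as could be produced by the traversal sketched earlier - finding files that share chunks is a plain self-join (all names here are illustrative):

```python
import sqlite3

con = sqlite3.connect("restic_meta.sqlite")
# Hypothetical table: file_chunks(path TEXT, chunk_id TEXT),
# one row per content chunk of each file in a snapshot.
query = """
    SELECT a.path, b.path, COUNT(*) AS shared_chunks
    FROM file_chunks a
    JOIN file_chunks b ON a.chunk_id = b.chunk_id AND a.path < b.path
    GROUP BY a.path, b.path
    ORDER BY shared_chunks DESC
"""
for left, right, shared in con.execute(query):
    print(shared, left, right)
```

Exact duplicates share all of their chunks; files that were edited in place would still share most of them.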

It would feel odd to use version control aimed primarily at text files (even if it can also handle binary files) for large amounts of arbitrary data, much of which is binary.