In the last couple of days, it occurred to me that restic might contain a solution to a toy problem I’ve been carrying around for months: tracking files (and possibly their duplicates) as they are moved, renamed, and maybe even edited. Let me ramble a bit about my motivation first.
Over the years, I have accumulated external hard drives onto which I occasionally put data that I wanted to “keep”: media file collections, or bulk copies of data from old computers as I abandoned them. Over time they have accumulated many duplicates, either because one of the disks was deteriorating and I tried to copy everything to another one, or simply because I had multiple copies of data on different machines that I eventually archived.
To prevent further bit rot, I bulk copied all contents from my external drives onto a large new ZFS “mirror” cluster in a new computer, with one dataset (or subdirectory) per external disk. Now I want to consolidate, but I thought it would be great if I could first take a sort of snapshot of which file is where, in case I later want to track down a file's history or just see how I used to structure files and directories back then.
Last year I began writing a Python script that recursively walks a given directory and records all files, directories, and symlinks, along with checksums, sizes, inodes, device IDs, mount point, and hostname, and saves this to a database. (I have now discovered that this is quite similar to what restic does.) I thought I could later use the database to trace the evolution of files and (probable) duplicates by connecting those records by device/inode, checksum, and/or filename across multiple “scans” over time, and thereby virtually “recreate” the old structure should I want to.
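To make the idea concrete, here is a minimal sketch of such a scanner. It is not the actual script: the names `scan` and `sha256_of` are made up for illustration, SHA-256 is just one possible checksum, and mount-point detection and the database write are left out.

```python
import hashlib
import os
import socket
import stat


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root):
    """Yield one record per file, directory, or symlink below root."""
    host = socket.gethostname()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat describes a symlink itself, not its target
            if stat.S_ISLNK(st.st_mode):
                kind = "symlink"
            elif stat.S_ISDIR(st.st_mode):
                kind = "dir"
            elif stat.S_ISREG(st.st_mode):
                kind = "file"
            else:
                kind = "other"
            yield {
                "host": host,
                "path": path,
                "kind": kind,
                "size": st.st_size,
                "device": st.st_dev,
                "inode": st.st_ino,
                "mtime": st.st_mtime,
                # Only regular files get a content checksum.
                "sha256": sha256_of(path) if kind == "file" else None,
            }
```

Records from two scans could then be joined on `(device, inode)` to follow renames on the same filesystem, or on `sha256` to find probable duplicates across disks.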
Now you can probably guess my question:
Is there already any functionality in restic or in another tool to leverage restic’s deduplication to list equal or almost-equal files (within and) across snapshots?
Also, is it possible to query the metadata not just by file name, but also by, say, device id and inode?
If not, does anybody have suggestions on where to start building such an index separately from restic’s JSON output? I think I would start by putting restic’s indexes into a relational database like SQLite or Postgres so I can query them via SQL.
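As a rough sketch of what I have in mind, the following loads line-delimited JSON (such as the output of `restic ls --json`) into SQLite and looks entries up by inode. The key names `path`, `type`, `size`, `mtime`, `device_id`, and `inode` are my assumptions about what the JSON contains and may not match every restic version; missing keys simply end up as NULL. The function and table names are made up for illustration.

```python
import json
import sqlite3


def load_nodes(conn, snapshot_id, lines):
    """Insert one row per node found in line-delimited JSON; return the row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS nodes (
               snapshot TEXT, path TEXT, type TEXT,
               size INTEGER, mtime TEXT, device_id INTEGER, inode INTEGER)"""
    )
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # Assumption: file entries carry a "path" key; anything without one
        # (for example a snapshot header line) is skipped.
        if "path" not in obj:
            continue
        rows.append((
            snapshot_id,
            obj["path"],
            obj.get("type"),
            obj.get("size"),
            obj.get("mtime"),
            obj.get("device_id"),  # assumed key name, NULL if absent
            obj.get("inode"),      # assumed key name, NULL if absent
        ))
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def same_inode(conn, inode):
    """Return (snapshot, path) pairs recorded under one inode across loaded snapshots."""
    cur = conn.execute(
        "SELECT snapshot, path FROM nodes WHERE inode = ? ORDER BY snapshot, path",
        (inode,),
    )
    return cur.fetchall()
```

With one call to `load_nodes` per snapshot, a query like `same_inode(conn, 42)` would show where a given inode appeared over time. Finding equal or almost-equal content would additionally need per-file blob or checksum information, which this sketch does not cover.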
Is there any prior art for such things?