Infinite retention for deleted files

Hi,

One feature that I appreciated a lot with CrashPlan (when they still had a personal plan) was the ability to have an infinite (or independent) retention period for the last version of a deleted file.

The idea would be that when a file gets locally deleted, the last version in the back-ups is kept forever/for a configurable time.

Currently, with restic, if I’m not mistaken, deleted files might be pruned out of all snapshots if the file only lived for a short time. Also, there is no way to ensure that the final version of a file is kept when removing the snapshot containing that final version (i.e. the snapshot taken just before the file was deleted; the following snapshot would no longer contain the file).

One idea could be to check, when removing snapshots, for all files whether the following snapshot still contains that file. All files that are not contained by the following snapshot could be added to a “fake” snapshot which would effectively replace the removed snapshot and only reference all the files deleted between this snapshot and the next (multiple directly adjacent fake snapshots could be merged into one).

I hope this makes sense :wink:

Thank you,
Clemens


This would be an interesting feature, but I think it could be really difficult given restic’s design, which is to take snapshots and forget/prune them according to specific policies. You could make a special snapshot with a tag like keep, for example, to back up those files that you know could change or be deleted; that way you can apply forget --keep-tag keep to never delete those specific snapshots. That’s one workaround that I know, but it may not be entirely what you’re asking for.
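A minimal sketch of that workaround (the path and the retention flags are placeholders):

    # Tag a one-off snapshot of the files you want to preserve...
    restic backup --tag keep ~/Documents/important

    # ...then exclude keep-tagged snapshots from the usual retention policy.
    restic forget --keep-tag keep --keep-daily 7 --keep-weekly 4 --prune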

This specific behavior could be achieved with rsync or rclone using --backup and --backup-dir: files that are no longer in your local directory would be moved to your --backup-dir instead of being deleted.
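For example (paths and remote names are placeholders):

    # rsync: deleted/overwritten files land in a dated backup directory
    rsync -a --delete --backup --backup-dir=/backups/$(date +%F) /data/ /mirror/

    # rclone: same idea for a remote destination
    rclone sync /data remote:mirror --backup-dir remote:archive/$(date +%F)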

Thank you for your reply, @Dj0k3. But yes, I was looking for something more automated (I would definitely forget to run that special backup before deleting files most of the time …).

I might have been too naive in my thinking, not being familiar yet with the code base …

My thought was to integrate with the policy-based forget-routine. Use the same logic to figure out which snapshots to forget, and run the “file-deletion-detection” logic for every snapshot that is to be removed; and instead of simply removing snapshots from the chain, replace them with the “fake” snapshots - which could be normal snapshots with a special tag.
On the forget-runs, all the “fake” snapshots would need to be ignored when calculating which snapshots to remove.

The pruning process should not be affected at all, I believe.

The catch is that this feature is complex, particularly because restic separates file names / paths from their contents. The contents of a file are stored as one or more blobs, and the tree (directory) object that holds a file records its name along with the blob ID(s) that are concatenated to form its contents (see the example after the list below). This raises several issues with the feature you propose:

  • If a file is moved/renamed, but not changed, then in restic, the tree entry is effectively moved/renamed in a subsequent snapshot, but the blobs remain the same. As far as restic is concerned, the file was deleted from one location and a new file with identical contents was created elsewhere.
  • If a file is changed, some or all of the blobs are different, and the difference is added to the repository as one or more new blob objects.
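To make the name/content separation concrete: a tree object is just JSON mapping names to blob IDs. The IDs below are made up and the output is abridged, but you can inspect real ones with something like restic cat blob <tree-id>:

    $ restic cat blob 4a2c9e1f...
    {
      "nodes": [
        {
          "name": "report.pdf",
          "type": "file",
          "size": 123456,
          "content": ["2f7d2c3a...", "9a1b3f08..."]
        }
      ]
    }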

So, from this, we have to figure out what “last version of a file” means.

  • If a file is renamed, do we keep the data indefinitely, even though it wasn’t actually the “last version?” How should restic detect that a rename happened?
    • What if you copy a file and then rename the original? The file history “splits” here, and both files can be changed independently, so at what discrete point in time do we decide that a “last version” needs to be produced?
    • If a file is both changed and renamed, how do we detect that this even happened?
  • If a file is changed, then we don’t want to keep the old data indefinitely – unless that data is also part of a “last version” of something else, due to deduplication.

The feature you are asking for makes conceptual sense if you think of individual files as having some persistent identity. This is obvious to humans because this is how humans think.

However, to restic, there is no such persistent identity. Each snapshot is entirely independent and unrelated to all others except that they might share some objects. Restic does not make any attempt to track individual file history; it is only concerned with discrete snapshots of one or more directory structures.

Relationships between snapshots (such as in restic diff) are not persisted anywhere and are computed on demand. Any implementation of this kind of feature would need to solve the problems listed above.


Thank you so much for your detailed reply, @cdhowie.

But I am wondering if we are overthinking the issue. Allow me to lay out a simple but concrete use case:

Acceptance criteria: The last version of every file is retained in the archive when the file is deleted.

Assumptions: A file is defined by its location.

Out of scope: Handling of file movement, and of deletion followed by re-creation of a file at the same location.

Which, for me, the user in our contrived story, would be entirely fine.

Re “if a file is renamed, do we keep the data indefinitely”: handling of renaming would not be part of this story.

Re the copy-then-rename scenario: as we define a file by its path for the sake of this story, this would be a non-issue.

Re detecting a combined change-and-rename: as above.

Re changed files and deduplication: I believe this would be handled automatically by today’s mechanisms.

Re files having a persistent identity: actually, I don’t think of them that way :wink: I did not imply any identity beyond the path.

Re relationships between snapshots: such a relationship would not be needed here. The only relationship required to successfully implement this story is whether a certain path exists in the next snapshot, which should be possible to compute on demand, whenever restic forget is run.
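A rough sketch of that on-demand check, assuming two snapshot IDs and paths without embedded newlines (restic ls prints a header line, hence the grep):

    # Paths present in OLD but absent from NEXT (candidates for archiving).
    comm -23 <(restic ls OLD_ID | grep '^/' | sort) \
             <(restic ls NEXT_ID | grep '^/' | sort)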

The main question is: does it add net value for the user? For me, it would, the benefits significantly outweighing the potential cost of storage inefficiencies. Without it, restic forget feels like Russian roulette.

Thank you for bearing with me,
Clemens

A really expensive idea would be to run diff after every snapshot, comparing the latest to the one before it, restore the deleted files to a temp directory, and take another snapshot with the “fake” tag just for that temp directory. But if the deleted files are too big, it could take a lot of time.

Sorry, I didn’t really see that this was under “Features and Ideas”. I think it would be nice to have such a feature. As for me, my use case doesn’t require it: if one of my work files is deleted and I haven’t restored it within a week, it means it was an unneeded or unimportant file. If it was important, I’ll have at least a month to search for it and restore it, which in my case is more than enough time. My files, once I’ve finished working with them, are rarely used, but I need to keep them for at least 10 years. So, my policy is to keep the latest 8 snapshots, 7 days, 4 weeks, 12 months, and 10 years.
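In restic flags, that policy would look something like this:

    restic forget --keep-last 8 --keep-daily 7 --keep-weekly 4 \
                  --keep-monthly 12 --keep-yearly 10 --prune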

I think what you’re asking for is an incremental backup solution, and AFAIK restic’s model is very different from, for example, duplicity’s, which makes a full backup first and then incremental backups (only adding new data without deleting the old). The downside of duplicity is that it makes a whole new full backup from time to time to protect you from data corruption, so it takes a lot more space than restic.

That is also why I suggested rsync or rclone in my comment.

Restic has literally all of the advantages of an incremental system, plus more, but with none of the downsides (as far as I can tell). The core concept of an incremental system is that we only need to store the content that makes the new backup different from the prior backup. Restic does this, and then some – so I would certainly claim that restic meets the criteria for being incremental, but with many advantages over traditional incremental systems like tar and duplicity.

  • In an incremental backup system, you cannot delete any backups, especially not the level-0 backup. Doing so would invalidate every following (non-level-0) backup. With restic, you can delete any snapshot at any time and not lose anything else.
  • Traditional incremental backup systems do not deduplicate any data within a backup, much less between backups. Restic can deduplicate between everything in the same repository, which means it can even deduplicate across multiple backups of multiple targets.

I don’t think “what you’re asking for is an incremental backup solution” makes sense then for two reasons:

  1. Restic is, at its core, an implementation of incremental backup.
  2. An incremental backup wouldn’t even help with the “keep the last version of deleted files” problem.

@ionos Alright, let’s run with that idea for a bit.

For brevity I am going to refer to the “last version” concept as an “archive.” An archive is a single file’s final version, as stored somehow in the restic repository.

Side note: nothing here is intended to be criticism of this idea overall. It’s just a brain dump of what I’ve been thinking in regards to it, in particular the technical challenges it presents.

Fair enough. So this system would have the following limitations:

  • A file that gets moved/renamed results in an archive of the file’s contents before it was moved.
  • If, between two backups, a file is deleted and a new file is created at the same path, this does not result in an archive of the deleted file because the backup system was not able to observe the absence of a file at that path.
  • Any kind of a collection/database type of file, which itself logically contains multiple files (think ZIP file, or mail client databases) is opaque to restic, and so deletions of files within that container will not create any sort of archive. (This would hardly be a limitation unique to restic; I’m not aware of other backup solutions that dig into collections.)

I don’t think this is unreasonable. We would need two additional processes to implement this.

  1. We need the ability to determine when to create archives.
    • This is still somewhat tricky. When a snapshot is forgotten, we need to locate a snapshot that qualifies as immediately subsequent, which we could do based on the same grouping mechanism selected with --group-by for the forget operation: look for the snapshot immediately following it by timestamp, within the same group.
    • Building on the above, say that we have four snapshots in a group: A, B, C, and D. We forget C; this creates archives for all paths in C that are not present in D. Then, in a later operation, we forget B. How do we determine which files in B to archive? D is now the immediately-subsequent snapshot, but comparing B to D is going to generate false positives – files in B might have also been present in C, but not in D. So they were archived from C, and they will again be archived from B. We would need a way to notice the presence of C and use it alongside D to compute an “effective state” for what paths were contained in C.
    • If forgetting multiple snapshots, restic should determine which archives to create for all snapshots to be forgotten before any of them are actually forgotten. In other words, in the A-B-C-D example, if we forget B and C at the same time, B compares to C and C compares to D, and only then are B and C forgotten. (This prevents the case where restic processes C first and compares C to D; if it then forgot C, B would compare to D instead of C. A sketch of this ordering follows the list below.)
  2. We need somewhere to reliably and sensibly collect such archives.
    • It makes obvious sense to store them in the same repository somehow; this way we can simply not delete the blobs for an archive, and these blobs can continue to participate in deduplication.
    • Should archives be stored in a new snapshot? (An “archive snapshot.”)
      • This would cause a forget operation to potentially create as many new snapshots as it forgets. (Not a problem, just pointing it out.)
      • How do we differentiate between a regular snapshot and an archive snapshot? A tag? Some new attribute?
      • What should happen if you try to forget an archive snapshot? Should this be permitted? If it is, it would have to be handled specially so as not to create another archive snapshot.
    • Should a new repository object type be introduced for archives?
      • This would potentially be less confusing and would avoid cluttering the snapshot list with archives.
      • This would also avoid bugs where archive snapshots are treated as regular snapshots, both in restic and in scripts/tooling that has been built up around restic.
      • On the other hand, this adds some complexity to the repository model by introducing a new type that needs its own inspection / restoration commands. (restic archives, etc.)
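A rough sketch of that “compute before forgetting” ordering, assuming jq is available and ignoring --group-by for brevity (a real implementation would look up successors within each group):

    #!/usr/bin/env bash
    # Sketch: for every snapshot selected for removal, record the paths its
    # immediate successor no longer contains -- BEFORE forgetting anything.
    set -euo pipefail
    to_forget=("$@")   # short IDs selected for removal (e.g. via forget --dry-run)

    # All snapshots, oldest first.
    mapfile -t all < <(restic snapshots --json | jq -r 'sort_by(.time)[].short_id')

    for ((i = 0; i < ${#all[@]} - 1; i++)); do
        snap=${all[i]} next=${all[i + 1]}
        [[ " ${to_forget[*]} " == *" $snap "* ]] || continue
        # Paths in $snap but not in its successor: "last versions" at risk.
        comm -23 <(restic ls "$snap" | grep '^/' | sort) \
                 <(restic ls "$next" | grep '^/' | sort) \
            > "archive-candidates-$snap.txt"
    done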

If the main goal is to always keep the latest version of a file and you don’t care about file history and deduplication, rclone copy might be a better fit for your needs.

If anything, restic is more like a differential backup, where every backup after the initial one is independent of the others.


This is exactly why I pointed out duplicity as an example. Don’t get me wrong, I love restic; I switched from duplicity to borg and then to restic, and now I’m using restic for everything. What I mean is that with restic you can actually delete a snapshot that has a unique file in it and lose it forever. With an incremental solution like duplicity, even though you can delete backups older than X, incremental backups depend on a full backup, so it is harder to lose data when there are dependencies.

I agree; restic’s deduplication and model (snapshots) are far superior, and if I were afraid of losing data, I would just keep every snapshot.

As for this idea, the only thing that would worry me (because of space) is this: if I delete a snapshot because I want to get rid of an unneeded file stored in it, and this file is only present in that snapshot, then deleting the snapshot is pointless, because if restic decides to keep “deleted data”, it will not actually delete the file I wanted to delete.

The solution would be to specify the files you want to keep, or to have an option to delete files from the “fake” snapshot that you don’t want to be there. Otherwise, you would have to check what’s inside that “fake” snapshot, restore it to save the needed files, and then delete the fake snapshot, because if you’re worried about space, this fake snapshot will make the repo bigger and bigger even as you delete other snapshots.

This would be the same as keeping all snapshots, but I like it because it would be a little less of a hassle to look in one “fake” snapshot to see what files were deleted.

Basically, “if you can’t delete anything then nothing can be deleted.” If you delete one thing, you lose everything – this makes you hesitant to delete anything. I’m not sure I would call that an advantage. If that’s what you want, then just don’t ever remove any snapshots. You can even enforce this by only interacting with the repository over a REST server in append-only mode.
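For reference, restic’s rest-server already supports that mode (the path is a placeholder):

    # Clients can add snapshots but cannot delete or overwrite anything.
    rest-server --path /srv/restic-repo --append-only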

I’d assume the feature would be entirely optional. You’d either have to specifically configure the repository to enable this feature, or pass some flag to restic forget to opt-in.


That’s exactly my point. It forces you to not delete anything because you may be ending up losing valuable data.

It is not an advantage; I didn’t say that, and my intention was not to imply it.

The way I see it, duplicity is great, and there are many other backup tools with an incremental design that are also great, but they weren’t made for modern storage solutions; with data deduplication available, that kind of tool seems a little outdated to me. This is why I decided to move to deduplication: it is smarter and saves even more space than compressing files, and that is a big step for backup tools, because space means money in a lot of cases.

What would this new feature do? Using only features available in restic today, a manual approach might be: run forget with --dry-run first to see which snapshots would be deleted. Then compare (diff) the snapshots that are going to be deleted with the oldest one that will be kept (or with the local files). Then compare the versions of each file that is going to be deleted and select only the latest one; those “latest files” would be restored to a tmp location to create a new snapshot with a special tag (“Deleted”, maybe) so you can easily identify the snapshot containing the files deleted from other snapshots. At the end, run the actual forget and prune. The next time you do this, you would restore the “Deleted” snapshot to the same tmp location too, so you end up with just one snapshot with a “Deleted” tag containing all deleted files – or skip that step and keep separate snapshots with the “Deleted” tag. This would be easier if you could just add data to an existing snapshot, but that could be dangerous too. This is just brainstorming; I don’t know a lot, so…
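A rough sketch of that workflow for a single pair of snapshots (the IDs are placeholders):

    # 1. Paths removed between OLD and NEW ("-" lines in restic diff).
    restic diff OLD_ID NEW_ID | sed -n 's/^-[[:space:]]*//p' > /tmp/deleted.txt

    # 2. Restore just those paths from the old snapshot.
    args=()
    while IFS= read -r p; do args+=(--include "$p"); done < /tmp/deleted.txt
    restic restore OLD_ID --target /tmp/archive "${args[@]}"

    # 3. Snapshot the restored files under an easy-to-spot tag.
    restic backup --tag Deleted /tmp/archive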

Maybe there’s a way to script the hug out of this (following the new Linux CoC).


My apologies if I did not elaborate enough on this point: For creation and file modifications, I very much do care about de-duplication. For everything except file deletion, restic meets my needs just fine.

It is only the file version right before a file’s deletion that I would like to keep for an independent amount of time/forever, to effectively have an archive function.

Thank you, @cdhowie, this is a great write-up. And I agree with your concerns, these steps might not be trivial, require further thought, etc.

It was an idea/suggestion, something to keep in mind.

Maybe a means to create snapshots manually based on an existing snapshot, using only some of that snapshot’s files and without requiring any local file operations (e.g.: “restic newsnapshot --based-on BaseSnapShotID /file1 /file2 /file3”), could be a first step towards a solution. Then all the logic could be externalized into scripts, written by the user, and it would be reusable for all kinds of weird ideas …

Edit after reading all the posts:

I replied to the earlier post too quickly. This could indeed be a viable solution, especially if one feature were added to restic, as described above: the ability to create new snapshots based on an existing one with a subset of the existing snapshot’s files. This way, the temporary checkout would not be needed.

Indeed. Also, if such an idea ever came to fruition, one could even limit the “keep-last-version-of-deleted-files” functionality to certain sub-directories (e.g. actual documents, but not application config files).

Hey, I read this thread some time ago and started trying to do this, so I created a function for my script called “archive”. You can see the whole script here; you’re free to copy and edit it for your needs, or just use/edit the function itself. You’ll find the function at lines 1538 to 1640. This function does roughly what @Dj0k3 described, but rather than comparing the snapshots that are going to be deleted to the latest ones, it compares the latest two snapshots. Note that this “version 4.0” of my script is in a branch separate from master because I’m not finished yet; I’m still not totally convinced and want to review a few more things.

So, this function compares the two latest snapshots with diff; then, if files have been deleted since the latest snapshot, it restores the deleted files to /tmp/archive. Then it creates a new snapshot using --time '2015-01-01 00:00:00' --tag archive --host $HOSTNAME. The function itself has the ability to specify the hostname if you want. For example, if you’re running this script you can do rescript [repo_name] archive --host [hostname] and it will do its thing for the specified hostname only (if you have more than one host in your repository).

Why compare the two latest snapshots? I figured that from snapshot to snapshot the differences (at least in my case) tend to be much smaller. I take backups every two hours, so if the differences are smaller, the run takes a lot less time. I tested with changes of 10,000 files and it gave me errors, but I think that’s because it passes --include for every file it needs to restore, and apparently restic gets a little troubled by this. With changes of approximately 3,400 files it worked perfectly, and the whole process didn’t even take 5 minutes in my case. What I do is run the archive thing after every snapshot, so all changes get archived.

I hope this helps. Like I said, feel free to copy it, edit it for your needs, etc. Feedback is really welcome and needed. This is just a personal project but since this is me alone, it will be nice to know what I’m doing wrong here, if there is something that can be done better, easily, etc.


Nice @sulfuror. I was reading the thread and thinking the same approach would work. It is not ideal that you have to restore and re-backup the diffs, but the process is effective IMHO.

You could wrap your restore in xargs -n or create a temporary file and use restic --include-from my-tmp-file to avoid topping out with too many files.
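A hedged sketch of the xargs variant (100 paths per restic invocation; paths containing spaces would break this):

    # Turn each deleted path into an --include flag, batched 100 at a time.
    sed 's/^/--include=/' /tmp/deleted.txt \
        | xargs -n 100 restic restore OLD_ID --target /tmp/archive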

P.S. if you wanted to, bash ‘here docs’ might be helpful for multi-line help text in your script
http://tldp.org/LDP/abs/html/here-docs.html

Thank you! I will definitely take a look at the here docs. The problem with --include is that --include-from is not available for restore, just for backup. It would be much easier that way; in fact, it would be a great addition to restic, because if you have a lot of files with a similar name, for example, you could just ls [snapshotID], make a list, filter it, and then use restore --include-from list.txt and that’s it. But right now you must use --include for every file you need to restore, and that’s why I took this approach: create a list with diff, choose only the removed lines, and then build an array with eval so it reads the quotes and parses the output correctly.

Again, thank you. I will look at the here docs and test with xargs.

The problem I am facing is that when forgetting snapshots, some files disappear.
Example:
I have Snapshots 1, 2, 3, 4, 5, 6
When I forget snapshots 3 and 4, all of 3’s and 4’s files which weren’t in 5 disappear, and only old versions of updated files are kept. I think a better idea than a snapshot with all deleted files (which wouldn’t take care of new versions) would be a pseudo “4b” snapshot which contains a merge of 3 and 4.
I know this solution would be useless for something like a git repository, but it would be helpful for single files.

@enboig if you’re using my script, which version are you using? If you’re using the one in the branch I posted, then I don’t know; it was still a testing thing. What I did notice is what I mentioned in the changelog today. This function worked okay, but I noticed that if, for example, a file was in snapshot 1 and deleted from snapshot 2, the first “archive” would be okay. But if that same file was in snapshot 3 and not in snapshot 4, the “archive” function would not save the new version, because when syncing directories, rsync was doing this: rsync -a /path/to/old/snapshot/* /tmp/archive (this assumed that deleted files would not be restored, edited, and deleted again at some point). So instead of syncing changes it was in fact reverting changes made to existing files. I fixed this by making rsync sync the old snapshot to a new directory first, and then the latest snapshot to the same directory, so when the snapshot is taken, it has the new version of deleted files. So, the latest version 4.1 should do what I think you’re referring to.
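A minimal sketch of that corrected ordering (paths are placeholders):

    # Old snapshot first, newest last, so newer file versions win the merge.
    rsync -a /tmp/restore-old-snapshot/ /tmp/archive-merge/
    rsync -a /tmp/restore-new-snapshot/ /tmp/archive-merge/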

If that was not about the script, then sorry; it just appeared as a response to my comment in the notification area.