Continuous backup

I’m migrating from CrashPlan, and I like the continuous backup feature. CrashPlan used inotify to monitor for file changes and backed up the modified files every 15 minutes.

  • What are the implications if I run a backup every 15 minutes?

  • It will create around 100 snapshots per day - is this a problem?

  • Files have to be scanned every run. It’s a bit of a waste - is there a better solution?

  • If I hack my own solution (for example by getting a list of modified files with inotify) can I merge these changes into a snapshot, or do I end up with a lot of small snapshots?

Thanks

If you don’t mind me asking: why do you want continuous backup? What benefits does it give you over a scheduled backup?

Well, two things:

  • When I work on a document or on a project, a day represents a lot of work, so the question is what the optimum schedule is… I’d be okay with an hourly backup (15 minutes is what CrashPlan provided), but along the continuum between a day and instant, the shorter the window, the more significant the extra work of going through all the files becomes, and the more helpful the inotify scenario becomes.

  • On (uncommon) occasions I found myself making some change by mistake that I wanted to undo, and I went back in time an hour or two to get a previous version. It’s not an absolute must, I rarely f’ up that bad.


When it comes to backups there is no ‘one-size-fits-all’ approach. Everything depends on your data sets, workflow and storage capacity.

How much data do you want to back up? Do you have data sets that change more often than others? Does deduplication work well on your data sets? What is your planned retention period?

If your data is small (e.g. a few GB), making backups every 15 minutes is no problem and shouldn’t take more than a minute. If you have lots of data and only a subset of it changes often, then you should probably split the data into different data sets and adjust the interval between backups and the retention period to your needs.
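As a rough illustration (the paths and schedules are only placeholders, and the repository is assumed to be configured via RESTIC_REPOSITORY / RESTIC_PASSWORD_FILE in the crontab environment), such a split could look like:

    # fast-changing data every 15 minutes
    */15 * * * *  restic backup "$HOME/Documents" "$HOME/projects"

    # everything else once a day, excluding what the frequent job already covers
    0 3 * * *     restic backup "$HOME" --exclude "$HOME/Documents" --exclude "$HOME/projects"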

I have around 70GB I want to back up on this computer. The stuff that changes often is much smaller - around 1/3 of that.

If I’m smart about it, I can probably split that and run multiple jobs for the different types of data (for example my documents folder, project folder and browser profile, which update very often).

But to be honest, while I can script the heck out of anything, sometimes I want a drop-in solution. I want backup to magically happen and stay out of the way, because I don’t want to manage it. It’s just another thing I can f’ up, and the implications are serious. So I’d rather have a solution I point at my home directory and let loose (other than some really necessary excludes) than a solution that I have to customise not only once but every time I add a project or otherwise change the way I work.


That’s a relatively small amount of data and backups probably won’t take long. Just give it a try.

I absolutely understand that you don’t want to waste too much time on this. I didn’t want to say that different data sets are necessary, I simply wanted to suggest it as an option. Having 1 backup job to handle all the data is perfectly fine. After all only you know which data you have and how to handle it best.

Yes, I’ve already started. It’s just that every invocation starts with around 1 minute of scanning my files again. I was hoping to cut that short by using the inotify interface.

It sounds like you want a sync service more than a backup service. Like, Dropbox or Google Drive. At least, that would be easier, no?

Nope. I want a backup service:

  • I don’t need to sync to a different device

  • I want different versions of every file that I change, and access to files I deleted

  • I don’t want to pay sync prices, backup back-ends are much cheaper

  • I want the backup to be encrypted on the back end, most sync services don’t allow that

I was happy with CrashPlan, which had the continuous backup feature.

More concisely: I’m very happy with what restic offers. The only thing I’m missing is the ability to back up frequently without the penalty of scanning all files.

You have some good points, I can appreciate that.

It’s worth noting that you can use a sync service with a single device, and that most sync services I’ve used track changes and let you see different revisions of a file, including deleted ones.

You are right about backup storage usually being cheaper than sync storage, though.

For what it’s worth, you can tell restic to back up just one file. All you have to do is wrap inotify over restic and invoke backup on the file that changed.
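For example, something like this (an untested sketch; the watched path is just an example and the repository is assumed to be configured via environment variables):

    # back up each file individually as soon as inotify reports it was written
    inotifywait -m -r -e close_write --format '%w%f' "$HOME/Documents" |
    while read -r changed_file; do
        restic backup "$changed_file"
    done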

In the next release of restic (currently on master), using the --quiet argument will skip the initial scan. So maybe give that a try when it’s released. It will still have to work through all your files of course, but as the initial scan is only used for progress estimation, it’s not needed when running in quiet mode.
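In other words, it would just be the usual invocation with --quiet added, something like:

    restic backup --quiet /home/user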

Source: restic/changelog/unreleased/pull-1676 at 6eb1be0be477b4d9064f5c49558a4ca768dd54aa · restic/restic · GitHub


That is exactly what I was asking - whether I can integrate it with inotify somehow. It’s not just the ability to select files to back up - it’s also whether the high number of snapshots would impact performance over, say, years of operation.

Also, having snapshots of individual files will make it really difficult to forget snapshots - the current forget mechanism relies on the frequency of snapshots. If I back up /home/user every hour, after 7 days I can forget all but the first one of each day, for example. If I back up individual files, I get a snapshot containing only /home/user/file1, another snapshot containing only /home/user/file2, and so on. So not only do I have to back up individual files, I also have to somehow create a new snapshot that contains all the files of the previous snapshot of the same root directory (I’m guessing restic uses pointers to existing blobs), except for the files that were changed.
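For reference, the frequency-based policy I mean is the usual forget invocation, something like (the numbers are only illustrative):

    # keep the last 24 hourly snapshots and one per day for a week, drop the rest
    restic forget --keep-hourly 24 --keep-daily 7 --prune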

Now that I’ve had to write it down to explain it, I realise that this will require code changes… Is there a way to suggest new features?

I will definitely use that one, thanks.

I haven’t tried, but isn’t there restic find for this sort of thing, to find a file in a recent snapshot? Because if that worked, and if having lots of little snapshots wasn’t a bad thing, you could use find to locate the file(s) you want and then restore from the snapshot it’s in.
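Something along these lines, assuming restic find behaves the way I expect (the snapshot ID and path are made up):

    # locate the file across snapshots, then restore from the snapshot that has it
    restic find report.odt
    restic restore 1a2b3c4d --target /tmp/restored --include /home/user/docs/report.odt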

You can open an issue on GitHub: Issues · restic/restic · GitHub.

There is. But if you want to restore a folder where one file changed later, you have to know to restore the folder first, then the file. It can be a mess.
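That is, something like (the snapshot IDs and paths are made up):

    # first restore the folder from its last full snapshot...
    restic restore 1111aaaa --target /tmp/recover --include /home/user/project

    # ...then overlay the newer single-file snapshot on top of it
    restic restore 2222bbbb --target /tmp/recover --include /home/user/project/notes.md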

I was about to, but being a good boy I first searched for similar issues and found one, from a few months ago:

The author pretty much follows the same rationale that I did, complete with the interim and final conclusions in this post, like he read my mind before I even thought about it. Kudos @alphapapa. Looks like it was accepted as a feature request by @fd0.


Great minds… :slight_smile:


You should take a look at fswatch to trigger restic on file changes.


Since looking at restic I have had the same thoughts as @alphapapa and @arikb

My performance problem with restic is not actually backing up the files, dedup, or the (lack of) compression, but the fact that it spends 99% of its time pointlessly scanning hundreds of thousands of files to notice they haven’t changed since an hour ago :slight_smile: That’s a lot of wasted effort and I/O that continuous solutions like CrashPlan and Carbonite avoid.

restic supports backing up a list of specific files (--files-from). So my thought was to have an inotify wrapper/daemon that would build a list of modified files for 15-60 minutes, and then dispatch a restic backup with the accumulated file path list. Then once every 12-24 hours it would run restic with a full scan to ensure nothing is missed (inotify and NTFS change events are not guaranteed; they use kernel memory and just get dropped if not picked up in time).

A continuous wrapper would reduce hourly restic backup times from ~30 minutes of runtime to ~3 minutes, and greatly reduce the hourly I/O load on the filesystem.

Has this been done already? Anyone seen a generic inotify wrapper that could do this or be adapted to do this?

I guess a shell script with inotifywait could do it for local filesystems.
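Something along these lines, perhaps (an untested sketch, not a polished daemon; it assumes inotify-tools is installed and the repository is configured via RESTIC_REPOSITORY / RESTIC_PASSWORD_FILE in the environment):

    #!/usr/bin/env bash
    WATCH_DIR="$HOME"
    INTERVAL=900                     # accumulate changes for 15 minutes

    while true; do
        LIST="$(mktemp)"
        # collect modified paths for one interval, one per line
        inotifywait -m -r -e close_write,create,moved_to --format '%w%f' \
            "$WATCH_DIR" > "$LIST" &
        WATCHER=$!
        sleep "$INTERVAL"
        kill "$WATCHER"

        # back up only what changed in this window (de-duplicated)
        sort -u "$LIST" -o "$LIST"
        if [ -s "$LIST" ]; then
            restic backup --files-from "$LIST"
        fi
        rm -f "$LIST"
    done

A separate daily full restic backup of the whole directory would still be needed to cover any events inotify drops, as mentioned above.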


Just tried this as my home folder backups are very slow even for a few changed files.

Using fswatch on macOS, I simply piped the files to a logfile and used that as input to --files-from when backing up with restic.

Seems to work fine. Just make sure you’re not piping to the logfile while restic is running: restic only seems to check for file existence initially, and it will fail if fswatch later adds files that don’t exist when restic tries to back them up (it crashes with lstat /path/to/file: no such file or directory). I just rotated the log and ran restic on the previous logfile.
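For reference, roughly the kind of loop I mean (a rough sketch rather than exactly what I ran; it stops fswatch before each backup so nothing can be appended to the list mid-run, and it assumes fswatch plus a repository configured via environment variables):

    #!/usr/bin/env bash
    WATCH_DIR="$HOME"
    LOG="/tmp/restic-changed.log"

    while true; do
        fswatch -r "$WATCH_DIR" > "$LOG" &   # start collecting changed paths
        FSWATCH_PID=$!
        sleep 3600                           # one hour's worth of changes
        kill "$FSWATCH_PID"

        sort -u "$LOG" > "$LOG.prev"         # "rotate" the finished window
        if [ -s "$LOG.prev" ]; then
            restic backup --files-from "$LOG.prev"
        fi
    done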

At first I could not get restic to apply my exclude file (--exclude-file) when used together with --files-from, which would mean parsing the logfile manually and removing stuff you don’t want. Actually, the exclude file does seem to work - the excluded path is still listed in the snapshot (because it’s part of the backup command, I guess), but it is empty when I try to restore.

So it seems possible, but maybe a bit of a hacky solution. Not sure it will scale to many files. Also, your snapshot list will look crazy, with each snapshot listing all the individual files backed up. :slight_smile:

It’s definitely fast though. I’m a bit paranoid, so I back up my home folder once per hour. Normally this takes about 50 minutes for a full scan, plus upload. Using the fswatch hack above cut that down to 9 seconds for an hour’s worth of changes.

I know it’s not surprising given what’s happening. Probably just an indication that my current naive restic setup is not ideal.

The only thing preventing me from using it at this point is the horrible snapshot log I get… :stuck_out_tongue: (Using restic snapshots --compact helps with this)

Edit: I should note that I have millions of data files that I back up. Some of them change often while others never change - so I guess I should only back up the static ones once in a while.

On second thought, this won’t work well because there’s no meaningful parent snapshot to use. So deduplication won’t work, and finding a backed-up version of a file will require you to look through all snapshots - not just the latest.