Continuous backup

arikb · April 15, 2018, 1:16am

I’m migrating from CrashPlan, and I like its continuous backup feature. CrashPlan used inotify to monitor for file changes and backed up the modified files every 15 minutes.

What are the implications if I run a backup every 15 minutes?
It will create around 100 snapshots per day - is this a problem?
Files have to be scanned every run. It’s a bit of a waste - is there a better solution?
If I hack my own solution (for example by getting a list of modified files with inotify) can I merge these changes into a snapshot, or do I end up with a lot of small snapshots?

Thanks

matt · April 15, 2018, 3:04am

If you don’t mind me asking: why do you want continuous backup? What benefits does it give you over a scheduled backup?

arikb · April 15, 2018, 9:51am

Well, two things:

When I work on a document or on a project, a day represents a lot of work. So the question is what is the optimum schedule… I’d be okay with an hourly backup (15 minutes is what CrashPlan provided), but along the continuum between a day and instant, the shorter the window is - the more significant the extra work of going through all the files becomes and the more helpful the inotify scenario becomes.
On (uncommon) occasions I found myself making some change by mistake that I wanted to undo, and I went back in time an hour or two to get a previous version. It’s not an absolute must, I rarely f’ up that bad.

764287 · April 15, 2018, 12:25pm

When it comes to backups there is no ‘one-size-fits-all’ approach. Everything depends on your data sets, workflow and storage capacity.

How much data do you want to back up? Do you have data sets that change more often than others? Does deduplication work well on your data sets? What is your planned retention period?

If your data is small (i.e. a few GB) making backups every 15min is no problem and shouldn’t take more than 1min. If you have lots of data and only a subpart of it is changing often, then you should probably split the data into different data sets and adjust the time span between backups and the retention period to your needs.

arikb · April 15, 2018, 1:59pm

I have around 70GB I want to back up on this computer. The stuff that changes often is much smaller - around 1/3 of that.

If I’m smart about it, I can probably split that and run multiple jobs for the different types of data (for example my documents folder, project folder and browser profile would update very often).

But to be honest, while I can script the heck out of anything, sometimes I want a drop-in solution. I want backup to magically happen and stay out of the way, because I don’t want to manage it. It’s just another thing I can f’ up and the implications are serious. So I’d rather have a solution I point at my home directory and let loose (other than some really necessary excludes) than a solution that I have to customise not only once but every time I add a project or otherwise change the way I work.

764287 · April 15, 2018, 3:00pm

That’s a relatively small amount of data and backups probably won’t take long. Just give it a try.

I absolutely understand that you don’t want to waste too much time on this. I didn’t want to say that different data sets are necessary, I simply wanted to suggest it as an option. Having 1 backup job to handle all the data is perfectly fine. After all only you know which data you have and how to handle it best.

arikb · April 15, 2018, 3:03pm

Yes, I’ve already started. It’s just that every invocation starts with around 1 minute of scanning my files again. I was hoping to cut that short by using the inotify interface.

matt · April 15, 2018, 3:13pm

It sounds like you want a sync service more than a backup service. Like, Dropbox or Google Drive. At least, that would be easier, no?

arikb · April 15, 2018, 3:32pm

Nope. I want a backup service:

I don’t need to sync to a different device
I want different versions of every file that I change, and access to files I deleted
I don’t want to pay sync prices, backup back-ends are much cheaper
I want the backup to be encrypted on the back end, most sync services don’t allow that

I was happy with CrashPlan, which had the continuous backup feature.

More concisely: I’m very happy with what restic offers. The only thing I’m missing is the ability to back up frequently without the penalty of scanning all files.

matt · April 15, 2018, 4:50pm

You have some good points, I can appreciate that.

It’s worth noting that you can use a sync service with a single device, and that most sync services I’ve used track changes and let you see different revisions of a file, including deleted ones.

You are right about backup storage usually being cheaper than sync storage, though.

For what it’s worth, you can tell restic to back up just one file. All you have to do is wrap inotify over restic and invoke backup on the file that changed.

askielboe · April 15, 2018, 5:40pm

In the next release of restic (currently on master) using the --quiet argument will skip the initial scan. So maybe give that a try when it’s released. It will still have to work through all your files of course, but as the initial scan is only used for progress estimation, it’s not needed when running in quiet mode.
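For anyone scripting this, a scheduled run using that flag could be put together roughly as below. This is only a sketch: the function name `build_backup_command`, the repository path and the backup target are made up for illustration, and the code just builds the command line without running restic.

```python
import shlex


def build_backup_command(repo, paths, quiet=True):
    """Return the argv list for one restic backup run.

    `repo` and `paths` are whatever your setup uses; nothing is executed here.
    """
    cmd = ["restic", "--repo", repo, "backup"]
    if quiet:
        # Quiet mode suppresses progress output, so the scan that is only
        # used for progress estimation is not needed.
        cmd.append("--quiet")
    cmd.extend(paths)
    return cmd


if __name__ == "__main__":
    # Hypothetical repository location and backup target.
    cmd = build_backup_command("/srv/restic-repo", ["/home/user"])
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` from a cron job or systemd timer.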

Source: restic/changelog/unreleased/pull-1676 at 6eb1be0be477b4d9064f5c49558a4ca768dd54aa · restic/restic · GitHub

arikb · April 15, 2018, 11:17pm

That is exactly what I was asking - if I can integrate it with inotify somehow. It’s not just the ability to select files to back up - it’s also whether the high number of snapshots would impact performance over, say, years of operation.

Also, having snapshots of individual files will make it really difficult to forget snapshots - the current forget mechanism relies on the frequency of snapshots. If I back up /home/user every hour, after 7 days I can forget all but the 1st one of each day, for example. If I back up individual files, I get a snapshot containing only /home/user/file1 and another snapshot containing only /home/user/file2, etc. So I not only have to back up individual files, I have to somehow create a new snapshot that contains all the files of the previous snapshot of the same root directory (I’m guessing restic uses pointers to existing blobs), except for the files that were changed.
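To make the retention idea concrete, the "hourly for a week, then one per day" policy could be expressed with restic's `forget` options along these lines. Treat it as a sketch: the counts (168 hourly, 30 daily) and the helper name `build_forget_command` are illustrative assumptions, and restic keeps the most recent snapshot in each hour or day rather than the first one.

```python
def build_forget_command(repo, keep_hourly=168, keep_daily=30, prune=False):
    """Return the argv list for a restic forget run (nothing is executed).

    keep_hourly=168 covers roughly 7 days of hourly snapshots;
    keep_daily=30 is an arbitrary example value.
    """
    cmd = [
        "restic", "--repo", repo, "forget",
        "--keep-hourly", str(keep_hourly),
        "--keep-daily", str(keep_daily),
    ]
    if prune:
        # Also remove data that is no longer referenced by any snapshot.
        cmd.append("--prune")
    return cmd


if __name__ == "__main__":
    # Hypothetical repository path.
    print(" ".join(build_forget_command("/srv/restic-repo")))
```

A policy like this only thins things out sensibly when the snapshots cover the same paths, which is the difficulty with per-file snapshots described above.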

Now that I had to write it down to explain it I realise that this will require code changes… Is there a way to suggest new features?

I will definitely use that one, thanks.

matt · April 17, 2018, 3:57pm

I haven’t tried, but isn’t there restic find for this sort of thing, to find a file in a recent snapshot? Because if that worked, and if having lots of little snapshots wasn’t a bad thing, you could use find to locate the file(s) you want and then restore from the snapshot it’s in.

You can open an issue on GitHub: Issues · restic/restic · GitHub.

arikb · April 17, 2018, 11:09pm

There is. But if you want to restore a folder where one file changed later, you have to know to restore the folder first, then the file. It can be a mess.

I was about to, but being a good boy I first searched for similar issues and found one, from a few months ago:

github.com/restic/restic

Daemon to monitor changes in real-time and run restic

opened 11:24PM - 24 Dec 17 UTC

alphapapa

category: backup type: feature suggestion

Having used CrashPlan and Obnam for a while, I'm very impressed by Restic. When… it gains compression (and optionally disabled encryption), I think it will be a great alternative (for the software, not the backend service). One of the nice things about CrashPlan is its real-time service that watches for changes to backup sets (using e.g. inotify on Linux) and runs the actual backup every so often, as configured. And one of the reasons I'm so impressed with Restic is how fast its actual backup phase is. When backing up a large set of files with few of them having changed, most of the time is spent in the scan phase (e.g. #1160). So, since Restic is so flexible about how it receives the list of paths to backup, it would be nice if there were a daemon that could run in the background and watch for changes to certain paths, then run Restic every so often on the files/directories that have changed. That could make Restic really fast and relatively lightweight for desktop backups. Since Restic is so flexible in this regard, I don't know if it would even be necessary for such a daemon to be specific to Restic. If not, there might already be some solutions available, in which case we don't need to do anything, but documenting it would be very helpful. :) What do you think? Thanks for your work on Restic!

The author pretty much follows through the same rationale that I did, complete with the interim and final conclusions in this post, like he read my mind before I even thought about it. Kudos @alphapapa. Looks like it was accepted as a feature request by @fd0.

alphapapa · April 26, 2018, 4:05am

Great minds…

Vartkat · June 28, 2018, 5:27pm

You should take a look at fswatch to trigger restic on file change.

whereisaaron · July 24, 2018, 7:45pm

Since looking at restic I have had the same thoughts as @alphapapa and @arikb

My performance problem with restic is not the actual backing up of files, or dedup, or (lack of) compression, but the fact that it spends 99% of its time pointlessly scanning 100,000s of files to notice they haven’t changed since an hour ago. That’s a lot of wasted effort and I/O that continuous solutions like CrashPlan and Carbonite avoid.

restic supports backing up a list of specific files (--files-from). So my thought was to have an inotify wrapper/daemon that would build a list of modified files for 15-60 minutes, and then dispatch a restic backup with the accumulated file path list. Then once every 12-24 hours it would run restic with a full scan to ensure nothing is missed (inotify and NTFS events are not guaranteed; they use kernel memory and just get dropped if not picked up in time).
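The batching logic described here might look something like the following sketch. It is not an existing tool: `ChangeBatcher` and its methods are hypothetical names, the watcher itself (inotifywait, fswatch, or similar) is assumed to call `record()` for each changed path, and the code only builds the restic command for the caller to run.

```python
import tempfile
import time


class ChangeBatcher:
    """Collect changed paths and decide which kind of backup run is due."""

    def __init__(self, root, full_scan_every=24 * 3600, now=time.time):
        self.root = root
        self.full_scan_every = full_scan_every  # seconds between full scans
        self.now = now  # injectable clock, handy for testing
        self.pending = set()
        self.last_full_scan = None  # None forces a full scan on the first run

    def record(self, path):
        """Remember a path reported by the file watcher."""
        self.pending.add(path)

    def next_command(self, repo):
        """Return the restic argv for the next run, or None if nothing to do."""
        t = self.now()
        if self.last_full_scan is None or t - self.last_full_scan >= self.full_scan_every:
            # A periodic full scan catches events the watcher dropped.
            self.last_full_scan = t
            self.pending.clear()
            return ["restic", "--repo", repo, "backup", self.root]
        if not self.pending:
            return None
        # Write the accumulated paths to a list file for --files-from.
        with tempfile.NamedTemporaryFile("w", suffix=".lst", delete=False) as f:
            f.write("\n".join(sorted(self.pending)) + "\n")
        # Note: the batch is cleared before the backup has succeeded; a real
        # wrapper would want to re-queue these paths if restic fails.
        self.pending.clear()
        return ["restic", "--repo", repo, "backup", "--files-from", f.name]
```

A driver loop would call `next_command()` every 15-60 minutes, pass the result to `subprocess.run`, and delete the list file afterwards.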

A continuous wrapper would reduce hourly restic backup times from ~30 minutes of runtime to ~3 minutes, and greatly reduce the hourly I/O operations on the filesystem.

Has this been done already? Anyone seen a generic inotify wrapper that could do this or be adapted to do this?

I guess a shell script with inotifywait could do it for local filesystems.

askielboe · July 25, 2018, 11:51am

Just tried this as my home folder backups are very slow even for a few changed files.

Using fswatch on macOS I simply piped the files to a logfile and used that as input to --files-from when backing up with restic.

Seems to work fine. Just make sure you’re not piping to the logfile while restic is running as restic only seems to check for file existence initially, and will fail if any other files are added by fswatch later, that don’t exist when restic tries to back them up (crash with lstat /path/to/file: no such file or directory). I just rotated the log and ran restic on the previous logfile.
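The rotate-then-backup step could be sketched as below. This is a guess at one way to do it, not the exact procedure used above: `rotate_and_filter` is a hypothetical helper, and it assumes the watcher reopens the logfile for each write (or is restarted after rotation), since a process holding the old file open would keep writing to the rotated copy.

```python
import os


def rotate_and_filter(log_path, batch_path):
    """Move the watcher log aside, then write a cleaned path list.

    Returns the number of paths written to batch_path.
    """
    rotated = log_path + ".rotated"
    # Rename first so new events go to a fresh logfile.
    os.replace(log_path, rotated)
    with open(rotated) as f:
        # De-duplicate: the same file is often reported many times.
        seen = {line.rstrip("\n") for line in f if line.strip()}
    # Drop paths that no longer exist, which would otherwise make the
    # backup fail with an lstat error.
    existing = sorted(p for p in seen if os.path.lexists(p))
    with open(batch_path, "w") as f:
        for p in existing:
            f.write(p + "\n")
    os.remove(rotated)
    return len(existing)
```

The file written to `batch_path` is what would be passed to `--files-from`. A file can still disappear between this check and the backup itself, so the filter narrows the race rather than removing it.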

I could not get restic to apply my exclude file using --exclude-file when used together with --files-from so for this to work you’d have to parse the logfile manually and remove stuff you don’t want. Actually the exclude file seems to work. It’s just that it still lists it in the snapshot (because it’s part of the backup command I guess), but the actual path is empty when I try to restore.

So it seems like a possible but maybe slightly hacky solution. Not sure it will scale to many files. Also, your snapshot list will look crazy, since each snapshot lists all the individual files backed up in it.

askielboe · July 25, 2018, 1:12pm

It’s definitely fast though. I’m a bit paranoid, so I back up my home folder once per hour. Normally this takes about 50 minutes for a full scan, plus upload. Using the fswatch hack above cut that down to 9 seconds for an hour’s worth of changes.

I know it’s not surprising given what’s happening. Probably just an indication that my current naive restic setup is not ideal.

The only thing preventing me from using it at this point is the horrible snapshot log I get… (Using restic snapshots --compact helps with this)

Edit: I should note that I have millions of data files that I back up. Some of them change often while others never change - so I guess I should only back up the static ones once in a while.

askielboe · July 25, 2018, 1:57pm

On second thought this won’t work well because there’s no meaningful parent snapshot to use. So deduplication won’t work and finding a backed up version of a file will require you to look through all snapshots - not just the latest.