When I work on a document or a project, a day represents a lot of work. So the question is what the optimum schedule is… I’d be okay with an hourly backup (15 minutes is what CrashPlan provided), but along the continuum between a day and instant, the shorter the window, the more significant the extra work of going through all the files becomes, and the more helpful the inotify scenario becomes.
On (uncommon) occasions I’ve found myself making some change by mistake that I wanted to undo, and I went back in time an hour or two to get a previous version. It’s not an absolute must; I rarely f’ up that badly.
When it comes to backups there is no ‘one-size-fits-all’ approach. Everything depends on your data sets, workflow and storage capacity.
How much data do you want to back up? Do you have data sets that change more often than others? Does deduplication work well on your data sets? What is your planned retention period?
If your data is small (e.g. a few GB), making backups every 15 minutes is no problem and shouldn’t take more than a minute. If you have lots of data and only a subset of it changes often, then you should probably split the data into different data sets and adjust the time span between backups and the retention period to your needs.
I have around 70GB I want to back up on this computer. The stuff that changes often is much smaller - around 1/3 of that.
If I’m smart about it, I can probably split that and run multiple jobs for the different types of data (for example, my documents folder, project folder and browser profile update very often).
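If splitting does turn out to be worthwhile, it can be as simple as two cron entries. This is only a sketch: the paths, tags and schedule are assumptions, and it presumes `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE` (restic’s standard environment variables) are set in the crontab environment:

```
# Illustrative crontab: frequent backups for fast-changing data,
# one daily full-home backup for everything else (paths/tags assumed)
*/15 * * * *  restic backup --quiet --tag hot  $HOME/Documents $HOME/projects
0 3 * * *     restic backup --quiet --tag cold $HOME
```

The hot job stays small and fast, while the nightly cold job still catches anything outside the hot paths.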
But to be honest, while I can script the heck out of anything, sometimes I want a drop-in solution. I want backup to magically happen and stay out of the way, because I don’t want to manage it. It’s just another thing I can f’ up, and the implications are serious. So I’d rather have a solution I point at my home directory and let loose (aside from some really necessary excludes) than one I have to customise not just once, but every time I add a project or otherwise change the way I work.
That’s a relatively small amount of data and backups probably won’t take long. Just give it a try.
I absolutely understand that you don’t want to waste too much time on this. I didn’t want to say that different data sets are necessary, I simply wanted to suggest it as an option. Having 1 backup job to handle all the data is perfectly fine. After all only you know which data you have and how to handle it best.
In the next release of restic (currently on master), using the --quiet argument will skip the initial scan, so maybe give that a try when it’s released. It will still have to work through all your files of course, but since the initial scan is only used for progress estimation, it’s not needed when running in quiet mode.
That is exactly what I was asking - if I can integrate it with inotify somehow. It’s not just the ability to select files to backup - it’s also whether the high number of snapshots would impact performance over say years of operation.
Also, having snapshots of individual files will make it really difficult to forget snapshots, since the current forget mechanism relies on the frequency of snapshots. If I back up /home/user every hour, after 7 days I can forget all but one snapshot per day, for example. If I back up individual files, I get one snapshot containing only /home/user/file1, another snapshot containing only /home/user/file2, and so on. So I not only have to back up individual files, I have to somehow create a new snapshot that contains all the files of the previous snapshot of the same root directory (I’m guessing restic uses pointers to existing blobs), except for the files that were changed.
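For the regular once-per-root case, that frequency-based expiry maps directly onto restic’s real forget flags. A minimal sketch; the retention numbers are assumptions, and the `RESTIC` variable exists only so the command can be inspected without a real repository:

```shell
# Keep-by-frequency retention for one backup root, as described above.
RESTIC="${RESTIC:-restic}"      # set RESTIC=echo to dry-run

prune_old_snapshots() {
  # keep 24 hourly, 7 daily and 4 weekly snapshots; prune the rest
  "$RESTIC" forget --keep-hourly 24 --keep-daily 7 --keep-weekly 4 --prune
}
```

With per-file snapshots, no such policy applies cleanly, which is the problem described above.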
Now that I had to write it down to explain it I realise that this will require code changes… Is there a way to suggest new features?
I haven’t tried, but isn’t there restic find for this sort of thing, to find a file in a recent snapshot? Because if that worked, and if having lots of little snapshots wasn’t a bad thing, you could use find to locate the file(s) you want and then restore from the snapshot it’s in.
There is. But if you want to restore a folder where one file changed later, you have to know to restore the folder first, then the file. It can be a mess.
I was about to, but being a good boy I first searched for similar issues and found one, from a few months ago:
The author pretty much follows through the same rationale that I did, complete with the interim and final conclusions in this post, like he read my mind before I even thought about it. Kudos @alphapapa. Looks like it was accepted as a feature request by @fd0.
My performance problem with restic is not the actual backing up of files, or dedup, or the (lack of) compression, but the fact that it spends 99% of its time pointlessly scanning hundreds of thousands of files just to notice they haven’t changed since an hour ago. That’s a lot of wasted effort and I/O that continuous solutions like CrashPlan and Carbonite avoid.
restic supports backing up a list of specific files (--files-from). So my thought was to have an inotify wrapper/daemon that would build a list of modified files for 15–60 minutes, then dispatch a restic backup with the accumulated file path list. Then once every 12–24 hours it would run restic with a full scan to ensure nothing is missed (inotify and NTFS change events are not guaranteed; they use kernel memory and simply get dropped if not picked up in time).
A continuous wrapper would reduce hourly restic backup times from ~30 minutes to ~3 minutes, and greatly reduce the hourly I/O operations on the filesystem.
Has this been done already? Anyone seen a generic inotify wrapper that could do this or be adapted to do this?
I guess a shell script with inotifywait could do it for local filesystems.
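A minimal sketch of that shell approach, assuming inotify-tools is installed. The batch file location, the event list and the flush interval are all assumptions, and the `RESTIC` variable is only there so the restic call can be stubbed out:

```shell
#!/bin/sh
# Accumulate changed paths from inotifywait, then hand the deduplicated
# batch to `restic backup --files-from`. Sketch only; not battle-tested.
BATCH_FILE="${BATCH_FILE:-/tmp/restic-batch.list}"
RESTIC="${RESTIC:-restic}"      # set RESTIC=echo to dry-run

# Background watcher: one changed path per line, appended to the batch.
start_watcher() {
  inotifywait -m -r -e close_write,create,moved_to \
    --format '%w%f' "$1" >> "$BATCH_FILE" &
}

# Deduplicate the batch, back it up, then empty it for the next round.
flush_batch() {
  sort -u "$BATCH_FILE" -o "$BATCH_FILE"
  if [ -s "$BATCH_FILE" ]; then
    "$RESTIC" backup --quiet --files-from "$BATCH_FILE"
  fi
  : > "$BATCH_FILE"
}

# Usage (illustrative): watch $HOME and flush every 15 minutes. A daily
# full `restic backup $HOME` should still run, since inotify events can
# be dropped.
#   start_watcher "$HOME"
#   while sleep 900; do flush_batch; done
```

The periodic full-scan run is what makes dropped events survivable, so it shouldn’t be skipped.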
Just tried this as my home folder backups are very slow even for a few changed files.
Using fswatch on macOS, I simply piped the changed file paths to a logfile and used that as input to --files-from when backing up with restic.
Seems to work fine. Just make sure you’re not piping to the logfile while restic is running: restic only seems to check for file existence initially, and it will fail (crash with lstat /path/to/file: no such file or directory) if fswatch later adds files that don’t exist when restic tries to back them up. I just rotated the log and ran restic on the previous logfile.
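A sketch of that rotate-then-backup step, assuming fswatch is appending one path per line to the logfile. The filenames are assumptions, the copy-truncate rotation has a small race window, and `RESTIC` is only there for stubbing:

```shell
#!/bin/sh
# The shell redirection behind fswatch keeps an open append handle on
# the log, so copy-truncate it rather than mv-ing it away, then back up
# only the rotated portion.
LOG="${LOG:-/tmp/fswatch-changes.log}"
RESTIC="${RESTIC:-restic}"      # set RESTIC=echo to dry-run

# Watcher (run once, in the background): one changed path per line.
#   fswatch -r "$HOME" >> "$LOG" &

backup_changes() {
  cp "$LOG" "$LOG.prev" && : > "$LOG"   # rotate; small race window here
  sort -u "$LOG.prev" -o "$LOG.prev"    # drop duplicate paths
  if [ -s "$LOG.prev" ]; then
    "$RESTIC" backup --quiet --files-from "$LOG.prev"
  fi
}
```

Because restic only ever reads the closed-out `.prev` file, it never races with fswatch writing new entries.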
At first I could not get restic to apply my exclude file (--exclude-file) together with --files-from, so I thought you’d have to parse the logfile manually and remove stuff you don’t want. Actually, the exclude file does seem to work: the excluded path is still listed in the snapshot (because it’s part of the backup command, I guess), but it’s empty when I try to restore it.
So it seems like a possible, but maybe a bit hacky, solution. Not sure it will scale to many files. Also your snapshots will look crazy, listing all the individual files backed up in each snapshot.
It’s definitely fast though. I’m a bit paranoid, so I back up my home folder once per hour. Normally this takes about 50 minutes for a full scan, plus upload. Using the fswatch hack above cut that down to 9 seconds for an hour’s worth of changes.
I know it’s not surprising given what’s happening. Probably just an indication that my current naive restic setup is not ideal.
The only thing preventing me from using it at this point is the horrible snapshot log I get… (Using restic snapshots --compact helps with this)
Edit: I should note that I have millions of data files to back up. Some of them change often while others never change, so I guess I should only back up the static ones once in a while.
On second thought, this won’t work well because there’s no meaningful parent snapshot to use. So deduplication won’t work, and finding a backed-up version of a file will require looking through all snapshots, not just the latest.