I am trying to understand how Restic handles files that users change during a backup. What are the effects, e.g. when backing up large mailboxes while new mails come in or users move mails around? Is this a problem?
Yes, it is a problem. Restic does not handle it, and you can end up with an inconsistent backup.
It is the user's responsibility to ensure that files do not change during the backup. The easiest way is to back up filesystem snapshots instead of the live filesystem: VSS on Windows, APFS snapshots on macOS, or Btrfs/ZFS snapshots on other systems.
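To make the snapshot idea concrete, here is a minimal sketch for ZFS only. The function name, the snapshot name and the example dataset are made up for illustration, and it assumes the restic repository and password are already configured through the environment (e.g. RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE).

```shell
# Sketch: back up a ZFS snapshot instead of the live filesystem.
# backup_from_zfs_snapshot and the snapshot name are hypothetical choices.
backup_from_zfs_snapshot() {
    dataset=$1           # e.g. tank/mail
    mountpoint=$2        # e.g. /tank/mail
    snap=restic-backup   # fixed name, so the backed-up path is the same on every run

    # Remove a stale snapshot left over from an interrupted run, if any.
    zfs destroy "$dataset@$snap" 2>/dev/null || true

    zfs snapshot "$dataset@$snap" || return 1

    # ZFS exposes snapshots read-only under <mountpoint>/.zfs/snapshot/<name>.
    status=0
    restic backup "$mountpoint/.zfs/snapshot/$snap" || status=$?

    # Drop the snapshot whether or not the backup succeeded.
    zfs destroy "$dataset@$snap"
    return "$status"
}
```

It would be called as `backup_from_zfs_snapshot tank/mail /tank/mail`. The fixed snapshot name is a deliberate choice: as far as I know restic picks the previous snapshot to compare against by host and paths, so a path that changes on every run would likely make it re-read everything.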
You can find multiple examples of how to achieve this on the Internet, or even on this forum, e.g.:
Restic is not a full backup solution that takes care of every aspect of the backup process. It is a fantastic program, but it requires a bit of extra effort to implement some bits and pieces, like filesystem snapshots. Given that it is a cross-platform tool, it would be rather difficult to cover the quirks of every possible OS and filesystem.
To clarify: this is not a restic-specific issue. Any software that reads files that change while they’re being read is susceptible to the same issue, and the solution is snapshots or similar. This is a generic issue and solution.
On a related note: snapshots might not be enough. You should consider things like databases and other software, and whether they write their data atomically or not. Sometimes you will find that it’s a good idea to do a database dump, so that you have a clean export of the database in case you need to restore it. Again, this is generic, not restic-specific.
Indeed, my mistake. I did not make it clear enough that this is not a restic-specific challenge.
Any database, VM, or frequently changed files might require a special approach, regardless of the backup software used.
For example, to back up consistent database files, the database state on disk needs to be “frozen”. The freeze can be very brief if filesystem snapshots are used, or much longer if the whole backup has to run against the “frozen” files.
For such cases it is sometimes beneficial to use specialized backup software that is aware of the behaviour of the files it is backing up.
As an example, I use restic to back up running VMs. It is all done by a shell script which first suspends the running VM to disk, takes a filesystem snapshot, and starts the VM again (VM downtime is less than 1 min). It then uses restic to back up the filesystem snapshot, which can take as much time as needed because the VM is running again the whole time.
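The order of the steps matters more than the exact commands, so here is a rough sketch of that logic. The four hook functions are hypothetical placeholders, not the actual script described above; their bodies have to be replaced with whatever your hypervisor and filesystem need.

```shell
# Hypothetical hooks: replace the bodies with real commands for your setup.
# None of these names come from restic.
vm_suspend()    { echo "suspend VM (placeholder)"; }
vm_resume()     { echo "resume VM (placeholder)"; }
snapshot_take() { echo "take filesystem snapshot (placeholder)"; }
snapshot_drop() { echo "remove filesystem snapshot (placeholder)"; }

backup_vm() {
    snapshot_path=$1   # where the snapshot is visible, e.g. a read-only mount

    vm_suspend || return 1

    # Take the snapshot while the VM is suspended, then resume the VM whether
    # or not the snapshot succeeded, so the VM is never left stopped.
    snap_status=0
    snapshot_take || snap_status=$?
    vm_resume || return 1
    [ "$snap_status" -eq 0 ] || return 1

    # The VM is running again; this step can take as long as it needs.
    backup_status=0
    restic backup "$snapshot_path" || backup_status=$?

    snapshot_drop
    return "$backup_status"
}
```

The only downtime is between `vm_suspend` and `vm_resume`; the restic run happens entirely against the snapshot.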
Thanks for your answer! Making a snapshot and then backing up from the snapshot sounds good, but I am not able to make snapshots on my system.
So my next question is how significant these backup inconsistencies really are. Example: when Restic runs a backup from 8:00 h to 8:45 h, it scans through all the files intended for backup during those 45 minutes. Suppose a new email arrives at 8:30 h, right in the middle of the backup process, and Restic has already scanned the mailbox folder. In this case, the new email won’t be included in the backup, even though the backup concludes at 8:45 h. But the new email is not lost, because it will be included in the next backup run, scheduled for the following day.
So, what’s the inconsistency here? It seems to me that we can’t definitively claim that the backup precisely represents the state of our data at 8:00 h. Those 45 minutes are a window in time during which it’s uncertain whether new, changed, or deleted files will make their way into the backup.
Would you agree? This would be something I can live with.
(I am not talking about backing up live databases here. I will back up a database dump only.)
I am afraid there is no clear answer to your question. What you are planning to do sounds like playing the lottery: will your backup be of any use when restored? It depends on your email program. Will it just miss one email, or will it welcome you with a nice message that the email database is corrupted and that you should restore it from the latest good backup… which, with some luck, you might have from previous runs (but only maybe)?
If your email program is your concern, you should ask its creators or community what the best approach is.
Another easy option is to just shut it down for the backup run. You can use your OS’s scripting to automate this and run it at the least disruptive time.
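As a rough sketch of that automation, assuming a Linux system where the mail software runs as a systemd service (the function name and the service name in the usage line are made up for illustration):

```shell
# Sketch: stop a service, back up its data, start it again.
# Assumes systemd; backup_with_service_stopped is a hypothetical name.
backup_with_service_stopped() {
    service=$1
    shift               # the remaining arguments are the paths to back up

    systemctl stop "$service" || return 1

    status=0
    restic backup "$@" || status=$?

    # Start the service again even if the backup failed.
    systemctl start "$service" || return 1
    return "$status"
}
```

For example, `backup_with_service_stopped dovecot /var/mail` could be run from cron or a systemd timer at night. The service stays down for the whole backup, so this only makes sense when the backup run is short.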
I agree that this is hardly an actual practical problem for you.
Are you sure that your (non-initial) backups really take 45 minutes to run/complete? You must have tons of files or an extremely slow connection or a system that has very slow I/O or something. My backup, which is of around 100 GB and a lot of files, takes just one minute to run.
Please know that restic uses the previous backup snapshot for the same data set (the paths you back up) as a reference and only re-reads the files that are indicated to actually have changed since that last snapshot. See Backing up — restic 0.16.0 documentation for more information.
@kapitainsky yes, it’s a lottery, but it’s the same lottery as for the time between the last backup and a crash. If I have a daily backup run and my system crashes 12 hours after the backup run, all my files that were created or changed since the last run are lost. I could run the backup every (half) hour to minimize the risk, but there will always be that risk. I am not talking about an email “database” but about a folder and file structure which holds the emails.
@rawtaz I have 500 GB of mailboxes on a server. I currently use incremental backups with Duplicity with full backups every month. Full backups take some hours, incrementals some minutes only. I want to change to Restic (or any other dedup solution) to avoid the time consuming full backups in the future.
If you want to keep your email server always running and make a consistent backup, a very good approach would be to export all users’ mailboxes to a temporary location and use restic to back that up to some safe place.
As an example, on FreeBSD/Linux running a mail server with Sendmail and Dovecot, you can use a tool called doveadm and the scripted logic below:
get a list of all mail users
loop through the list of mail users, running doveadm backup -u $mail_user maildir:"/path/to/backup/$mail_user" for each user
add any extra files to your backup, e.g. the mail server config: anything you might need to restore the server
use restic to back up the whole mailbox dump; now it does not matter whether the backup takes 1 min or 10 h
No server downtime is required to have 100% valid data.
I guess this method should work with any email server - details and commands/tools needed will differ but logic remains the same.
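The doveadm logic above could be sketched roughly like this. It assumes that `doveadm user '*'` can list your users (which depends on the userdb supporting iteration), and the function name and staging directory are made up for illustration.

```shell
# Sketch: export every mailbox with doveadm, then back up the export.
# dump_and_backup_mailboxes and the staging directory are hypothetical.
dump_and_backup_mailboxes() {
    dump_dir=$1   # temporary location for the exported mailboxes

    # One user name per line; skip empty lines.
    doveadm user '*' | while IFS= read -r mail_user; do
        [ -n "$mail_user" ] || continue
        # </dev/null keeps doveadm from reading the user list on stdin.
        doveadm backup -u "$mail_user" "maildir:$dump_dir/$mail_user" </dev/null ||
            echo "export failed for $mail_user" >&2
    done

    # The export no longer changes, so the backup can take as long as it needs.
    restic backup "$dump_dir"
}
```

A failed export only prints a warning here; a stricter script might prefer to abort instead of backing up an incomplete dump.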
I’m not sure what the point of your post is considering it should be obvious to everyone that we are discussing backing up and restoring.
If it takes five minutes instead of one hour to run the backup, then clearly this removes the majority of the OP’s concerns, simply because of a smaller time window for potential inconsistencies. They still need to use a snapshot or similar to back up their system properly, but a smaller time window is better than a long one.
I’ve simply excluded files that are open and can cause problems, i.e. cache files, especially browser caches.
I’m pretty sure a smarter person than me would get the list of open files first, e.g. with lsof on Linux, and create an exclude list on the fly.
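A minimal sketch of that idea, assuming lsof is installed; the function name is made up. It relies on lsof’s field output (`-F n`), where file names are the lines that start with `n`, and on restic’s `--exclude-file` option.

```shell
# Sketch: exclude files that are open right now from the backup.
# backup_excluding_open_files is a hypothetical name.
backup_excluding_open_files() {
    src=$1
    exclude_file=$(mktemp) || return 1

    # -Fn prints one field per line; keep only the "n" (name) lines.
    # +D limits the listing to open files below $src (slow on large trees).
    lsof -Fn +D "$src" 2>/dev/null | sed -n 's/^n//p' | sort -u > "$exclude_file"

    status=0
    restic backup --exclude-file "$exclude_file" "$src" || status=$?

    rm -f "$exclude_file"
    return "$status"
}
```

Two caveats: restic treats the lines in the exclude file as patterns, so file names containing characters like `*` or `[` may not match literally, and a file can still be opened after the list was made. This narrows the problem but does not give a consistent backup.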
The really hard part is how to back up, for example, a database. In that case I’m using mysqldump, and I back up the dump file and not the database files themselves.
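Something along these lines, as a sketch only: the function name and dump path are made up, mysqldump is assumed to find its credentials on its own (e.g. from ~/.my.cnf), and `--single-transaction` only gives a consistent dump for transactional tables such as InnoDB.

```shell
# Sketch: dump all databases to a file, then back up the dump file.
# backup_mysql_dump is a hypothetical name.
backup_mysql_dump() {
    dump_file=$1

    # Write to a temporary name first so a failed dump never replaces a good one.
    mysqldump --single-transaction --all-databases > "$dump_file.tmp" || {
        rm -f "$dump_file.tmp"
        return 1
    }
    mv "$dump_file.tmp" "$dump_file" || return 1

    restic backup "$dump_file"
}
```

It would be called as `backup_mysql_dump /var/backups/all-databases.sql`. restic can also read the dump directly from a pipe with `--stdin` and `--stdin-filename`, which avoids the intermediate file.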