Hello everyone,
Posting my experience, as requested: after many months and a lot of trouble, I’ve finally managed to back up our local NAS server to our business GSuite Google Drive account using restic.
The source data resides in 3 ZFS pools totaling ~62 million files and ~26 TiB, and each pool is first replicated from the main server to a dedicated backup server (via zfs send | recv) and then backed up from this backup server to the cloud using restic backup for each individual pool (so we have a separate snapshot for each pool), all in a single restic repository.
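As a rough illustration of the per-pool layout described above (one restic backup run per pool, all pointed at the same repository), here is a minimal Python sketch that only builds the command lines. The repository string and the pool mount points are placeholders I made up; the post does not give the real ones, nor how the repository is reached.

```python
import shlex

# Hypothetical values: the post names neither the repository nor the pools.
REPO = "rclone:gdrive:restic-repo"
POOLS = ["/pool1", "/pool2", "/pool3"]


def backup_command(repo, pool_path):
    """Return the argv for one `restic backup` run covering a single pool."""
    return ["restic", "-r", repo, "backup", pool_path]


def backup_commands(repo, pool_paths):
    """Build one command per pool, all targeting the same repository."""
    return [backup_command(repo, path) for path in pool_paths]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for argv in backup_commands(REPO, POOLS):
        print(shlex.join(argv))
```

Running each pool as its own command is what yields a separate snapshot per pool, while sharing one repository lets deduplication work across all three.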
It took almost nine months to get all the data into the cloud: at first because we had no dedicated backup server with enough RAM to run restic, then because of insufficient internet bandwidth, and in the last couple of weeks because restic backup kept locking up intermittently. This happened a few times and, unfortunately, even with a lot of time and effort and a great deal of help from @fd0 (BTW, thanks for everything, @fd0!), we so far haven’t managed to even diagnose, much less fix, the root cause of those lock-ups. The workaround was to kill the restic process and start over every time, fortunately without having to upload all of the data again thanks to restic’s deduplication.
Now we run restic backup daily to update these backups, and it takes almost a full day (anywhere from 19 to 22h) for the three pools. We are trying to optimize this, but with no success so far. All we can say is that neither CPU, memory, local disk bandwidth nor internet bandwidth is holding restic back; the issue seems to be internal to restic itself. We are continuing to try to resolve this, but in the short term we worked around it by moving the backup of the largest pool to Friday nights only, so it has the whole weekend to complete, and during the week the other two pools have more than enough time to finish.
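The scheduling workaround can be sketched as a small selection function. This is only an illustration: the pool names are hypothetical, and I am assuming the two smaller pools keep running every day while the largest one is added on Fridays, which the post does not spell out.

```python
from datetime import date

# Hypothetical pool names; the post does not name the pools.
LARGEST_POOL = "pool-large"
OTHER_POOLS = ["pool-a", "pool-b"]

FRIDAY = 4  # date.weekday() counts from Monday = 0, so Friday is 4


def pools_for(day):
    """Return the pools whose backup should start on the given date.

    Assumption: the smaller pools run daily; the largest runs on Fridays only.
    """
    pools = list(OTHER_POOLS)
    if day.weekday() == FRIDAY:
        pools.append(LARGEST_POOL)
    return pools


if __name__ == "__main__":
    print(pools_for(date.today()))
```

A wrapper started nightly from cron or a systemd timer could call pools_for(date.today()) and launch one backup per returned pool.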
The jury is still out on recovery. I finished my first restore today, after some more trouble, covering a very small subset of the whole backup (fewer than 185K files using up 72GiB), but I found that two files had their content corrupted (in one of them, 13 bytes have apparently changed values at random, and in the other 8 bytes were zeroed out). Perhaps this was not restic’s fault, as I neglected to take a snapshot of the restored files right after restic finished, and the restore went to a network-shared directory where conceivably something else could have messed with them. I will try to repeat the restore in the near future, in a more controlled setting this time, so as to be able to pinpoint exactly where the corruption was introduced.
EDIT: the corruption seems to be real: I repeated the restore to a local protected directory and the exact same corruption is still showing; we’re trying to track down its cause, but so far it doesn’t look like hardware (the machine has ECC RAM and the source comes from ZFS). Very worrying… :-/
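For anyone wanting to locate this kind of damage, a byte-by-byte comparison of the source file against the restored copy shows where and how many bytes differ. The following is a minimal Python sketch, not the procedure actually used here; the function name and chunk size are my own choices.

```python
def differing_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return (offset, length) for each run of consecutive differing bytes.

    Only the length both files have in common is compared; a size
    mismatch has to be checked separately by the caller.
    """
    ranges = []
    run_start = None  # offset where the current differing run began
    offset = 0        # file offset of the chunk being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            n = min(len(a), len(b))
            if n == 0:
                break
            if a[:n] == b[:n]:
                # Identical chunk: close any run that ended at its boundary.
                if run_start is not None:
                    ranges.append((run_start, offset - run_start))
                    run_start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if run_start is None:
                            run_start = offset + i
                    elif run_start is not None:
                        ranges.append((run_start, offset + i - run_start))
                        run_start = None
            offset += n
            if len(a) != len(b):
                break  # one file ended before the other
    if run_start is not None:
        ranges.append((run_start, offset - run_start))
    return ranges
```

Calling differing_ranges(original, restored) on a file with one damaged region returns a single (offset, length) pair, which makes it easy to check whether the damage lines up with a block or chunk boundary.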
Overall, I’m very happy with restic: it has a great user and developer community, and even with the troubles I went through, it has enabled me to back up a volume of data to the cloud that would otherwise have been impossible.
Cheers,
– Durval.