One-minute backup took 8.5 hours (0.9.1)

I am unsure what to make of the backup log below.

First, here is a normal run of this backup. It is a mysqldump | restic backup --stdin type of backup to B2. Normally it takes about a minute and exits with a zero (success) exit code.

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Added:      1008.368 KiB

processed 1 files, 0 B in 1:15
snapshot 6f7c717d saved
2018-07-24: Back up of MySQL database ... completed successfully
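
For context, the pipeline has roughly this shape (a sketch, not my exact script; the repository, bucket, and database names here are placeholders):

set -o pipefail
mysqldump mydb | gzip | restic -r b2:mybucket:myrepo backup --stdin --stdin-filename database.sql.gz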

But I have one run of this backup that lasted 8.5 hours! In that time it said it added just 100KB of data, which is probably an incomplete file. It exited with a non-zero exit code (error).

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Added:      105.428 KiB

processed 1 files, 0 B in 8:27:26
snapshot 88eb4a9d saved
2018-07-24: Back up of MySQL database [...] failed

What to make of this? It said 'snapshot saved'; does that mean it worked? Or just that it drew a line under that failed snapshot? If it is a failed snapshot, or a snapshot with errors, should there be a warning message?

There is no error message. But the extremely long run-time and the non-zero exit suggest something went wrong. Why is there no error message to go with the error code, like a time-out or closed-pipe error message?

Why do both the successful and failed backups say they 'processed 1 file, 0 B', while also saying they added 100KB-1000KB? That seems contradictory, or does 'processed' mean something else?

Attempting to complete this backup for 8+ hours is excessive in this case. Is there, or should there be, a timeout setting or signal to get restic to bail cleanly? Or do I need to just kill the process and hope for the best?

Hm, interesting. I've got no idea what's going on. Sometimes B2 is strange; all error handling is implemented in the client, which basically needs to retry requests until one of them eventually succeeds.

When it says "snapshot […] saved", this means restic saved a snapshot successfully. I don't have any idea why the overall process exited with a non-zero exit code. Maybe removing the lock from the repo failed? Can you check that the snapshot 88eb4a9d is in the repo?

How large is the database? How large is the file in the repo? You could try checking with restic cat which blobs the SQL dump file consists of…
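
Something along these lines (a sketch; the repository path is a placeholder, and the blob ID is something you'd read from the previous output):

$ restic -r b2:mybucket:myrepo ls --long 88eb4a9d      # file name and size inside the snapshot
$ restic -r b2:mybucket:myrepo cat snapshot 88eb4a9d   # snapshot JSON, including the root tree ID
$ restic -r b2:mybucket:myrepo cat blob <blob-id>      # dump an individual blob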

No idea, sorry. What is the exit code exactly?

ā€œprocessedā€ means ā€œrestic looked at the fileā€. In case of a backup read from stdin, itā€™s always exactly one file.

For most backends (except B2) there's a final timeout for saving a file to the repository, but not with B2, since the client needs to retry. You can use the program timeout --signal INT to limit restic's runtime and have it exit cleanly as if ^C was pressed, so the lock file is removed. Even if a backup process is aborted forcefully, that'll only leave a bit of additional data in the repo, which is cleaned up on the next run of prune.
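
For example (a sketch; the one-hour limit and the repository and database names are placeholders):

set -o pipefail
mysqldump mydb | gzip \
  | timeout --signal INT 1h restic -r b2:mybucket:myrepo backup --stdin --stdin-filename database.sql.gz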

So, overall the following reasons for the 8h runtime come to my mind:

  • B2 was very very very slow. Not really realistic, because apparently not much data is added, so there's not much to upload.
  • When reading data from stdin, restic needs to chunk and hash all data every time it reads it. That may have been slow when CPU resources were scarce. Also not so realistic; usually this runs in a minute…
  • Maybe the database had a lot to do, and mysqldump delivered the data slowly (see the pv sketch below this list).
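
If you want to see which end of the pipe stalls, one option is to put pv (pipe viewer) between the stages to watch live throughput. A sketch, assuming pv is installed and using placeholder names:

mysqldump mydb | pv | gzip | restic -r b2:mybucket:myrepo backup --stdin --stdin-filename database.sql.gz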

Can you reproduce this?


Thanks @fd0.

I worked out the non-zero exit code could have come from mysqldump and not restic, as I was using set -o pipefail. So I would guess this was actually a successful run from the perspective of restic, and that mysqldump got stuck for 8 hours and then disconnected or similar, and restic ended up backing up a partial file.
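
For anyone following along, with pipefail the pipeline's exit status becomes that of the last command to exit non-zero, so a mysqldump failure surfaces even when restic itself exits zero. A minimal illustration:

set -o pipefail
false | cat   # cat succeeds...
echo $?       # ...but this prints 1, the exit status of 'false'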

The database is about 7MB, which compresses to 1MB and normally backs up to B2 in one minute. It is not normally a busy database. I included a normal successful run log in my earlier message. This 8-hour backup has been a one-off so far, out of several hundred runs, so not common at all. I can fully imagine the mysqldump could have failed.

Yes, snapshot 88eb4a9d is there with one file in it. I am trying to restore that file, but that is not going well. I was able to build the local cache for the repository in ~1 hour, but the attempt to restore that snapshot has been running for a couple of hours now with no result.

Because there are a lot of individual database backups, I have 3,668 snapshots in B2. When I run restic snapshots (with no cache) it takes 40 minutes to get the list of snapshots and racks up ~7,550 Class B and ~250 Class C API transactions. It built a 7MB local cache with ~4,500 files. I think the cache being so many small files really hurts the time to build the local cache against cloud storage (40 minutes and ~8,000 API transactions to download 7MB of data is pretty out there!). But I guess at least one file per snapshot is inevitable.

3668 snapshots

real    41m59.728s
user    0m6.219s
sys     0m7.922s

Running it a second time took only 16 seconds and about 30 Class B and 30 Class C transactions, again just showing how vital the local cache is.

With the local cache in place, I tried to restore that one-file snapshot (restic restore -t . -vvv 88eb4a9d), but that is taking a looong time. Initially it grew the local cache to 22MB and ~9,500 files (~10,000 Class B API requests and ~700 Class C API requests), but then it sat there for two hours not appearing to do much.

And the restore just finished. Three hours and 11 minutes to restore one 100KB file from B2 is longer than I was expecting!

It did successfully restore the 100KB sql.gz file and it is actually a complete and intact backup of this small 1MB database. So no idea why the backup got stuck for 8 hours, but I blame the mysqldump end of the pipe rather than restic.

$ time restic restore -t . -vvv 88eb4a9d
repository 88b2b1b2 opened successfully, password is correct
found 1 old cache directories in /home/aaron/.cache/restic, pass --cleanup-cache to remove them
restoring <Snapshot 88eb4a9d of [/root/database-20180723-2203.sql.gz] at 2018-07-23 22:03:08.736518641 +1200 +1200 by @myhost> to .
ignoring error for /database-20180723-2203.sql.gz: Lchown: lchown /home/aaron/foo/database-20180723-2203.sql.gz: operation not permitted
There were 1 errors

real    190m52.735s
user    0m7.906s
sys     0m8.625s

Running restic snapshots after another 150 revisions had been added took ~2 minutes to update the local cache with the new info. No great number of B2 API transactions either.

3939 snapshots

real    1m44.780s
user    0m1.281s
sys     0m1.688s

And now that I've done one restore, restoring another 100KB file like the one that took three hours before takes only 8 seconds. Same for most other single-file snapshots.

$ time restic restore -t . -vvv d568ad93
repository 88b2b1b2 opened successfully, password is correct
found 1 old cache directories in /home/aaron/.cache/restic, pass --cleanup-cache to remove them
restoring <Snapshot d568ad93 of [/root/database-20180726-1803.sql.gz] at 2018-07-26 18:03:02.322112885 +1200 +1200 by @myhost> to .
ignoring error for /database-20180726-1803.sql.gz: Lchown: lchown /home/aaron/foo/database-20180726-1803.sql.gz: operation not permitted
There were 1 errors

real    0m8.132s
user    0m0.953s
sys     0m0.859s

It all just underscores how vital having the local cache data is, and also how expensive it is to rebuild a local cache from a remote repository. The expense is not the data size, nor really the latency; it is the sheer number of very small file transactions that are needed.

If you are restoring in a different place from where your backups run, you should really copy the local cache from the backup server across to the restore server. Just copying that e.g. 20MB of data could take 2-3 hours and tens of thousands of API calls off your restore times.
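
A sketch of that copy (restic's default cache lives under ~/.cache/restic on Linux; the host name is a placeholder):

$ rsync -a ~/.cache/restic/ restore-host:~/.cache/restic/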

Given the large ratio, would it work for restic clients with a local cache to archive that cache to a single file and store it in the repository every now and then? Then restic clients without a local cache could first download and unpack that single file to bootstrap their local cache, and then bring it up to date with the usual process. Even if the repository-stored cache archives were many days out of date, it would still shave a couple of hours off restoring (or backing up?) files from a new location.

@fd0 the lchown errors when restoring elsewhere are a tad annoying, especially if restoring many files. I can't find an option to not try to restore permissions; is there a secret -o option that will do that?

When we get around to changing the repo format, we'll probably allow snapshot files to be packed (just like normal data files); see [architecture] Performance of `restic snapshots` with high-latency remote · Issue #523 · restic/restic · GitHub

In addition to many small files being downloaded from a high-latency remote (B2), the process is also sequential rather than concurrent. This will be addressed in new optimized multithreaded restore implementation by ifedorenko · Pull Request #1719 · restic/restic · GitHub

Yes, that would be possible, but at the expense of more complexity on the client. Complexity is something I'd like to avoid as much as possible. The local cache is really dumb; it's just a mirror of the files in the repo, nothing more. restic checks the cache first, and if the file it looks for is not there, it'll pull it from the server…
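
In shell terms the lookup is roughly this dumb (a sketch only, with hypothetical paths and a hypothetical fetch helper, not restic's actual code):

# mirror-style cache: use the local copy if present, otherwise download it first
cached="$HOME/.cache/restic/$repo_id/$type/$id"
if [ ! -f "$cached" ]; then
    fetch_from_backend "$type/$id" > "$cached"   # hypothetical helper that pulls from the repo
fi
cat "$cached"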

No, unfortunately not yet. There's an issue about it: lchown operation not permitted · Issue #1766 · restic/restic · GitHub; the idea is to copy what cp and rsync do: ignore chown errors if not running as root and the error is EPERM.

Thanks for the pointers @fd0, I've subscribed to those issues. In the meantime I might be able to use a wrapper to pack the cache and upload/download it to B2 as a single file to bootstrap restores. This is especially important when restoring to a replacement environment, which probably won't have the cache from the dead/failed environment.
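
Something like this sketch, assuming the b2 command-line tool (bucket and file names are placeholders):

# pack the local cache and park it next to the repo
tar czf restic-cache.tar.gz -C ~/.cache restic
b2 upload-file mybucket restic-cache.tar.gz restic-cache.tar.gz

# on a fresh restore host, bootstrap the cache before the first restic command
b2 download-file-by-name mybucket restic-cache.tar.gz restic-cache.tar.gz
tar xzf restic-cache.tar.gz -C ~/.cache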

Regarding permissions: when you are restoring files, it will often not be to the original environment, which is quite possibly now dead and gone. The file permissions (especially the groups and users) in the backup may make no sense in the new environment. As such, it is a pretty common restore option in backup software to restore files without trying to apply the original permissions. If you are restoring as root there is no easy way to avoid this sometimes-unwanted behavior, and an option to disable it would be welcome. Although I appreciate that right now we can simulate this result by deliberately restoring as a non-root user and then just ignoring all the errors it causes.

You mention copying rsync and cp behavior. The default behavior for rsync is not to recreate permissions; you have to enable that behavior with --perms or with --archive. Likewise, the cp command also does not copy permissions by default; you have to enable that behavior with --preserve or --archive. So there is no need to ignore the errors, since you have the option with both tools not to create the errors in the first place :)
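
For illustration, assuming GNU rsync and coreutils cp:

$ rsync -r src/ dest/                     # no --perms/--archive: source permissions are not forced onto the destination
$ cp -R --no-preserve=mode src/ dest/     # GNU cp: explicitly drop the source modes as well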