Viewing snapshots ++ is very slow on GDrive

denNorske · July 30, 2020, 11:05am

Hello community! I am new here, cool to be around.

I have recently sat up my own gsuite and wanted to utilize the benefit of unlimited google drive storage. I am super happy i bumped into restic + rclone to help me take snapshots of my files, and upload them encrypted to google drive. I am also super happy that you that develop and maintain the project, ended up to not discouraging people from using google drive.

I am aware that Google does have some rate limits on their uploads. 750 GB / day, max 2 concurrent uploads etc. For me, it’s just fine.

But, I notice some really slow processing when I try to list existing snapshots in my repo. I have around 1.7 TB with files backed up, and restic -r [repo] snapshots takes 25-30 minutes to load for me.

Are there any ways I can speed up the listing? I also notice that making new snapshots takes a lot of time. Does it scan through all files once before it starts the process of backing up? Can I configure something to make it go a bit quicker?

Thanks for your help.

rawtaz · July 30, 2020, 11:27am

Hello and welcome!

Which restic version are you using?

Can you please try the latest master build and see if it makes any difference?

Also make sure to use the latest rclone when you try the latest restic build.

Finally, how many snapshots do you have in the repository?

denNorske · July 30, 2020, 12:09pm

Hi Rawtaz!

Thank you for the reply -

root@big-cookie:~# rclone version
rclone v1.52.2
- os/arch: linux/amd64
- go version: go1.14.4
root@big-cookie:~# restic version
restic 0.9.6 compiled with go1.13.4 on linux/amd64
root@big-cookie:~#

I have tried the latest restic build as well with indifferent results.

I do have 5 snapshots at the moment.

rawtaz · July 30, 2020, 12:51pm

I haven’t checked the code in particular, but listing snapshots shouldn’t take this long IMO. But I never used GDrive so I have no point of reference. Perhaps others that do can chime in and share their experience?

Have you checked what CPU, disk and network activity there is during the time that it runs? Especially the restic and rclone processes of course.

What’s your internet connection?

When restic runs a backup, it scans the set of files it was asked to back up. If you previously backed up the exact same set of files (the list of files/directories you give on the command line to the backup command) then restic should use the previous snapshot for those files as the “parent snapshot”. Please check your output during backup for something like “using parent snapshot XXX” to verify. If it finds a parent snapshot, it will compare metadata in order to know which files has changed and thereby needs scanning to know which parts of them to upload.

If on the other hand you give restic a different list of files/directories to back up, it will not find a matching parent snapshot, can not compare metadata and will then have to scan through all the files to know which parts of them to upload.

Regardless, restic will only upload what has changed.

denNorske · July 30, 2020, 3:47pm

Hi again!

I have checked the network, disk and cpu activity after launching the snapshots command, and figured out that the load average increased by 0.15-0.20, there is a very slight bit of disk requests, and there’s data being downloaded and sent at a total of approx 15-40 kb/s

I am on a 1000/1000 mbps line

EDIT:

I notice the API is being used a lot. Shouldn’t snapshots be indexed somehow instead of scanning through all files on the remote?

den

MichaelEischer · August 1, 2020, 4:13pm

Each snapshot currently is represented by a small file containing the snapshot metadata. restic snapshots should just list and read these files in the snapshots folder of the backup repository. In addition to that restic also needs to read the repository config file and find the key containing the actual encryption key. So I’d assume that to be around 10 API requests (for 5 snapshots). restic also caches the snapshot files locally (unless disabled) so a later run should require even fewer requests.

Slow runs of the snapshots command also lead to slow startup of backup runs as these have to look for a snapshot which can serve as starting point for the backup run.

Can you run restic list keys and restic list locks to see how many files these folders contain? The usual would be one key file and one lock file (that is one line of output).

denNorske · August 1, 2020, 5:23pm

Thank you for your reply Michael!

I ran the commands as requested, and they both took 20 minutes each to complete.
You were right, i had 1 key and 1 lock.

Does it matter if i have a long passkey or not? It’s currently 131 characters long.

den

MichaelEischer · August 2, 2020, 8:30am

The length of the passkey shouldn’t influence the overall duration of the commands by more than a few nanoseconds . restic always transforms passkeys into a 256 bit key (in a process that takes around half(?) a second) which is then used to decrypt the keyfile which contains the actual encryption key. From that point on the passkey is no longer actively used.

You could try to set the environment variable GODEBUG="http2debug=2" which (hopefully) shows what’s going on. Note that this log will probably contain lots of sensitive information like access tokens or parts of the (encrypted) repository data. The only lines which are relevant for debugging this would be those which contain on of the following words: “:status”, “:method” or “:path”.

If that doesn’t work, we’ll need a debug build of restic to see whats going on. (Or do you have another way to inspect the API calls issued by restic?)

denNorske · August 2, 2020, 10:20am

Hi again!

I added the env var and ran the command for a little while. I did not let it complete, but you can clearly see that there’s a bunch of locks-requests.

I have attached the gist here, and i attempted to remove any sensitive information.

gist.github.com

https://gist.github.com/dennorske/e8e2a5e04c9b289e776fbf46da89fd7c

Restic + rclone log

rclone: 2020/08/02 12:03:54 http2: server connection from stdio on 0xc0005ec460
rclone: 2020/08/02 12:03:54 http2: server: client stdio said hello
rclone: 2020/08/02 12:03:54 http2: server read frame SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
rclone: 2020/08/02 12:03:54 http2: server processing setting [ENABLE_PUSH = 0]
rclone: 2020/08/02 12:03:54 http2: server processing setting [INITIAL_WINDOW_SIZE = 4194304]
rclone: 2020/08/02 12:03:54 http2: server processing setting [MAX_HEADER_LIST_SIZE = 10485760]
rclone: 2020/08/02 12:03:54 http2: server read frame WINDOW_UPDATE len=4 (conn) incr=1073741824
rclone: 2020/08/02 12:03:54 http2: decoded hpack field header field ":authority" = "localhost"
rclone: 2020/08/02 12:03:54 http2: decoded hpack field header field ":method" = "GET"
rclone: 2020/08/02 12:03:54 http2: decoded hpack field header field ":path" = "/file-5577006791947779410"

This file has been truncated. show original

MichaelEischer · August 2, 2020, 12:15pm

The log shows that restic spends a long time with reading hundreds of lock files. I’m kinda surprised that restic list locks did not list those. Can you run restic list --no-lock locks again and check how many lock file names it reports (the --no-lock option should speed up the command a lot)?

You should be able to remove these old lock files with restic unlock. If that doesn’t help then try restic unlock --remove-all, but make sure that no other restic instance is running in parallel.

denNorske · August 2, 2020, 2:59pm

There’s like 3 pages of lock files now when i tried, and it was almost instant with the --no-lock flag.

May I understand what the locks do? I guess they are present to prevent from conflicting instances if you run multiple instances. And then the question, why aren’t they removed?
I intend to only run one instance with my rclone, every 12 hours for 1 machine.

Can i run --no-lock on my backup and snapshot commands as well without problems? Are there any other dangers doing this?

MichaelEischer · August 2, 2020, 3:15pm

The --no-lock option has no effect for commands that modify the repository. I’d also advise against using that flag unless there’s a good reason why it is not possible to use locks, like a readonly backend.

The lock files ensure that concurrent operations on a repository don’t interfere with each other. These serve primarily as a mean to prevent conflicts between backup runs which add data to the repository and prune/rebuild-index which remove data. Without these lock files the conflicting operations could severely damage the repository.

There are two main reasons why lock files are not removed: Either restic is terminated before it has shutdown properly or the backend temporarily failed to delete a lock file. The latter case currently triggers a bug where restic creates a new lock file every five minutes but fails to cleanup old lock files (see https://github.com/restic/restic/issues/2715). If you’re hit by that specific problem, then the backup command will print lots of complaints about failed lock file deletions.

My recommendation would be to use unlock to cleanup the accumulated lock files manually and then keep an eye on whether the problem resurfaces.

Edit:
Does your usage of snapshots read the whole output of restic before processing it further? Or could it maybe read only half of the output and then stop reading it? In that case restic would probably fail to remove the lock file.

denNorske · August 2, 2020, 4:17pm

Thank you again Michael, this is very nice information to know.

I’ll keep an eye on the locks - and it may sound like the bug is hitting me - as I’ve been using it for just 4-5 days at the creation of this topic. That means around 8-10 runs only. However, my cron logs does not show any errors as i can see.

What do you mean by the last edit of yours? It reads my snapshots correctly with “snapshots” now, it just took a lot of time without the --no-lock

jedi453 · August 2, 2020, 7:46pm

Hi denNorske and welcome to the forums!

Did you use a custom API key when you setup rclone to connect to Google Drive? If you just did the basic rclone setup, you’ll be left with a connection that is so slow it’s borderline unusable. This is not really a fault of restic or rclone, it’s just that Google has refused (as of yet) to upgrade the throughput of ncw’s keys which are used by default to access Google Drive.

The rclone docs discuss this here: https://rclone.org/drive/#making-your-own-client-id

If you already did this, make sure restic isn’t set to use an rclone remote without the custom API key. (I’ve made that mistake myself before.)

Good luck,
jedi453

P.S. I love the avatar! (Doo-bee-doo-bee-doo-op!)

denNorske · August 2, 2020, 9:27pm

Hi Jedi! Thanks for that! Perry is my fav

I am pretty sure it uses the correct API key as I have all the stats for my api usage popping up on the console project

I am afraid that isn’t the issue. I think it is the lock(s) issue that causes the delays, but thanks for the good point!

MichaelEischer · August 3, 2020, 6:27pm

The particular bug which I have in mind, will always produces lots of warning messages when it’s triggered. So if you’ve don’t see the warnings, then it’s probably something else (although I’m not sure what that would be). Does calling restic snapshots (without additional parameters) now just take a few seconds as it’s supposed to?

I thought about some shell command like restic snapshots | head -n2 (well, that example is rather useless) which will only read a part of the output of restic. That could restic to exit rather abruptly and prevent the lock from being deleted.

denNorske · August 4, 2020, 11:12am

I gotcha!
No, there was no abrupt stop in the listing, but just very slow.

Now, my backups take 5-10 mins to complete, and my snapshots are almost instantly listed.

I will keep an eye on the locks and logs for the errors/warnings in the future, I’ll consider this solved for now.

Many thanks to everyone involved!
den

cdhowie · August 4, 2020, 4:12pm

I’m not sure that’s the case. We’d need to look at the code. My understanding is that the snapshot list is fully downloaded before any output happens, because the list of snapshots displayed is sorted. In that case, restic could stringify the list of snapshots in advance, delete the lock, then dump the stringified list.

And, of course, if restic gracefully handles SIGPIPE then it doesn’t matter either way.

MichaelEischer · August 4, 2020, 8:04pm

The repository is usually unlocked by a deferred call to unlock. And therefore the lock file is only removed after printing the snapshots.

Regarding the sigpipe handling, there’s even an issue for that

github.com/restic/restic

Handle SIGPIPE gracefully

opened 05:42PM - 03 Dec 17 UTC

FiloSottile

type: bug

The SIGPIPE issue in #1457 will occur for all backends, it's just that some are.….. less reliable than others ;) The problem is that registering a SIGPIPE handler removes the Go runtime logic that only crashes on SIGPIPE on fd 0/1/2, and not on SIGPIPE caused by i.e. TCP connection resets. https://github.com/golang/go/blob/78074f6850c34a955d69f578e363d1d3f3e00e25/src/os/signal/doc.go#L91-L110 This was reverted in 2579fe6b7bd55f341f333f4d2dab7f3a539652a5, but for a solution that also fixes #1413, you can check the file descriptor like the runtime does, and ignore for > 2. Also, I think the fix in 2579fe6b7bd55f341f333f4d2dab7f3a539652a5 should have its own CHANGELOG line and maybe even a minor version release, as I basically can't finish a backup to B2 without a TCP reset. (Some decent background reading is also https://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly)