Hello there,
I am attempting to restore a single .sql file, approximately 16 GB in size, from a bucket hosted on Wasabi S3. I first tried restoring with restic's standard S3 backend and am now testing the rclone backend. I have previously restored files without any issues, but this restore now takes ages: I left it running over the last few days and it seems to have only restored around 8 GB of it.
I'm running CentOS 7 with the latest build of restic, restic 0.9.6 compiled with go1.13.4 on linux/amd64 (built from source). When I restore via rclone, the logs indicate the restore is running at around 10-25 kBytes/s. I also prune at the end of every backup.
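For reference, the restore is invoked roughly like this (a sketch only; the rclone remote, bucket and paths are placeholders for my real values):
```
# Restore the latest snapshot through restic's rclone backend.
# "wasabi" is a configured rclone remote; bucket and paths are examples.
export RESTIC_PASSWORD='...'
restic -r rclone:wasabi:my-bucket restore latest \
    --target /var/restore \
    --include /path/to/dump.sql \
    --verbose
```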
Any tips as to what may be causing this dip in performance?
After further lurking on the forums for a solution, I noticed a comment mentioning that SMR HDDs can hurt performance; however, that doesn't seem to be the case for me. I decided it would be worth a shot to restore on a machine with an SSD instead, and got the following results:
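(The DEBUG/INFO lines below come from the rclone subprocess; one way to get this level of logging is to pass extra verbosity flags through restic's rclone backend options, roughly like this:)
```
# Override the arguments restic passes to its rclone subprocess and add -vv,
# which makes rclone log each GET request plus periodic transfer stats.
restic -r rclone:wasabi:my-bucket \
    -o rclone.args="serve restic --stdio -vv" \
    restore latest --target /mnt/ssd/restore
```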
```
2020/09/15 14:39:36 DEBUG : S3 bucket example.com: GET /snapshots/xxxxx
2020/09/15 14:39:37 DEBUG : S3 bucket example.com: GET /data/xxxxx1
2020/09/15 14:39:42 INFO :
Transferred: 31.591M / 31.591 MBytes, 100%, 2.564 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9918 / 9918, 100%
Elapsed time: 12.3s
2020/09/15 14:40:42 INFO :
Transferred: 31.591M / 31.591 MBytes, 100%, 2.564 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9918 / 9918, 100%
Elapsed time: 12.3s
2020/09/15 14:41:42 INFO :
Transferred: 31.591M / 31.591 MBytes, 100%, 2.564 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9918 / 9918, 100%
Elapsed time: 12.3s
2020/09/15 14:41:52 DEBUG : S3 bucket example.com: GET /data/xxxxx2
2020/09/15 14:42:42 INFO :
Transferred: 36.258M / 36.258 MBytes, 100%, 2.387 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9919 / 9919, 100%
Elapsed time: 15.1s
2020/09/15 14:43:42 INFO :
Transferred: 36.258M / 36.258 MBytes, 100%, 2.387 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9919 / 9919, 100%
Elapsed time: 15.1s
2020/09/15 14:44:36 DEBUG : S3 bucket example.com: GET /data/xxxxx3
2020/09/15 14:44:42 INFO :
Transferred: 40.625M / 40.625 MBytes, 100%, 2.367 MBytes/s, ETA 0s
Errors: 0
Checks: 0 / 0, -
Transferred: 9920 / 9920, 100%
Elapsed time: 17.1s
```
This seems to have slightly improved performance, but I have a gigabit connection, so I was expecting much better. Running 'top' shows restic working hard in the background as well, with a core at 100% CPU usage. Has anyone had a similar sort of issue? My bucket contains over 40k snapshots.
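(For what it's worth, one way to count the snapshots, assuming jq is installed:)
```
# Count snapshots in the repository (same placeholder remote/bucket as above)
restic -r rclone:wasabi:my-bucket snapshots --json | jq length
```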
Maybe you’re affected by the restore perf bug?
The GitHub issue (opened 08:52 AM UTC on 13 Sep 2020, closed 01:13 PM UTC on 13 Sep 2020) is quoted below:
Output of `restic version`
--------------------------
restic 0.9.6 compiled with go1.13.4 on linux/amd64
How did you run restic exactly?
-------------------------------
For the backup:
```
export AZURE_ACCOUNT_KEY=…
export AZURE_ACCOUNT_NAME=itbackup
export RESTIC_REPOSITORY="azure:pmdb-restic:/"
export RESTIC_PASSWORD="…"
restic backup \
--verbose \
/mnt/snap /etc /root
```
For the restore:
```
restic restore -vvv latest --target /srv/mysql
```
What backend/server/service did you use to store the repository?
----------------------------------------------------------------
Azure's S3 and Google's S3
Expected behavior
-----------------
The latest snapshot is restored in a timely manner.
Actual behavior
---------------
This specific repository restored extremely slowly.
Instead of getting 60 MB/s to disk it varies between 1 MB/s and 10 MB/s.
I can observe that the number of restored files grows very slowly, between zero and three files per second, even though the files are only a few MB in size.
Steps to reproduce the behavior
-------------------------------
Unfortunately the only way to reproduce the issue is using a backup of a MySQL data directory that I am not allowed to disclose.
Do you have any idea what may have caused this?
-----------------------------------------------
I have spent countless hours trying to nail down the issue. Among those:
* try different virtual machines with different amounts of CPU cores (up to 8) and RAM (up to 32 GB)
* try HDD and SSD
* copy the S3 repository over from Azure S3 to Google S3 and try the restore there
* measure disk performance using ```iostat -d 10 /dev/sdc```
* measure network, CPU and disk performance using ```nmon```
* monitor the network usage using ```iftop```
* check the CPU usage using ```htop```
I tried restoring another snapshot in the same environments that was taken from a PostgreSQL data directory. That snapshot restores quickly.
After many systematic tests I am sure that it happens with this specific Restic repository.
The S3 container at Azure is located in the same data center as other repositories that do not show performance issues.
I also tried reading the snapshot without writing to disk using ```restic check --read-data```. But I found it hard to properly see the issue here. I mainly see that Restic is writing to /root/.cache/restic with 40-50 MB/s. That's why I suspected a slow disk at first but even on an SSD the restore speed goes down to 0-2 MB/s.
Next I compared the 'stats' for two repositories:
```
Good and fast PostgreSQL data directory snapshot:
Total File Count: 111612
Total Size: 123.104 GiB
Very slow MySQL data directory snapshot:
Total File Count: 11608
Total Size: 300.373 GiB
```
(This is not related to MySQL or PostgreSQL of course. It's just for me to distinguish the snapshots.)
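(For reference, these figures come from restic's ```stats``` command; a sketch of the invocation, using its default restore-size mode:)
```
# Report the total file count and total restore size of the latest snapshot
restic stats latest
```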
Also noticeable is that the restore performance starts well but gradually degrades. This is not a scientific measurement but a "reproducible impression". This is the network speed for downstream traffic from S3:
* 600 Mbps at the start
* 500 Mbps at 10 minutes later
* 400 Mbps at 15 minutes
* 250 Mbps at 20 minutes
* 100 Mbps at 30 minutes
* 60 Mbps at 60 minutes
This is the output of ```iostat -d 10 /dev/sdc``` (disk performance in 10-second averages):
```
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdc 6.67 0.19 2370.07 7865 100743840
sdc 0.00 0.00 0.00 0 0
sdc 0.20 0.00 14.80 0 148
sdc 348.55 2.40 1854.15 24 18560
sdc 148.10 0.00 800.40 0 8004
sdc 0.40 0.00 28.80 0 288
sdc 61.10 0.00 26938.80 0 269388
sdc 96.50 0.00 45624.80 0 456248
sdc 123.70 0.00 58082.00 0 580820
sdc 46.90 0.00 21996.40 0 219964
sdc 40.00 0.00 19978.40 0 199784
sdc 76.20 0.00 37994.00 0 379940
sdc 4.00 0.00 342.80 0 3428
sdc 135.46 0.00 67148.85 0 672160
sdc 36.80 0.00 18074.40 0 180744
sdc 139.10 10.40 16331.60 104 163316
sdc 133.00 0.00 47563.60 0 475636
sdc 28.70 0.00 13872.40 0 138724
sdc 50.80 0.00 23439.20 0 234392
sdc 54.50 0.00 27154.00 0 271540
sdc 49.10 0.00 23638.40 0 236384
sdc 21.20 0.00 10422.00 0 104220
sdc 29.70 0.00 14709.20 0 147092
sdc 51.90 0.00 25535.20 0 255352
sdc 15.70 0.00 7685.60 0 76856
sdc 24.30 0.00 11848.40 0 118484
sdc 45.40 0.00 22531.20 0 225312
sdc 12.40 0.00 4949.60 0 49496
sdc 25.70 0.00 12697.20 0 126972
sdc 22.00 0.00 10816.80 0 108168
sdc 13.50 0.00 6642.80 0 66428
sdc 43.10 0.00 20210.00 0 202100
sdc 15.00 0.00 6974.40 0 69744
sdc 2.60 0.00 990.21 0 9912
sdc 52.10 0.00 24994.00 0 249940
sdc 8.90 0.00 4153.20 0 41532
sdc 1.90 0.00 496.80 0 4968
sdc 53.90 0.00 25932.40 0 259324
sdc 11.40 0.00 5422.00 0 54220
sdc 31.37 0.00 15681.52 0 156972
sdc 21.90 0.00 10575.60 0 105756
sdc 38.40 0.00 18154.00 0 181540
sdc 33.00 0.00 16350.00 0 163500
sdc 28.70 0.00 13724.40 0 137244
sdc 39.20 0.00 18215.20 0 182152
sdc 19.18 0.00 9082.12 0 90912
sdc 0.40 0.00 13.20 0 132
sdc 34.20 0.00 16855.60 0 168556
sdc 8.70 0.00 3072.80 0 30728
sdc 9.00 0.00 4035.20 0 40352
sdc 16.40 0.00 7517.20 0 75172
sdc 12.70 0.00 4382.80 0 43828
sdc 14.50 0.00 6881.20 0 68812
sdc 20.90 0.00 9936.40 0 99364
sdc 4.60 0.00 1943.20 0 19432
sdc 10.40 0.00 3683.60 0 36836
sdc 18.90 0.00 8140.00 0 81400
sdc 8.29 0.40 3619.98 4 36236
sdc 11.40 0.00 4186.40 0 41864
sdc 9.10 0.00 4058.40 0 40584
sdc 14.10 0.40 6185.20 4 61852
sdc 7.20 0.00 2853.60 0 28536
sdc 27.90 0.40 11121.60 4 111216
sdc 55.40 0.80 26985.20 8 269852
sdc 0.70 0.00 96.80 0 968
sdc 23.20 0.40 11268.80 4 112688
sdc 44.80 0.80 21332.00 8 213320
sdc 20.40 0.00 9988.40 0 99884
sdc 11.50 0.40 5428.00 4 54280
sdc 79.70 1.60 38252.00 16 382520
sdc 10.50 0.00 4990.40 0 49904
sdc 8.40 0.40 3934.80 4 39348
sdc 20.60 0.00 9941.60 0 99416
sdc 64.10 1.20 30054.40 12 300544
sdc 17.50 0.00 8505.60 0 85056
sdc 2.50 0.00 722.40 0 7224
sdc 55.00 1.20 25952.00 12 259520
sdc 13.00 0.00 5307.20 0 53072
sdc 0.80 1.20 35.20 12 352
sdc 0.40 0.00 10.00 0 100
sdc 42.80 0.80 20206.80 8 202068
sdc 7.60 0.40 3092.40 4 30924
sdc 1.30 0.00 318.00 0 3180
sdc 34.60 0.40 15734.80 4 157348
sdc 6.90 0.40 3198.00 4 31980
sdc 3.40 0.00 1413.60 0 14136
sdc 37.80 0.80 18456.40 8 184564
sdc 5.10 0.00 706.00 0 7060
sdc 13.30 0.00 6031.20 0 60312
sdc 27.30 0.40 12977.60 4 129776
sdc 2.30 0.00 22.40 0 224
sdc 5.70 0.00 2522.80 0 25228
sdc 17.58 0.00 8431.17 0 84396
sdc 13.20 0.00 6356.40 0 63564
sdc 7.80 0.00 2941.20 0 29412
```
I think it is apparent that the disk write performance is fluctuating wildly. I stopped after 20 minutes, which gave me just 14 GB of restored data.
This machine currently has 8 CPU cores and two different SSDs – one for the restore and one for the root file system (where /root/.cache/restic lives). It looks like the root SSD is busier than the actual restore SSD. Neither CPU nor network nor disks seem to be used a lot.
As ```restic restore``` does not tell me anything about progress, I used ```df -i``` to count the inodes on the target partition. 10-20 minutes into the restore, the number of files barely increases:
```
root@db-restore-test:~# while true ; do sleep 1 ; df -i /srv/mysql/ ; done
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sdc 163840000 111962 163728038 1% /srv/mysql
/dev/sdc 163840000 111962 163728038 1% /srv/mysql
/dev/sdc 163840000 111962 163728038 1% /srv/mysql
/dev/sdc 163840000 111962 163728038 1% /srv/mysql
/dev/sdc 163840000 111963 163728037 1% /srv/mysql
/dev/sdc 163840000 111963 163728037 1% /srv/mysql
/dev/sdc 163840000 111964 163728036 1% /srv/mysql
/dev/sdc 163840000 111964 163728036 1% /srv/mysql
/dev/sdc 163840000 111966 163728034 1% /srv/mysql
/dev/sdc 163840000 111966 163728034 1% /srv/mysql
/dev/sdc 163840000 111966 163728034 1% /srv/mysql
/dev/sdc 163840000 111966 163728034 1% /srv/mysql
/dev/sdc 163840000 111967 163728033 1% /srv/mysql
/dev/sdc 163840000 111967 163728033 1% /srv/mysql
/dev/sdc 163840000 111967 163728033 1% /srv/mysql
/dev/sdc 163840000 111969 163728031 1% /srv/mysql
```
Next I ran ```strace``` on the restic process to get an idea what it is doing. I grepped for "open" and found that over a long period of time the same files are appended to. Example:
```
root@db-restore-test:~# strace -fp 7669 2>&1 | grep open | grep tar
[pid 7690] openat(AT_FDCWD, "/srv/mysql/root/mysql-server_5.7.22-1debian9_amd64.deb-bundle.tar", O_WRONLY|O_APPEND|O_CLOEXEC) = 38
[pid 7678] openat(AT_FDCWD, "/srv/mysql/root/mysql-server_5.7.22-1debian9_amd64.deb-bundle.tar", O_WRONLY|O_APPEND|O_CLOEXEC <unfinished ...>
[pid 7681] openat(AT_FDCWD, "/srv/mysql/root/mysql-server_5.7.22-1debian9_amd64.deb-bundle.tar", O_WRONLY|O_APPEND|O_CLOEXEC <unfinished ...>
[pid 7681] openat(AT_FDCWD, "/srv/mysql/root/mysql-server_5.7.22-1debian9_amd64.deb-bundle.tar", O_WRONLY|O_APPEND|O_CLOEXEC) = 38
[pid 7683] openat(AT_FDCWD, "/srv/mysql/root/mysql-server_5.7.22-1debian9_amd64.deb-bundle.tar", O_WRONLY|O_APPEND|O_CLOEXEC <unfinished ...>
[…]
```
Maybe I'm reading the strace output wrong. But it seems that Restic is juggling a lot of files in parallel and only writing very little to them. So I used ```lsof``` to see how many files it has open:
```
root@db-restore-test:~# lsof | grep /srv/mysql | wc -l
768
```
I know, this is a lot of data to digest. I tried to make my steps as clear as possible. If you want me to try anything else just let me know. Thanks in advance.
Do you have an idea how to solve the issue?
-------------------------------------------
No.
Did restic help you today? Did it make you happy in any way?
------------------------------------------------------------
Totally. Two friends and I are using Restic on a daily basis to back up our cheap personal VMs to Wasabi S3, and it has never let us down. We love Restic.
I'm not sure if it was related to this, as the speed of my restore was incredibly slow throughout the entire restore run, whereas this issue mentions that it progressively slows down. The fix for it seems to have already been merged into master, and I tried running a fresh build of master, which didn't seem to resolve the issue. However, while lurking yesterday I came across:
Pull request restic:master ← ifedorenko:out-of-order-restore-no-progress (opened 01:39 AM UTC, 3 Mar 2019)
And while this PR has already been merged as well, it gave me the idea to try the latest beta binary (just like the person in the issue you mentioned). I used https://beta.restic.net/restic-v0.9.6-377-gc34c7316/, which seemed to resolve the issue I was having.
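(Roughly what I did to switch to the beta binary; the exact archive name under that URL is from memory and may differ, so treat this as a sketch:)
```
# Download the beta build linked above, unpack it and check its version
# before using it for the restore (archive name may differ).
cd /tmp
wget https://beta.restic.net/restic-v0.9.6-377-gc34c7316/restic_v0.9.6-377-gc34c7316_linux_amd64.bz2
bunzip2 restic_v0.9.6-377-gc34c7316_linux_amd64.bz2
chmod +x restic_v0.9.6-377-gc34c7316_linux_amd64
./restic_v0.9.6-377-gc34c7316_linux_amd64 version
```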
Is there a public beta branch for restic, or is that only accessible to devs?
rawtaz — September 15, 2020, 9:38pm (#5)
The latest master builds that you already found contain the latest merged restic code, so isn't that pretty much what you're asking for here?
But do they also contain any changes that are present in the beta releases? I was slightly confused as I couldn't figure out which commits the version on the beta release related to. I went through the commit history of the master branch to see if I could find anything matching:
restic-v0.9.6-377-gc34c7316
but could not find anything, which led me to believe there might be a private beta repo that gets merged into master, with those versions relating directly to the beta branch.
Do you know by any chance how the versioning there works?
I also compiled the latest master branch along the way and tested my restore using that, which still seemed to operate slowly. In hindsight I should have deleted any previous version prior to testing, just in case it was still using the old binary. I'm going to try that and report back; it may have just been me being an idiot this whole time.
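(In case it helps anyone else avoid the same mistake, this is roughly how to build master and run the fresh binary by explicit path, so an older restic on $PATH can't sneak in; paths are just examples:)
```
# Build restic from the current master branch and invoke the freshly built
# binary explicitly, instead of whatever "restic" resolves to on $PATH.
git clone https://github.com/restic/restic.git
cd restic
go run build.go                  # produces ./restic in the repo root
./restic version                 # confirm this really is the new build
which restic && restic version   # compare with the binary on $PATH
```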
I figured out the versioning and can confirm I'm an idiot. It looks like my tests were using the older version of restic, not the fresh one I compiled.
rawtaz — September 16, 2020, 10:17am (#8)
It’s simply a direct build of the master branch. So everything that’s been merged will be in it.
If you grab the commit hash from the last part of the filename, but without the initial g (in your case the result would be c34c7316), you can look for that commit hash in the restic repository, e.g. https://github.com/restic/restic/commit/c34c7316.
I have no idea what that g is about, maybe @fd0 can tell us?