[SOLVED] Restic High Load > backup killed


#1

Since I upgraded to the latest restic, my backups have had a hard time completing. The load on the CentOS server becomes so high that the script is eventually killed:

Starting Restic File Backup at 2018-06-02_10-34-54
/etc/sensu/plugins/resticBackup.sh: line 5: 3185 Killed

The sensu monitoring service triggers the backup and checks that all went well. This worked wonderfully with 0.8 for months, but since 0.9 there has been trouble. Is there a way to limit the speed or the resources restic uses? My whole server becomes unreachable, which never happened with 0.8.


#2

On second thought, this also happens with 0.8.3 now that I’ve tried it. It’s probably because it tries to make a new full backup. The question remains: is there a way to limit resources?


#3

I’m astonished that it’s worse with 0.9.0, which should be much easier on the system resources.

The first step here is to find out what the limiting resource is. Why is restic killed? Have a look at the kernel log, e.g. by using dmesg. Is the out-of-memory killer terminating restic?

We’re aware that restic does use a lot of memory right now. You can tune the aggressiveness of the Go garbage collector by setting e.g. GOGC=20; this usually means a bit less memory usage.
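
A minimal sketch of that kernel-log check, assuming a Linux system where dmesg is readable (on some systems this needs root). The helper name oom_lines is made up for illustration; it only filters text, so it also works on a saved log file.

```shell
# Print kernel-log lines in which the OOM killer acts on a given process name.
# Reads log text on stdin; the process name defaults to "restic".
oom_lines() (
    name=${1:-restic}
    grep -i -E 'out of memory|killed process' | grep -F -- "($name)"
)

# Typical use:
#   dmesg | oom_lines restic
```

If this prints nothing, the kernel log holds no record of the OOM killer terminating that process, and the cause is more likely something else.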


#4

Well, I’ve downloaded 0.9.0-20 to try as well, but when I simply run it on the command line with -v, it stops updating the output at around 1050 files, 220 MB.

dmesg | grep restic
[91254.008339] [ 9363] 0 9363 21879 18364 47 0 0 restic
[91254.008399] Out of memory: Kill process 9363 (restic) score 70 or sacrifice child
[91254.009632] Killed process 9363 (restic) total-vm:87516kB, anon-rss:73456kB, file-rss:0kB, shmem-rss:0kB

Where do I define GOGC=20? In the environment file, as export GOGC=20?

I think throttling the speed would fix things. I don’t mind it being slow(er); it’s a night job that runs while I sleep. The server has SSD disks, so that might even speed up the process so much that the rest can’t keep up.


#5

Yes, exactly. You would type export GOGC=20 and then run restic.
If you’re interested, there is more info on that here: https://golang.org/pkg/runtime/
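
As a sketch of the same idea without exporting the variable for the whole shell: the wrapper below sets GOGC for a single restic invocation only. The function name restic_low_mem and the RESTIC_GOGC override are hypothetical names introduced here, and the sketch assumes the repository and password are supplied elsewhere (for example through the RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE environment variables).

```shell
# Hypothetical wrapper: run restic with a more aggressive Go garbage collector.
# The Go runtime reads GOGC at program start; lower values make it collect
# more often, trading CPU time for a smaller heap.
# RESTIC_GOGC is an illustrative override variable, not something restic reads.
restic_low_mem() {
    GOGC="${RESTIC_GOGC:-20}" restic "$@"
}

# Usage, with /FILES standing in for the directory to back up:
#   restic_low_mem backup /FILES
#   RESTIC_GOGC=5 restic_low_mem backup /FILES
```

Putting the assignment in front of the command keeps the setting local to restic, so other Go programs started from the same script are not affected.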


#6

Well, it did help somewhat, but now it only gives output about once a minute:
[3:31] 2100 files 106.633 MiB, total 15926 files 445.481 MiB, 0 errors

The lower the GOGC value, the slower it goes. Can I set it to, for example, 5?

Same issue, but 3 minutes later:
[92211.830228] Out of memory: Kill process 2759 (restic) score 50 or sacrifice child
[92211.831187] Killed process 2759 (restic) total-vm:66856kB, anon-rss:53000kB, file-rss:0kB, shmem-rss:0kB


#8

OK, success report:

I’ve set GOGC to 1, which means 1% according to the page you linked.
The server load is between 4 and 6, which to me is very acceptable. I want the server to stay responsive, and that’s the case with this load.

My backup result:
open repository
repository a21e0319 opened successfully, password is correct
lock repository
load index files
start scan on [/FILES]
start backup on [/FILES]
scan finished in 126.250s: 26752 files, 879.208 MiB
uploaded intermediate index 2c5b0fae
uploaded intermediate index fa060b26
uploaded intermediate index 6acaf52e

Files:       26752 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:  20458 new
Tree Blobs:      1 new
Added:      750.646 MiB

processed 26752 files, 879.208 MiB in 8:48
snapshot b03c6692 saved

Please note, I’m using a server with 1 GB of memory, no swap and just 1 core in my production environment. The reason for this is that I don’t want to scale up unless everything runs fine on this smaller machine.

Your GOGC setting did the trick; with an even lower value everything seems fine now. Maybe this is something others can make use of as well :)

Thank you for your help


#9

Unless I’m reading the numbers wrong, this means that restic was killed at a memory usage of well below 100MiB. You will very likely run into more problems with less than 512MiB of RAM: the scrypt key derivation function that restic uses to process the password will alone (temporarily) use at least 60MiB.
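
To make the arithmetic concrete, here is a small sketch that converts the anon-rss figure from a "Killed process" line into MiB. It assumes the kernel's "kB" values are multiples of 1024 bytes; the helper name anon_rss_mib is invented for illustration.

```shell
# Extract anon-rss (in kB) from "Killed process" kernel-log lines on stdin
# and print it in MiB, assuming 1 kB here means 1024 bytes.
anon_rss_mib() {
    sed -n 's/.*anon-rss:\([0-9][0-9]*\)kB.*/\1/p' |
        awk '{ printf "%.1f MiB\n", $1 / 1024 }'
}

# The line quoted earlier in this thread:
echo '[91254.009632] Killed process 9363 (restic) total-vm:87516kB, anon-rss:73456kB, file-rss:0kB, shmem-rss:0kB' | anon_rss_mib
# prints: 71.7 MiB
```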

Did you restrict restic on the server to such a low amount of memory? Or are there other processes taking the remaining memory of 1GiB?
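
One way to answer that question on a Linux host is to look at how much memory is actually available before the backup starts. The following is only a sketch: mem_summary is a made-up helper, and it assumes a /proc/meminfo that reports values in kB.

```shell
# Summarise total memory, available memory and swap from a meminfo-style file.
# Defaults to /proc/meminfo (Linux only); values are converted from kB to MiB.
mem_summary() {
    awk '/^(MemTotal|MemAvailable|SwapTotal):/ { printf "%s %d MiB\n", $1, $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Typical use, followed by the ten processes with the largest resident set
# (the --sort option assumes the procps version of ps):
#   mem_summary
#   ps -eo rss,comm --sort=-rss | head -n 10
```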


#10

I have not put a restriction on restic’s usage. It’s an active webserver, so yes, there are other processes taking up memory. I think the main problem was the load on the server: it went well over 22, which resulted in many processes (php, nginx) no longer working. My guess is that a flag to set the speed would be great for fine-tuning per server environment.


#11

I don’t understand where the load comes from. Besides, “load” is not a great indicator of what’s going on. With which version of restic was that exactly? Which type of “load” are we talking about (CPU, IO, memory)? If it’s CPU load, how about using nice? If it’s IO load, why not use ionice?

These are all honest questions. I’m trying to understand what happened here and how we can possibly improve the situation within restic (if it cannot be addressed with standard tools)…
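
For reference, a sketch of how nice and ionice can be combined around restic. The wrapper name restic_polite is hypothetical, and how much ionice helps depends on the I/O scheduler in use, so treat this as a starting point and not a guaranteed fix.

```shell
# Hypothetical wrapper: run restic at the lowest CPU priority (nice 19) and in
# the "idle" I/O scheduling class (ionice -c 3), so other processes go first.
restic_polite() {
    nice -n 19 ionice -c 3 restic "$@"
}

# Usage, with /FILES standing in for the directory to back up:
#   restic_polite backup /FILES
```

Neither tool limits memory, so this addresses CPU and disk contention only.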


#12

Please note that these problems arise when a backup runs for the first time, basically when there are a lot of new files to deal with.

It might have to do with the fact that I also have a glusterfs mount active that serves the directory I am backing up. Even though I access the local storage instead of the network mount, I think glusterfs still detects the activity and goes to work, although not as badly as when accessing the network mount itself. I’m not 100% sure this is the case, though.

The thing is, CPU usage is quite low during the backup, and memory usage is not the problem either (I think). It’s the load: there’s so much to do at once that everything gets queued up, the CPU can’t keep up, other programs die because of the load, and everything becomes unresponsive…

To my surprise, given my short Linux experience, high load is not the same as high CPU usage. I’m also in an OpenStack environment where the server is influenced by outside factors (neighbors and the like). This shouldn’t be the case, but in practice it is. Software probably behaves more as expected in a non-virtual environment.
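
On Linux, the load average counts processes in uninterruptible sleep (state D, typically waiting on disk or network I/O) as well as runnable ones, which is why load can be high while CPU usage stays low. A small sketch for spotting such processes; blocked_procs is an invented helper name.

```shell
# Keep only lines whose first field starts with D (uninterruptible sleep).
# Expects "STATE PID COMMAND" lines on stdin, as printed by ps below.
blocked_procs() {
    awk '$1 ~ /^D/'
}

# Typical use during a backup:
#   cat /proc/loadavg
#   ps -eo state,pid,comm | blocked_procs
```

If many processes show up here while the load is high, the bottleneck is more likely I/O than CPU.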

To be fair, the backups are running great again now that I’ve lowered GOGC, and I really don’t want to change anything. I will fall back to nice and ionice if I run into problems again.

I’m using 0.9.0.20 now, but to be honest, the restic version didn’t really matter. There was high load either way before the GOGC setting was changed.