Backups hang when repository is unlocked

Hi all,

First off, I just want to say thank you for such an amazing tool. Restic really is incredible and so great at what it does!

However, I have suddenly run into an issue where none of our repositories will back up. At first I thought it had something to do with locks, but now, even with a completely unlocked repository, I am still unable to complete a backup.

Below is a reproduction case, for our repositories at least.

First, here’s what the repository’s locks directory looks like at the start:

username@hostname:/path/to/restic/repos# ll 95eb045f8f/locks
total 0
drwx------ 2 www-data www-data 128 Oct 26 18:49 ./
drwx------ 7 www-data www-data 200 Jul 15 04:13 ../
-rw------- 1 www-data www-data   0 Oct 26 18:49 5f0347f5a1ea8ed5d1b8e58371d7cc61bf16f51125f828d40dcb9f5bad21e1ad

If I try to run a backup with the repository in this state, here’s the output from it:

Load(<lock/aa5ebb6c2c>, 0, 0) returned error, retrying after 582.280027ms: unexpected HTTP response (416): 416 Requested Range Not Satisfiable

If I try to use the unlock command (with the --remove-all flag), the same thing happens.

If I then manually remove the locks:

username@hostname:/path/to/restic/repos# rm 95eb045f8f/locks/*
username@hostname:/path/to/restic/repos# ll 95eb045f8f/locks
total 0
drwx------ 2 www-data www-data  48 Oct 26 18:52 ./
drwx------ 7 www-data www-data 200 Jul 15 04:13 ../

The backup generates the following in the logs (we are using rest-server), but then just hangs:

IP ADDRESS - - [26/Oct/2020:18:49:00 +0000] "HEAD /95eb045f8f/config HTTP/1.1" 200 0 "" "Go-http-client/1.1"
IP ADDRESS - - [26/Oct/2020:18:49:00 +0000] "GET /95eb045f8f/keys/ HTTP/1.1" 200 88 "" "Go-http-client/1.1"
IP ADDRESS - - [26/Oct/2020:18:49:00 +0000] "GET /95eb045f8f/keys/751d3042eb9cf36c8d10e810766e62e3d75a178025a4dcb925c4bd93e688b6a1 HTTP/1.1" 206 457 "" "Go-http-client/1.1"
IP ADDRESS - - [26/Oct/2020:18:49:01 +0000] "GET /95eb045f8f/config HTTP/1.1" 200 155 "" "Go-http-client/1.1"
IP ADDRESS - - [26/Oct/2020:18:49:01 +0000] "GET /95eb045f8f/locks/ HTTP/1.1" 200 4 "" "Go-http-client/1.1"

Any help as to why these commands hang, even with no locks present in the repository, would be much appreciated!

Which restic version and rest-server version are you running? Please try the latest releases if you haven't already (0.10.0 for both).

When you have a seemingly hung restic process, you can send it a SIGABRT signal, which should make it print a stack trace and exit. Can you do that and paste the output?
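For reference, this can be done from another terminal; a minimal sketch is below. The `sleep` is just a stand-in for the hung restic process (substitute the real PID, e.g. from `pgrep -f restic`). SIGABRT makes a Go binary print all goroutine stack traces to stderr before exiting:

```shell
sleep 60 &             # stand-in for the stuck restic process
pid=$!
kill -ABRT "$pid"      # same as kill -6; Go dumps goroutine stacks on SIGABRT
wait "$pid"
echo "exit status: $?" # 128 + 6 (SIGABRT) = 134, the usual signal-exit convention
```

The `exit status: 134` confirms the process was terminated by SIGABRT rather than exiting normally.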

Also, please edit your initial post (the little pen icon) to include the complete command you ran and the full output (redundant output excepted), including any environment variables you use (obfuscated if needed).

Thanks for the fast response. I’m unable to edit that post because it was created too long ago, but here’s a different reproduction case using a simpler command (snapshots) that also hangs.

Below is the behavior we are seeing.

First, here’s what the repository’s locks directory looks like at the start:

username@hostname:/path/to/restic/repos# ll 95eb045f8f/locks
total 0
drwx------ 2 www-data www-data 128 Oct 29 00:30 ./
drwx------ 7 www-data www-data 200 Jul 15 04:13 ../
-rw------- 1 www-data www-data   0 Oct 29 00:30 f1412effb701bfaf456b5db6f26cd2726ca5c8d15ecb9e569b75474dcfaea73d

I then run the following:

RESTIC_PASSWORD=password restic -r rest:https://95eb045f8f:userpassword@host:8000/95eb045f8f snapshots --no-cache

and I get the following:

repository 1493a4f0 opened successfully, password is correct
Load(<lock/f1412effb7>, 0, 0) returned error, retrying after 582.280027ms: unexpected HTTP response (416): 416 Requested Range Not Satisfiable

The last line is repeated a number of times until I kill the process.

I can unlock the repository using the following:

RESTIC_PASSWORD=password restic -r rest:https://95eb045f8f:userpassword@restic-01.next.tech:8000/95eb045f8f unlock --no-cache --remove-all

This successfully removes the locks, as shown by the output:

repository 1493a4f0 opened successfully, password is correct
successfully removed locks

And the locks directory:

username@hostname:/path/to/restic/repos# ll 95eb045f8f/locks
total 0
drwx------ 2 www-data www-data  48 Oct 29 00:37 ./
drwx------ 7 www-data www-data 200 Jul 15 04:13 ../

If I run the snapshots command again, the following is shown in the rest-server logs:

12.222.114.62 - - [29/Oct/2020:00:40:07 +0000] "HEAD /95eb045f8f/config HTTP/1.1" 200 0 "" "Go-http-client/1.1"
12.222.114.62 - - [29/Oct/2020:00:40:07 +0000] "GET /95eb045f8f/keys/ HTTP/1.1" 200 88 "" "Go-http-client/1.1"
12.222.114.62 - - [29/Oct/2020:00:40:07 +0000] "GET /95eb045f8f/keys/751d3042eb9cf36c8d10e810766e62e3d75a178025a4dcb925c4bd93e688b6a1 HTTP/1.1" 206 457 "" "Go-http-client/1.1"
12.222.114.62 - - [29/Oct/2020:00:40:07 +0000] "GET /95eb045f8f/config HTTP/1.1" 200 155 "" "Go-http-client/1.1"
12.222.114.62 - - [29/Oct/2020:00:40:07 +0000] "GET /95eb045f8f/locks/ HTTP/1.1" 200 4 "" "Go-http-client/1.1"

After those appear in the logs, the process hangs. If I then send a SIGABRT to it, I get the following:

SIGABRT: abort
PC=0x46ce81 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex(0x158c588, 0x80, 0x0, 0x0, 0x0, 0x46ca5c, 0x1d3d4c, 0x3498183b, 0x7fffafb16cb8, 0x40beff, ...)
  /usr/local/go/src/runtime/sys_linux_amd64.s:587 +0x21
runtime.futexsleep(0x158c588, 0x0, 0xffffffffffffffff)
  /usr/local/go/src/runtime/os_linux.go:45 +0x46
runtime.notesleep(0x158c588)
  /usr/local/go/src/runtime/lock_futex.go:159 +0x9f
runtime.stopm()
  /usr/local/go/src/runtime/proc.go:1910 +0xc5
runtime.findrunnable(0xc00005b800, 0x0)
  /usr/local/go/src/runtime/proc.go:2471 +0xa7f
runtime.schedule()
  /usr/local/go/src/runtime/proc.go:2669 +0x2d7
runtime.park_m(0xc000330180)
  /usr/local/go/src/runtime/proc.go:2837 +0x9d
runtime.mcall(0x0)
  /usr/local/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [select]:
net/http.(*persistConn).roundTrip(0xc0003a86c0, 0xc000212240, 0x0, 0x0, 0x0)
  /usr/local/go/src/net/http/transport.go:2565 +0x779
net/http.(*Transport).roundTrip(0xc0003c83c0, 0xc0004fa900, 0xc0005f0680, 0xc0005f0680, 0xe4d220)
  /usr/local/go/src/net/http/transport.go:582 +0xa65
net/http.(*Transport).RoundTrip(0xc0003c83c0, 0xc0004fa900, 0x1035000, 0xc0005f0680, 0xc00025e4b0)
  /usr/local/go/src/net/http/roundtrip.go:17 +0x35
github.com/restic/restic/internal/limiter.staticLimiter.roundTripper(0x0, 0x0, 0x10297c0, 0xc0003c83c0, 0xc0004fa900, 0x9dc407f53f29bf7d, 0xc0001a0cb8, 0x9d00000000450b07)
  /restic/internal/limiter/static_limiter.go:63 +0x68
github.com/restic/restic/internal/limiter.staticLimiter.Transport.func1(0xc0004fa900, 0xd82340, 0xc000636401, 0xc0002cac60)
  /restic/internal/limiter/static_limiter.go:78 +0x4f
github.com/restic/restic/internal/limiter.roundTripper.RoundTrip(0xc0004d2f00, 0xc0004fa900, 0xc0004d2f00, 0x0, 0x0)
  /restic/internal/limiter/static_limiter.go:52 +0x30
net/http.send(0xc0004fa800, 0x102a3e0, 0xc0004d2f00, 0x0, 0x0, 0x0, 0xc0000c6110, 0x0, 0x1, 0x0)
  /usr/local/go/src/net/http/client.go:252 +0x453
net/http.(*Client).send(0xc0004d2f30, 0xc0004fa800, 0x0, 0x0, 0x0, 0xc0000c6110, 0x0, 0x1, 0xc0001a0fd0)
  /usr/local/go/src/net/http/client.go:176 +0xff
net/http.(*Client).do(0xc0004d2f30, 0xc0004fa800, 0x0, 0x0, 0x0)
  /usr/local/go/src/net/http/client.go:718 +0x45f
net/http.(*Client).Do(...)
  /usr/local/go/src/net/http/client.go:586
golang.org/x/net/context/ctxhttp.Do(0x103ef00, 0xc0002121c0, 0xc0004d2f30, 0xc0004fa700, 0xc0002caa40, 0xea, 0x7ff5bc28dbf0)
  /home/build/go/pkg/mod/golang.org/x/net@v0.0.0-20200904194848-62affa334b73/context/ctxhttp/ctxhttp.go:27 +0x10f
github.com/restic/restic/internal/backend/rest.(*Backend).Save(0xc0004d2f90, 0x103ef00, 0xc0002121c0, 0xf06145, 0x4, 0xc000600480, 0x40, 0x103aa00, 0xc00025e470, 0x0, ...)
  /restic/internal/backend/rest/rest.go:131 +0x49a
github.com/restic/restic/internal/backend.(*RetryBackend).Save.func1(0xc0005f0620, 0x10317c0)
  /restic/internal/backend/backend_retry.go:56 +0xf9
github.com/cenkalti/backoff.RetryNotify(0xc0001a1498, 0x10317c0, 0xc0005f0620, 0xc0001a13f8, 0x0, 0x0)
  /home/build/go/pkg/mod/github.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37 +0x210
github.com/restic/restic/internal/backend.(*RetryBackend).retry(0xc0005e0100, 0x103ef40, 0xc0000c4028, 0xc000172460, 0x17, 0xc0001a1498, 0x17, 0x10)
  /restic/internal/backend/backend_retry.go:36 +0x185
github.com/restic/restic/internal/backend.(*RetryBackend).Save(0xc0005e0100, 0x103ef40, 0xc0000c4028, 0xf06145, 0x4, 0xc000600480, 0x40, 0x103aa00, 0xc00025e470, 0x90, ...)
  /restic/internal/backend/backend_retry.go:50 +0x1b7
github.com/restic/restic/internal/repository.(*Repository).SaveUnpacked(0xc000242000, 0x103ef40, 0xc0000c4028, 0xf06145, 0x4, 0xc0006362d0, 0x8b, 0x90, 0x430493f48789a783, 0x7ec0c7281dad1437, ...)
  /restic/internal/repository/repository.go:309 +0x36e
github.com/restic/restic/internal/repository.(*Repository).SaveJSONUnpacked(0xc000242000, 0x103ef40, 0xc0000c4028, 0xf06145, 0x4, 0xe71ea0, 0xc000242070, 0x0, 0x0, 0x0, ...)
  /restic/internal/repository/repository.go:293 +0x145
github.com/restic/restic/internal/restic.(*Lock).createLock(0xc000242070, 0x103ef40, 0xc0000c4028, 0x0, 0x0, 0x0, 0x0, 0xdb73e0, 0x30)
  /restic/internal/restic/lock.go:167 +0x9f
github.com/restic/restic/internal/restic.newLock(0x103ef40, 0xc0000c4028, 0x104e620, 0xc000242000, 0xc000424e00, 0x10, 0x10, 0xdb73e0)
  /restic/internal/restic/lock.go:106 +0x1c5
github.com/restic/restic/internal/restic.NewLock(0x103ef40, 0xc0000c4028, 0x104e620, 0xc000242000, 0xdb73e0, 0x158c440, 0x7ff5e31d7108)
  /restic/internal/restic/lock.go:67 +0x4e
main.lockRepository(0xc000242000, 0x0, 0x0, 0x0, 0x0)
  /restic/cmd/restic/lock.go:35 +0x6a
main.lockRepo(...)
  /restic/cmd/restic/lock.go:22
main.runSnapshots(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /restic/cmd/restic/cmd_snapshots.go:64 +0x985
main.glob..func24(0x157c660, 0xc0004d2e10, 0x0, 0x3, 0x0, 0x0)
  /restic/cmd/restic/cmd_snapshots.go:29 +0xa8
github.com/spf13/cobra.(*Command).execute(0x157c660, 0xc0004e42c0, 0x3, 0x4, 0x157c660, 0xc0004e42c0)
  /home/build/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0x157ce40, 0xc000424da0, 0x2, 0x8)
  /home/build/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
  /home/build/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
main.main()
  /restic/cmd/restic/main.go:86 +0x3f

goroutine 18 [chan receive]:
github.com/restic/restic/internal/restic.init.0.func1.1()
  /restic/internal/restic/lock.go:258 +0xb3
created by github.com/restic/restic/internal/restic.init.0.func1
  /restic/internal/restic/lock.go:255 +0x35

goroutine 5 [syscall]:
os/signal.signal_recv(0x0)
  /usr/local/go/src/runtime/sigqueue.go:147 +0x9d
os/signal.loop()
  /usr/local/go/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
  /usr/local/go/src/os/signal/signal.go:150 +0x45

goroutine 6 [chan receive]:
github.com/restic/restic/internal/restic.init.2.func1(0xc0000c82a0)
  /restic/internal/restic/progress_unix.go:17 +0x65
created by github.com/restic/restic/internal/restic.init.2
  /restic/internal/restic/progress_unix.go:16 +0xab

goroutine 8 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc00007cd80)
  /home/build/go/pkg/mod/go.opencensus.io@v0.22.4/stats/view/worker.go:276 +0x105
created by go.opencensus.io/stats/view.init.0
  /home/build/go/pkg/mod/go.opencensus.io@v0.22.4/stats/view/worker.go:34 +0x68

goroutine 9 [chan receive]:
main.CleanupHandler(0xc000049a40)
  /restic/cmd/restic/cleanup.go:59 +0x4c
created by main.init.0
  /restic/cmd/restic/cleanup.go:21 +0x67

goroutine 54 [IO wait]:
internal/poll.runtime_pollWait(0x7ff5bc51bb30, 0x72, 0x102af80)
  /usr/local/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc00007d618, 0x72, 0xc00034a000, 0x1865, 0x1865)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
  /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc00007d600, 0xc00034a000, 0x1865, 0x1865, 0x0, 0x0, 0x0)
  /usr/local/go/src/internal/poll/fd_unix.go:159 +0x1b1
net.(*netFD).Read(0xc00007d600, 0xc00034a000, 0x1865, 0x1865, 0x203000, 0x626abb, 0xc000332160)
  /usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc000256048, 0xc00034a000, 0x1865, 0x1865, 0x0, 0x0, 0x0)
  /usr/local/go/src/net/net.go:182 +0x8e
crypto/tls.(*atLeastReader).Read(0xc0005f0480, 0xc00034a000, 0x1865, 0x1865, 0x90, 0x1860, 0xc0005a6668)
  /usr/local/go/src/crypto/tls/conn.go:779 +0x62
bytes.(*Buffer).ReadFrom(0xc000332280, 0x1027a80, 0xc0005f0480, 0x40b3e5, 0xde7240, 0xed28a0)
  /usr/local/go/src/bytes/buffer.go:204 +0xb1
crypto/tls.(*Conn).readFromUntil(0xc000332000, 0x1029720, 0xc000256048, 0x5, 0xc000256048, 0x7f)
  /usr/local/go/src/crypto/tls/conn.go:801 +0xf3
crypto/tls.(*Conn).readRecordOrCCS(0xc000332000, 0x0, 0x0, 0x0)
  /usr/local/go/src/crypto/tls/conn.go:608 +0x115
crypto/tls.(*Conn).readRecord(...)
  /usr/local/go/src/crypto/tls/conn.go:576
crypto/tls.(*Conn).Read(0xc000332000, 0xc000389000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
  /usr/local/go/src/crypto/tls/conn.go:1252 +0x15f
net/http.(*persistConn).Read(0xc0003a86c0, 0xc000389000, 0x1000, 0x1000, 0xc00003e240, 0xc0005a6c58, 0x405835)
  /usr/local/go/src/net/http/transport.go:1887 +0x77
bufio.(*Reader).fill(0xc0004fd800)
  /usr/local/go/src/bufio/bufio.go:101 +0x105
bufio.(*Reader).Peek(0xc0004fd800, 0x1, 0x0, 0x0, 0x1, 0x0, 0xc00003e360)
  /usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc0003a86c0)
  /usr/local/go/src/net/http/transport.go:2040 +0x1a8
created by net/http.(*Transport).dialConn
  /usr/local/go/src/net/http/transport.go:1708 +0xcb7

goroutine 55 [select]:
net/http.(*persistConn).writeLoop(0xc0003a86c0)
  /usr/local/go/src/net/http/transport.go:2340 +0x11c
created by net/http.(*Transport).dialConn
  /usr/local/go/src/net/http/transport.go:1709 +0xcdc

rax    0xca
rbx    0x158c440
rcx    0x46ce83
rdx    0x0
rdi    0x158c588
rsi    0x80
rbp    0x7fffafb16c80
rsp    0x7fffafb16c38
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x3
r13    0x158bb20
r14    0x0
r15    0x0
rip    0x46ce81
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

I am using restic 0.10.0 and rest-server 0.10.0, and I have tried with both HTTP and HTTPS connections. I should also mention that this server has about 4.6 million repositories and 71 million snapshots on it. I have a separate server with only a few thousand repositories that is working fine, but this larger server responds the same way for every repository.

Thanks again for the help.

Thanks a lot for a thorough problem report!

What has changed on the server and client in relation to the problem starting to happen? Is it something that came with 0.10.0 of either rest-server or restic? Can you try 0.9.6?

I wonder if it has anything to do with the same thing as the FUTEX issues at https://github.com/restic/restic/issues?q=is%3Aissue+is%3Aopen+futex .

Also, does the rest-server say anything out of the ordinary? Does it show any activity when restic seems stuck on the client?

Can you reproduce this with a completely new repository as well?

Thanks for all the help so far @rawtaz. I think I am making progress on this. It appears this may be an issue with repositories that contain an extremely large number of snapshots. I’m going to keep digging and will report back soon on whether this needs further investigation.

Thanks again!

@scosta The stack trace you’ve posted is stuck while trying to create a lock file. If this is still a problem, you could try building a debug version of restic: https://github.com/restic/restic/blob/master/CONTRIBUTING.md#reporting-bugs

I assume the snapshots are distributed across the repositories? The rest-server shouldn’t care much about the total number, but only the number of snapshots for a repository. And for a single repository I wouldn’t recommend more than a few thousand snapshots.
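One way to check how snapshots are distributed is to count the files in each repository's `snapshots/` directory on the rest-server host. A sketch is below; the throwaway directory tree is only for illustration (point `REPOS` at the real root, e.g. the `/path/to/restic/repos` shown earlier):

```shell
# Sketch: list repositories by snapshot count to spot oversized ones.
REPOS=$(mktemp -d)   # demo tree; replace with the real repo root
mkdir -p "$REPOS/repoA/snapshots" "$REPOS/repoB/snapshots"
touch "$REPOS/repoA/snapshots/s1" "$REPOS/repoA/snapshots/s2" \
      "$REPOS/repoB/snapshots/s1"

# One line per repository: snapshot count, then repo name, largest first.
for repo in "$REPOS"/*/; do
  printf '%s %s\n' "$(ls "$repo/snapshots" | wc -l)" "$(basename "$repo")"
done | sort -rn
```

On the demo tree this prints `repoA` first with a count of 2; against the real root it would surface the repositories with tens of thousands of snapshots.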

@MichaelEischer thanks for the follow up and debugging info. That’s really helpful.

I think the large number of snapshots may indeed be the issue. Some of them are in the tens of thousands. I’m going to do some more digging to test that hypothesis.

I’ve had a few other projects come up since I posted this that I’m focusing on at the moment, so in the meantime if you would like to close it to declutter the forum, please feel free to do so!