Restic + Alpine Container + Cron hangs (epoll_pwait)

I was trying to run a backup job periodically as a cron job. After initialising the repo, everything runs correctly if I start my backup script manually (stripped down to a minimal example):

    #!/bin/bash
    export 'RESTIC_REPOSITORY=local:/tmp'
    export 'RESTIC_PASSWORD=foo'
    restic backup /etc

but running the same script as a cron job results in an endless hang with this output:

    crond[1]: USER root pid  44 cmd /bin/cronscript
    repository 318a33a7 opened successfully, password is correct
    crond[1]: user root: process already running: /bin/cronscript
    crond[1]: user root: process already running: /bin/cronscript

Steps to reproduce:
Build and use a Dockerfile like this:

    FROM restic/restic
    RUN  apk --no-cache add strace
    RUN echo "* * * * * /bin/cronscript"  |crontab - && \
        echo "#!/bin/ash" > /bin/cronscript && \
        echo "export 'RESTIC_REPOSITORY=local:/tmp'" >> /bin/cronscript && \
        echo "export 'RESTIC_PASSWORD=foo'"  >> /bin/cronscript && \
        echo "restic backup /etc" >> /bin/cronscript && \
        chmod a+x /bin/cronscript
    RUN export RESTIC_REPOSITORY=local:/tmp && export RESTIC_PASSWORD=foo && restic init
    ENTRYPOINT crond -f
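
For completeness, I build the image with a plain docker build; the tag "rt" is the one used in note (2) below:

    docker build -t rt .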

When I take a look at the running restic process with strace, I only get this output:

    / # ps aux
    PID   USER     TIME  COMMAND
        1 root      0:00 crond -f
       13 root      0:00 sh
       44 root      0:00 {cronscript} /bin/ash /bin/cronscript
       45 root      0:06 restic backup /etc
       55 root      0:00 ps aux
    / # strace -p 45
    strace: Process 45 attached
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360441, tv_nsec=77852481}) = 0
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360441, tv_nsec=80305781}) = 0
    epoll_pwait(3, [], 128, 6127, NULL, 0)  = 0
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360447, tv_nsec=216650881}) = 0
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360447, tv_nsec=216957881}) = 0
    epoll_pwait(3, [], 128, 9991, NULL, 0)  = 0
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360457, tv_nsec=218621681}) = 0
    clock_gettime(CLOCK_MONOTONIC, {tv_sec=360457, tv_nsec=218896281}) = 0
    epoll_pwait(3, ^Cstrace: Process 45 detached
     <detached ...>

Note:
(1) This example has been reduced to a minimum to present this strange behavior. My real script is actually much more complex and uses a remote restic rest-server as the backup location. I reduced the complexity and found out that this occurs only on Alpine; with an Ubuntu Docker image everything is okay. But I really would like to run this with Alpine, especially with the "restic/restic" image.
(2) To be able to use strace, you have to start the built image like this:

    docker run -ti --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rt

I also tried using the normal "alpine" Docker image; restic still hangs.
Furthermore, I changed the content of the cron job to another command, which executed and exited normally.

So this problem seems to be related to using restic + Alpine + crond!?

Does anyone have an idea what might cause restic to hang?

Hm, no idea what's going on here. Just a suggestion: the Go runtime uses multiple OS threads, so you'll need to run strace -f -y -p <pid> to catch all threads (-f) and add the file names to file descriptors if possible (-y). Maybe that'll help to see something?
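
For the restic PID from the ps output above (45), that would be:

    strace -f -y -p 45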

Thanks fd0 for looking into this :slight_smile:
I tried running strace again with your proposed flags; the output reads:

[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=229663381}) = 0
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] <... ioctl resumed>, [1])   = 0
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=230488781}) = 0
[pid    29] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid    28] getpgid(0)                  = 27
[pid    28] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] <... nanosleep resumed>NULL) = 0
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] <... ioctl resumed>, [1])   = 0
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=234374381}) = 0
[pid    28] getpgid(0 <unfinished ...>
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] <... getpgid resumed>)      = 27
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=235091781}) = 0
[pid    28] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid    28] <... ioctl resumed>, [1])   = 0
[pid    28] getpgid(0 <unfinished ...>
[pid    29] <... nanosleep resumed>NULL) = 0
[pid    28] <... getpgid resumed>)      = 27
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=237755781}) = 0
[pid    28] <... ioctl resumed>, [1])   = 0
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid    28] getpgid(0 <unfinished ...>
[pid    29] <... clock_gettime resumed>{tv_sec=397021, tv_nsec=238508681}) = 0
[pid    28] <... getpgid resumed>)      = 27
[pid    29] getpid()                    = 28
[pid    28] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] tgkill(28, 28, SIGURG)      = 0
[pid    28] <... ioctl resumed>, [1])   = 0
[pid    29] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid    28] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=28, si_uid=0} ---

It seems to be stuck in this loop right after restic printed:

    repository 70fbba4f opened successfully, password is correct

Increasing the verbosity of restic does not change the outcome.

One thing that is odd: the PIDs shown in the strace output cannot be seen when running ps:

[pid    29] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid    35] <... getpgid resumed>)      = 27
[pid    35] ioctl(1</dev/pts/0>, TIOCGPGRP <unfinished ...>
[pid    29] <... nanosleep resumed>NULL) = 0
[pid    29] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
^C[pid    35] <... ioctl resumed>, [1])   = 0
strace: Process 28 detached
strace: Process 29 detached
strace: Process 30 detached
strace: Process 31 detached
strace: Process 32 detached
strace: Process 33 detached
strace: Process 34 detached
strace: Process 35 detached
strace: Process 36 detached
strace: Process 37 detached

/ # ps aux
PID   USER     TIME  COMMAND
    1 root      0:00 crond -f -d 8
   11 root      0:00 sh
   27 root      0:00 {cronscript} /bin/ash /bin/cronscript
   28 root      7:16 restic backup --no-cache -v -v /etc
   49 root      0:00 ps aux

Maybe someone who understands strace output better than I do can help here… :wink:

Hm, so it does something before and then waits for the result. You could try building restic with debug support (see the "Participating" chapter of the restic documentation) and writing a debug log. This will contain a detailed list of the things restic does. Maybe we can see something there.
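
If I remember correctly, that looks roughly like this (check the linked chapter for the exact steps; this assumes a Go toolchain and a checkout of the restic source):

    # build a restic binary with debug support enabled
    go run build.go -tags debug

    # point restic at a debug log file, then run the backup as usual
    export DEBUG_LOG=/tmp/restic-debug.log
    ./restic backup /etc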

That's likely because (by default) ps doesn't print the process IDs of threads created by programs, just the main PID itself.
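
If you want to see them anyway, the thread IDs of the restic process (PID 28 in your last ps output) are listed in procfs, for example:

    ls /proc/28/task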

You could also send a SIGABRT signal to restic. The Go runtime will then abort and print a stack trace of all running goroutines. That would help with debugging the hang.
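
For example, from a shell inside the container (28 being the restic PID from your ps output):

    kill -ABRT 28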

Thank you Michael for your idea. The output states:

repository 70fbba4f opened successfully, password is correct
SIGABRT: abort
PC=0x46d7c0 m=0 sigcode=0

goroutine 0 [idle]:
runtime.epollwait(0x7fff00000003, 0x7fff7103d230, 0x270800000080, 0x0, 0x2708, 0x450325, 0x0, 0x15d07c0, 0x3, 0x0, ...)
	/usr/local/go/src/runtime/sys_linux_amd64.s:725 +0x20
runtime.netpoll(0x2539784fc, 0x19926604cda01)
	/usr/local/go/src/runtime/netpoll_epoll.go:126 +0x92
runtime.findrunnable(0xc00003a000, 0x0)
	/usr/local/go/src/runtime/proc.go:2434 +0x785
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:2669 +0x2d7
runtime.park_m(0xc000000180)
	/usr/local/go/src/runtime/proc.go:2837 +0x9d
runtime.mcall(0x0)
	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b

goroutine 1 [chan receive]:
gopkg.in/tomb%2ev2.(*Tomb).Wait(0xc0000b6fa0, 0x0, 0x0)
	/home/build/go/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:126 +0x4a
main.glob..func2(0x1632b00, 0xc0004fc240, 0x1, 0x4, 0x0, 0x0)
	/restic/cmd/restic/cmd_backup.go:76 +0x219
github.com/spf13/cobra.(*Command).execute(0x1632b00, 0xc0004fc1c0, 0x4, 0x4, 0x1632b00, 0xc0004fc1c0)
	/home/build/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:826 +0x47c
github.com/spf13/cobra.(*Command).ExecuteC(0x1635a80, 0xc000451fa0, 0x4, 0x10)
	/home/build/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
	/home/build/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
	/restic/cmd/restic/main.go:98 +0x3f

goroutine 6 [chan receive]:
github.com/restic/restic/internal/restic.init.0.func1.1()
	/restic/internal/restic/lock.go:258 +0xb3
created by github.com/restic/restic/internal/restic.init.0.func1
	/restic/internal/restic/lock.go:255 +0x35

goroutine 17 [syscall]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:147 +0x9d
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:150 +0x45

goroutine 18 [chan receive]:
github.com/restic/restic/internal/restic.init.2.func1(0xc0000303c0)
	/restic/internal/restic/progress_unix.go:17 +0x65
created by github.com/restic/restic/internal/restic.init.2
	/restic/internal/restic/progress_unix.go:16 +0xab

goroutine 33 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000a8480)
	/home/build/go/pkg/mod/go.opencensus.io@v0.22.4/stats/view/worker.go:276 +0x105
created by go.opencensus.io/stats/view.init.0
	/home/build/go/pkg/mod/go.opencensus.io@v0.22.4/stats/view/worker.go:34 +0x68

goroutine 34 [chan receive]:
main.CleanupHandler(0xc000095920)
	/restic/cmd/restic/cleanup.go:59 +0x4c
created by main.init.0
	/restic/cmd/restic/cleanup.go:21 +0x67

goroutine 35 [runnable]:
syscall.Syscall(0x10, 0x1, 0x540f, 0xc000104d30, 0x0, 0xc000104d30, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
golang.org/x/sys/unix.ioctl(0x1, 0x540f, 0xc000104d30, 0x0, 0x0)
	/home/build/go/pkg/mod/golang.org/x/sys@v0.0.0-20201007082116-8445cc04cbdf/unix/zsyscall_linux.go:41 +0x4c
golang.org/x/sys/unix.IoctlGetInt(...)
	/home/build/go/pkg/mod/golang.org/x/sys@v0.0.0-20201007082116-8445cc04cbdf/unix/ioctl.go:60
github.com/restic/restic/internal/ui/termstatus.isProcessBackground(0x1, 0x7d22746f6f722201, 0xc0001ae450, 0x0)
	/restic/internal/ui/termstatus/background_linux.go:21 +0x4b
github.com/restic/restic/internal/ui/termstatus.IsProcessBackground(0x1, 0xc000104e0c)
	/restic/internal/ui/termstatus/background_linux.go:12 +0x2b
github.com/restic/restic/internal/ui/termstatus.(*Terminal).run(0xc000095f20, 0x10b5480, 0xc0001ae440)
	/restic/internal/ui/termstatus/status.go:99 +0x5e5
github.com/restic/restic/internal/ui/termstatus.(*Terminal).Run(0xc000095f20, 0x10b5480, 0xc0001ae440)
	/restic/internal/ui/termstatus/status.go:86 +0x70
main.glob..func2.1(0x0, 0x0)
	/restic/cmd/restic/cmd_backup.go:69 +0x6d
gopkg.in/tomb%2ev2.(*Tomb).run(0xc0000b6fa0, 0xc0004c5120)
	/home/build/go/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:163 +0x38
created by gopkg.in/tomb%2ev2.(*Tomb).Go
	/home/build/go/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:159 +0xbc

goroutine 12 [select]:
main.refreshLocks(0x1642060, 0xc00035a420)
	/restic/cmd/restic/lock.go:70 +0x1c5
created by main.lockRepository
	/restic/cmd/restic/lock.go:47 +0x275

rax    0xfffffffffffffffc
rbx    0x2708
rcx    0x46d7c0
rdx    0x80
rdi    0x3
rsi    0x7fff7103d230
rbp    0x7fff7103d830
rsp    0x7fff7103d1e0
r8     0x0
r9     0x0
r10    0x2708
r11    0x246
r12    0x3
r13    0x1643140
r14    0x63
r15    0x400
rip    0x46d7c0
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

Running the compiled debug version, I can see this output:

repository/index.go:378	repository.(*Index).generatePackList.func1	1	handle blob 684b840ab5cdfab7a38c6e2fbb904e30413c693681d4bdf6ae5dd1ed4d364ac8
repository/index.go:378	repository.(*Index).generatePackList.func1	1	handle blob 06dc3d1a4385936593a100b7d1c28d205497dd3447c6cccdb6174f83e5717523
repository/index.go:403	repository.(*Index).generatePackList	1	done
local/local.go:84	local.(*Local).Save	1	Save <index/389ab89cec>
repository/repository.go:317	repository.(*Repository).SaveUnpacked	1	blob <index/389ab89cec> saved
repository/repository.go:412	repository.(*Repository).saveIndex	1	Saved index 0 as 389ab89cecbb329734fb46e1b03f79fba2007bb6b80a9ab86da494f85b10445a
repository/repository.go:289	repository.(*Repository).SaveJSONUnpacked	1	save new blob snapshot
local/local.go:84	local.(*Local).Save	1	Save <snapshot/7a4dfb5517>
repository/repository.go:317	repository.(*Repository).SaveUnpacked	1	blob <snapshot/7a4dfb5517> saved
restic/lock.go:99	main.unlockRepo	1	unlocking repository with lock PID 7 on 4a946fa5b37d by root (UID 0, GID 0)
lock was created at 2020-12-06 12:07:01 (564.8426ms ago)
storage ID 35dfa1c6
local/local.go:211	local.(*Local).Remove	1	Remove <lock/35dfa1c65b>

So restic is stuck on removing the lock in the repository??
As I said, this happens not only with local repositories but also when using a rest-server.

The culprit seems to be goroutine 35, where restic is busy determining whether it runs in the background.

That matches your debug log. Remove only prints a debug log entry before removing the lock file, but not once it's finished. The next step after removing would be to stop the terminal output goroutine, but that one seems to be stuck.
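
For context, that check compares the terminal's foreground process group (the TIOCGPGRP ioctl on fd 1 in your strace, which reports 1) with restic's own process group (getpgid(0) = 27). Since the two never match under crond, restic presumably keeps treating itself as a background process. A simplified sketch of the idea, not restic's exact code:

    // backgroundcheck.go: a simplified sketch of how a program can detect
    // whether it currently runs in the foreground of its terminal.
    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    // isProcessBackground compares the terminal's foreground process group
    // (TIOCGPGRP on fd) with this process's own process group; if they
    // differ, the process is considered backgrounded.
    func isProcessBackground(fd uintptr) (bool, error) {
        fgPgrp, err := unix.IoctlGetInt(int(fd), unix.TIOCGPGRP)
        if err != nil {
            return false, err
        }
        return fgPgrp != unix.Getpgrp(), nil
    }

    func main() {
        bg, err := isProcessBackground(os.Stdout.Fd())
        fmt.Println("background:", bg, "err:", err)
    }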

This should be fixed in

In case we look at other strace logs in the future: the epoll_pwait is from goroutine 0, which checks for network connections that have received new data to be processed. In other words, it is (nearly always) totally uninteresting for debugging.

Thanks a lot for looking into the issue and resolving it!