I’m also having memory issues when running “restic check” or “restic prune” on a virtual Ubuntu 16.04 standard virtual machine with 4GB memory and 1GB swap. Some details on the repo:
[4:19] 545420 directories, 2661668 files, 830.173 GiB
restic disk usage: 551G # happy deduplication!
105 snapshots # why I want to run prune!
When I run restic check with GOGC=20, I get the following error:
check snapshots, trees and blobs
fatal error: runtime: out of memory
When I run restic prune with GOGC=80, I get the following error:
find data that is still in use for 105 snapshots
EOF
ReadFull(<data/4de420d135>)
github.com/restic/restic/internal/restic.ReadAt
/tmp/restic-build-354782844/src/github.com /restic/restic/
internal/restic /readerat.go:41
...
main.main
src/github.com/restic/restic/cmd/restic/main.go:69
runtime.main
/usr/local/go/src/runtime/proc.go:198
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
I can increase swap on my SSD in an exponential search (2,4,8,…128 GB)
Hm, very interesting! This is not a memory issue. I think you found a corner case which the code does not handle yet.
Can you please have a look at the file whose name starts with 4de420d135 in the data/4d subdir? Is that file empty? Can you please run sha256sum on it?
Ok, what happens here is that somehow the transfer of the file in question was interrupted and/or aborted without restic noticing, and the file is truncated. During prune, restic first recognizes that the file is invalid:
incomplete pack file (will be removed): 4de420d135c5[...]
But then it tries to load data from the end of the file, which isn’t there:
EOF
ReadFull(<data/4de420d135>)
And the code does not handle this failure case well, so you see a stack trace.
You can see that the file is broken because the file name does not equal the SHA-256 hash of the contents:
Congratulations, you’ve found a bug. On the downside: Your repo is missing data and I don’t have any idea how that happened
I’ll try to reproduce the situation locally so that I can push a fix so you can run prune again…
Once I’ve fixed the bug and you can run prune again, most of the repo can be used just fine. Once you remove the snapshot which references the data that isn’t there, the repo will be without errors again. No need to push the data again, but maybe it’s a good idea to run restic check --read-data afterwards to make sure there aren’t any other truncated files…
For now you can try moving the file from the repo to somwhere else and running prune again. But maybe first run restic check --read-data to see how many other truncated files are there (and how much data stored in these files is still referenced).
I think that problem is that my wife and kids use the laptop regularly, I am scheduling an hourly backup from the laptop, and the the laptop is regularly put to sleep. So, I probably increased my likelihood of hitting this bug.
Is there a chance that pushing the data from the laptop to rest-serve instead of directly to the repo via sftp would either remove or reduce the probability of this bug happening again?
And, I’m also happy to run via sftp for another week or so to see if your fix works in the wild.
Sincerely and gratefully,
Chris
Ps. Please take time with your family and friends, and consider how you can delegate tasks to the Restic community. I don’t know much go, but I do have OS X, Ubuntu, and have compiled Restic on Ubuntu.
I’d also be willing to help fix this, especially if it can be reproduced. (I’m following the issue on GitHub.)
I have a feeling it might have to do with restic’s error checking or retry logic with regards to HTTP errors? Anyway, I don’t know, but worth looking into.
Thank you again. I am working the problem this weekend.
Is there a chance that pushing the data from the laptop to rest-serve instead of directly to the repo via sftp would either remove or reduce the probability of this bug happening again?
If there is a greater than epsilon chance that it will, I would like to try it.
If your transfer costs are high, you should wait for Alexander’s advice before continuing, but: I have always found HTTPS to be faster and more reliable than SFTP, so, I dunno; if I was you, I’d try using rest-server if there’s little or no cost to you. Re-attempting the backup certainly shouldn’t hurt your repository.
Matt, Thank you for your advice. My current plan is to 1) finish duplicating the repo, 2) run restic check --read-data, 3) delete any damaged files, 4) try prune again. Then, given that you find https to be faster and more reliable, I’ll set up rest-serve and give it a try.
Ok, I’ve started running “restic check --read-data” with 512 unreferenced packs and > 115 “tree errors” so far (see below). When “restic check --read-data” is done, I will plan to remove the truncated files, then run “restic check (without --read-data)” and “restic prune”.
Does this seem reasonable?
Will any missing data in the repo be replaced the next time I run a backup?
I will also save the “restic check --read-data” log file and can confirm how much data stored in the truncated files was still referenced.
partial log below:
restic@ubuntu-backuppc:~/log$ grep -e "not referenced" restic.check.read.data.log | wc
512 3584 50688
restic@ubuntu-backuppc:~/log$ grep -e "error for tree" restic.check.read.data.log | wc
115 460 2875
restic@ubuntu-backuppc:~/log$ grep -e 638c23837cba4d5c3f1f62ec96f41cee251ec3a48568ea0280da3fa98cb1951d -e 8782d054 -e 24d71c4c0e *read*log
pack 638c23837cba4d5c3f1f62ec96f41cee251ec3a48568ea0280da3fa98cb1951d: not referenced in any index
error for tree 8782d054:
ReadFull(<data/24d71c4c0e>): EOF
restic@ubuntu-backuppc:~/log$ grep -e "not referenced" restic.check.read.data.log | wc
512 3584 50688
restic@ubuntu-backuppc:~/log$ grep -e "error for tree" restic.check.read.data.log | wc
115 460 2875
storage ID f37b267d
load indexes
check all packs
pack 638c23837cba4d5c3f1f62ec96f41cee251ec3a48568ea0280da3fa98cb1951d: not referenced in any index
pack bbb2ba7355b04de303a2352f89174731194134950eb833c21a7407bbacc35fbe: not referenced in any index
pack f839786e464ac0e8d7a3a73874ea92e778f9a806646bffaf6fe0e89ca36b9729: not referenced in any index
...
check snapshots, trees and blobs
error for tree f7820509:
ReadFull(<data/24d71c4c0e>): EOF
error for tree 44b1172e:
ReadFull(<data/9837b351e7>): EOF
error for tree 9cc3ebb4:
ReadFull(<data/4de420d135>): EOF
error for tree b9c4ab51:
ReadFull(<data/4de420d135>): EOF
error for tree 82dda40f:
ReadFull(<data/4de420d135>): EOF
Some updates: 1) I made a copy of the repo; 2) I ran restic check and removed 3 truncated files, but 3) restic check; restic rebuild-index; restic prune does not work with either 0.8.2 or 0.8.3.
Could longer logs or running with debug be helpful?
Could running “restic backup” again or restic from the “prune-agressive” branch be helpful?
Gratefully,
-Chris
restic check:
load indexes
check all packs
pack 52bc7418d...: not referenced in any index
pack 8067d4b62...: not referenced in any index
...
pack 2968ceba8...: not referenced in any index
pack 118fefb892...: not referenced in any index
pack 2334bc7c4...: not referenced in any index
pack 24d71c4c0...: does not exist
pack 9837b351e...: does not exist
pack 4de420d13...: does not exist
check snapshots, trees and blobs
Tue Apr 24 22:12:09 PDT 2018
restic rebuild-index:
Sun Apr 29 08:20:56 PDT 2018
counting files in repo
[11:55] 100.00% 121839 / 121839 packs
finding old index files
saved new indexes as [a270eaf6 4e3e0e01 1fd82b99 2a569f6c a73c42b2 8e9d90f3 7888e2b8 06b42ee8 f4841954 ca430f63 5b1079ac bc107d69 86ab4
279 01dfd00b bb592cd1 eb98af0a 932db14f c7c5352e 51cc36f8 c935c9f0 ed1c42da ca9b6b78 04982b42 66edb561 f9893a15 9d4a13fc 06a5f743 e28f0
87d ced4085f 095ee8d8 cd721caa a5f2ac3d b569beda 0e8f9e87 1250f5d9 1c7811b1 981e025f 305065c6 dc547d10 9a251e4f 03abded5]
remove 41 old index files
Sun Apr 29 08:37:15 PDT 2018
restic prune:
Sun Apr 29 08:37:15 PDT 2018
counting files in repo
building new index for repo
[5:31] 100.00% 121839 / 121839 packs
incomplete pack file (will be removed): 039a87e...
incomplete pack file (will be removed): 0409c3e...
incomplete pack file (will be removed): 045a9fd...
...
incomplete pack file (will be removed): fe7d4476b...
repository contains 121745 packs (2716576 blobs) with 568.842 GiB
processed 2716576 blobs: 20563 duplicate blobs, 1.369 GiB duplicate
load all snapshots
find data that is still in use for 126 snapshots
tree d66450e99bf65fdc0d248a57d772109ed77b2b8544b5d5f53392106dad187d02 not found in repository
github.com/restic/restic/internal/repository.(*Repository).LoadTree
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/repository/repository.go:620
github.com/restic/restic/internal/restic.FindUsedBlobs
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/restic/find.go:11
github.com/restic/restic/internal/restic.FindUsedBlobs
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/restic/find.go:31
github.com/restic/restic/internal/restic.FindUsedBlobs
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/restic/find.go:31
github.com/restic/restic/internal/restic.FindUsedBlobs
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/restic/find.go:31
github.com/restic/restic/internal/restic.FindUsedBlobs
/tmp/restic-build-413028034/src/github.com/restic/restic/internal/restic/find.go:31
main.pruneRepository
src/github.com/restic/restic/cmd/restic/cmd_prune.go:191
main.runPrune
src/github.com/restic/restic/cmd/restic/cmd_prune.go:85
main.glob..func17
src/github.com/restic/restic/cmd/restic/cmd_prune.go:25
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).execute
/tmp/restic-build-413028034/src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:698
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/tmp/restic-build-413028034/src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:783
github.com/restic/restic/vendor/github.com/spf13/cobra.(*Command).Execute
/tmp/restic-build-413028034/src/github.com/restic/restic/vendor/github.com/spf13/cobra/command.go:736
main.main
src/github.com/restic/restic/cmd/restic/main.go:69
runtime.main
/usr/local/go/src/runtime/proc.go:198
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2361
Sun Apr 29 08:50:43 PDT 2018
Below is a bash script (block #2) that I think will reproduce an important part of the error that I have been struggling with. It makes a backup of an old Debian distribution (~200 MB). It chooses three data blobs at random and truncates them. Then it runs restic check, prune, rebuild-index, etc.
Looking at the logs more carefully, it seems that restic 0.8.3 did recover from three truncated blobs. I’m running the same sequence of commands on my larger repo to see if that will fix the problem.