How much metadata is produced?

Having backed up the majority of my video files one or two directories at a time, I decided I was close enough now to simply back up the entire video directory and rely on deduplication to make sure it didn’t take too long.

The total size of my video library is 579G and restic says (after 38 hours of backing up - with an estimated 10 hours remaining!) that we’re 455G through. Backblaze, however, is reporting that I currently have 592.6G of data stored there - and I have nothing else there other than the video backups.

I had already backed up over 400G of this data before I started this run and, honestly, I’m shocked at how long it’s taking. I know that the B2 backend causes a lot of extra API calls (I forgot to use --force), but even taking that into account, I’m surprised by how slow it is.

Mostly, though, I’m surprised at how much data is going up. If I only have 455G backed up (according to restic) then, with 592.6G stored, that’s 137.6G of metadata… unless deduplication isn’t working properly…

The command I ran to back everything up:

restic -r b2:redacted:video backup /opt/video/

An example of the commands I ran to back everything else up:

restic -o b2.connections=32 -r b2:redacted:video backup /opt/video/Blackadder/ /opt/video/studio_ghibli/

I also, sometimes, used the --force flag.

How long it takes obviously has to do with the network connection and latency. What are those, in this case?

At least for me that is not unexpected: at the moment, restic will always re-read (and re-hash) all files whenever a backup of a new set of directories is made.

So if you backed up /foo/bar and /foo/baz before, the next backup of exactly these two directories will use the old backup to check whether files have changed (mainly by timestamp). When a file hasn’t changed, the list of blobs from the previous backup is reused.

When you back up a new directory, e.g. /foo, or add a directory, restic will currently re-read all files again. This may be optimized in the future, but that’s the way it is right now. And I suspect this is the reason why the backup run is taking so long for you.
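To illustrate with the /foo example (the repository name is the one from earlier in this thread, the paths are just placeholders):

# Same set of directories as before: restic finds the previous snapshot as a
# parent and skips unchanged files based on their metadata.
restic -r b2:redacted:video backup /foo/bar /foo/baz

# Different set (e.g. the parent directory): no matching parent snapshot is
# found, so all files are read and hashed again - although blobs that are
# already in the repo are not uploaded a second time.
restic -r b2:redacted:video backup /foo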

I’m not sure how accurate that estimate is, but depending on the number of files and directories, and also the file sizes, restic needs to save the following metadata:

  • For each directory, a JSON document is saved that looks like the example in doc/design.rst in the restic GitHub repository. So if the directory has many files in it, or the files are very large, these documents become large. Unfortunately restic doesn’t support compression yet (the repo format needs to be changed, which we’ll do at some point, but that breaks backwards compatibility for older versions).
  • Files are split into blobs, which are then packed together into pack files and saved in the repo. Each pack file has a header at the end which states what’s contained in it. This enlarges the raw data slightly. Restic also needs to know which blob is stored at what offset in each of the pack files. In order not to have to read all headers of all pack files on every run, restic writes index files (in the subdir index/ in the repo), each of which is basically a JSON document with a list of pack files and their contents (again, see doc/design.rst in the restic GitHub repository). When there are many blobs and pack files, these index files will probably get large as well - you can have a look at them yourself, see the commands right after this list.
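A rough sketch of how to inspect this metadata yourself (the ID is a placeholder, pick one from the list output):

restic -r b2:redacted:video list index
restic -r b2:redacted:video cat index <ID>

The first command lists the IDs of the index files, the second prints one of them as JSON.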

For a large collection of video files (which tend to be very large and not contain any duplicate data at all) the metadata that needs to be saved is much larger than for other types of files. Still, the sizes you stated (455GiB done vs. 592.6G stored at Backblaze) seem too large to me - if that were also GiB, it would mean roughly 30% metadata overhead, which is way too much. Also: what is Backblaze displaying? Gibibytes (as restic does) or Gigabytes?

Did you maybe interrupt several backups? In that case there might be data stored in the repo that is not referenced; a run of restic prune should clean this up.

This is likely not the B2 API, but the re-reading and re-hashing of all data. You should be able to observe restic’s IO on the disk versus the network: the latter should be low, the former high. Also, restic will probably use a lot of CPU.
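For example (iotop and nethogs are just two of many tools that can show this, and the interface name is an assumption):

# per-process disk IO, showing only processes that are actually doing IO
sudo iotop -o
# per-process network traffic on the given interface
sudo nethogs eth0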

Can you check how much is really stored within the B2 bucket?

It looks like Backblaze displays Gigabytes. One of my test buckets contains “2.1GB” (Gigabytes) according to the Backblaze website; downloading it reveals that it contains 2095226664 bytes of data, which is about 1998 MiB or 1.95 GiB.

So in your case the bucket probably contains about 552 GiB, which is still a lot.
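If you want to double-check the unit conversion yourself, GNU coreutils’ numfmt can do it (assuming it’s installed):

numfmt --to=si 2095226664     # prints 2.1G - decimal units, like the Backblaze website
numfmt --to=iec 2095226664    # prints 2.0G - binary units, like restic (about 1.95 GiB)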

It certainly feels like there’s more than just metadata there - which, presumably, suggests some sort of duplication…?

I’m currently at 473GiB/578GiB - with Backblaze reporting 605.8 GB stored. “du -sh” reports 579G in the directory.

I did wonder if it’s something to do with the root directory being used - previously I backed up /opt/video/, whereas this time it was just /opt/video - would that make a difference to restic?

Edit: just realised that my last point may have been what you were referring to with your /foo example

It’s possible - the only way this data may have ended up there is if a running backup was cancelled. Did that happen several times? Anyway, the next run of restic prune (which I’d suggest doing with the code from the restic master branch, so you get the awesome new cache implementation) will clean that up.
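Roughly, that would look like this (assuming a working Go installation; the URL is restic’s GitHub repository):

git clone https://github.com/restic/restic
cd restic
go run build.go
./restic -r b2:redacted:video prune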

That won’t make a difference concerning the data that is stored in the repo. Each blob of data is stored just once. The metadata may be different, but it seems we agree that it feels like too much for “just” metadata.

I don’t recall cancelling a backup (although it’s not impossible - I’ve done so much restic stuff over the last couple of weeks), but I have had at least one failure, with the JSON error I mentioned in another thread. I’m not sure at which point that failed, but I suppose it could well have uploaded a lot before the error.

Edit: I suppose I have to wait for the backup to complete before running a prune?

Yep, prune requires an exclusive lock, which restic won’t give you while a backup is running. Please report back!
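If you want to see which locks are currently held on the repo, you can list them (just for reference):

restic -r b2:redacted:video list locks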

Will do! ETA has dropped to 3 and a half hours - so it won’t be too long!

OK - the backup is complete and Backblaze reports 620.8 GB stored for only 579G of files. That seems like a lot of metadata, so I’m running prune - will report back when it’s complete.

Hm, to be honest, that looks reasonable: you have 579GiB of data, and Backblaze reports 620.8GB, which is only about 578GiB - so the metadata overhead is tiny, well under 1%. That feels much better than 30%!

I think I know what happened here: You’ve saved several directories separately, so much of the data was already in the repo, and restic did not save it again. It spent the time reading (and hashing) the data again, only to then detect that the blobs are already there. So just a bit of metadata is added to the repo.

When you had a look at the sizes, you saw that restic had read ~80% (= 463GiB) of the data, but the repo already contained about 552GiB (= 592.6GB) of data. That data was not only saved by the current run of restic, but also by previous runs!

Did I forget anything?

Final quick question - does this mean that every time I now run a backup of /opt/video it’ll take more than 40 hrs? Or is it only because I hadn’t explicitly backed up “/opt/video” before? Will cached metadata help this?

Obviously, I’m aware that if there are new files to back up it’ll take longer - although, presumably, only as long as it takes to process and upload the new file(s)?

No, it’ll be much faster.

Exactly, restic will find the previous snapshot then and take that as a template. :)

Yes, because loading metadata from the previous snapshot will be much faster.

For new files that are added, the backup will take some time to upload the new data. How much time that is depends on what data is added. For additional video files (with not much internal duplication that could be deduplicated) it’ll take roughly the time needed to upload that many bytes to B2.
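As a rough sketch of that estimate (the 4 GiB file size and 10 Mbit/s upstream are made-up numbers):

# seconds to upload 4 GiB at 10 Mbit/s: 4 * 1024 MiB, times 8 bits per byte, divided by 10
echo $(( 4 * 1024 * 8 / 10 ))    # prints 3276, i.e. a bit under an hour (ignoring MiB vs MB)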

Wasn’t sure whether to start a new thread for this, but figured it was related to this issue since it happened during the prune and is (presumably) to do with the amount of metadata stored.

The output from the prune command - so far:

counting files in repo
building new index for repo
pack file cannot be listed 77d121c1: Stat: unexpected end of JSON input
pack file cannot be listed 7800e556: Stat: unexpected end of JSON input
[9:29:21] 49.81%  58410 / 117277 packs
[18:57:09] 100.00%  117277 / 117277 packs
repository contains 117275 packs (410062 blobs) with 578.123 GiB
processed 410062 blobs: 40 duplicate blobs, 63.526 MiB duplicate
load all snapshots
find data that is still in use for 10 snapshots
[1:05] 100.00%  10 / 10 snapshots
found 410026 of 410062 data blobs still in use, removing 36 blobs
will remove 0 invalid files
will delete 0 packs and rewrite 123 packs, this frees 64.115 MiB
[16:17] 100.00%  123 / 123 packs rewritten
counting files in repo
[2:16:49] 12.96%  15201 / 117266 packs

Interestingly, restic claims to have freed up around 64MiB - but Backblaze is now reporting 621.4 GB, up from the previous 620.8. Not a huge jump, but still the opposite of what I’d have expected…

Also, notice again the unexpected JSON error.

Hm, I’ll check with the author of the B2 library we’re using.

When did you look at the bucket size? Was the prune already finished? Restic will remove data as the very last step, so if you checked before it’s expected that the data is still there. Could you check again?

Ah, OK - it hasn’t finished yet - it’s 50% through the second count of the files in repo - just as it was 12.9% through in the output I posted above.

It’s just that it said rewriting the blobs would free up space and then it claimed to do that…

In order to find out what the JSON error is, can you please build restic with debug support (go run build.go -tags debug) and then run DEBUG_LOG=/tmp/restic-debug.log restic rebuild-index (you can abort it after the error is shown for the first time).

Then have a look at the end of the debug log; the HTTP request and response headers are included, so we should see what the error is about. And please make sure to remove all authentication data from the debug log.
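Spelled out with the repository from this thread, that would be something like:

go run build.go -tags debug
DEBUG_LOG=/tmp/restic-debug.log ./restic -r b2:redacted:video rebuild-index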

Will do, although it might be tomorrow now.

Running now (62% of the way through counting files) - bet it won’t throw that error now!

FYI: Yesterday we discovered a bug in the Backblaze library we’re using for restic: it didn’t reuse connections and always created new ones, which led to a number of problems. The current master branch of restic contains the fix. You can find a binary here: https://beta.restic.net/v0.7.3-87-g3afd974d/