B2, transactions, lack of compression

Howdy! I started playing with restic on a small web server. I have been testing zbackup+rclone to take hourly snapshots (including webspace, conf files, and exported MySQL data), then rclone synchronizes to B2.

I played around with restic a bit last night, eventually configuring it to back up the same content as zbackup, both on an hourly schedule.

How can I predict/control the number of calls to B2? Is it based on the number of files, amount of space already stored? Or more related to the number of changed blocks? Or something else entirely?

Most of the calls seem to be list-file-names transactions (656,697 of 754,024 total). This is only 25 snapshots' worth, so if the hourly snapshots continue it works out to around 21 million calls/month, which will run me around $10/month. The 11GB of data itself will be less than $1.
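For anyone wanting to sanity-check that estimate, here's the arithmetic in shell, using the totals quoted above (I've left pricing out, since the dollar figure depends on whatever B2 charges for class C transactions at the time):

```shell
# Rough check of the monthly estimate: total B2 calls so far, divided by
# snapshots, scaled to a month of hourly runs (numbers from this post).
calls_per_snapshot=$(( 754024 / 25 ))                # ~30,160 calls per run
calls_per_month=$(( calls_per_snapshot * 24 * 30 ))  # 720 hourly runs/month
echo "per snapshot: $calls_per_snapshot"
echo "per month:    $calls_per_month"                # ~21.7 million
```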

I should note that zbackup is still running too, so I can't tell exactly how many calls are zbackup vs restic. But I've been doing hourly zbackup snapshots for years and typically spend less than $0.30/month on class C transactions, whereas with restic I could be looking at $10.

None of these numbers will leave me unable to pay rent, but this is a small personal server that I'm using as a testbed. I could easily end up with over 600GB of data if I decide to start using this solution for the larger professional web servers, and/or our in-house data at around 1TB.

I am running restic 0.8.3 (restic_0.8.3_linux_amd64).

For reference, restic is using 10.3GB for 25 snapshots, zbackup is using 8.9GB for 294 (going back to 2018-02-17). Raw total of the files on disk is ~10.7GB.

Is there anything I could/should be doing, in particular to the transaction count?

Will these costs likely scale with the amount of data as I start using restic on other servers with substantially more content?

Is there any progress on compression? The Github bug didn’t seem to indicate that compression would be any time soon, but it is certainly possible that progress is being made that simply hasn’t been documented yet.

Just adding to this thread instead of creating a new one. Knowledgeable users, please correct any of my statements.

Restic is hands down awesome, but it uses way too many b2_download_file_by_name transactions when restoring and checking backups. I don't know whether the problem is confined to B2; if anyone can confirm, that could help new users.

I assume, from what I've read cursorily, that the problem is accentuated by the way restic performs these actions: it sometimes needs multiple transactions per file, because files are split into blobs in the process.
IMO users should be made well aware of this in the documentation until a more efficient process is created, to avoid being stung by unexpected fees. I, for one, would have appreciated a BIG warning on the check instruction page :slight_smile:

I can see how compressing and collating multiple blobs before sending them to the B2 repository could greatly reduce the number of transactions (as well as the repository size), and I feel this should really be prioritized.
To take a very common example, a few "average" WordPress folders can easily contain tens of thousands of small files. In that scenario, the few-dollar bill might scale up quickly.

Of course it is possible to pipe a pre-compressed file into restic, but that way I assume we lose the joy of restic's deduplication, effectively negating the advantages of compression.
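For what it's worth, the piping approach looks roughly like this (the mysqldump invocation and filename are just illustrative; `--stdin` and `--stdin-filename` are real restic flags):

```shell
# Pipe a compressed MySQL dump straight into restic; no temp file needed.
# A small change to the database re-compresses the whole stream, so
# restic's chunk-based dedup finds almost nothing to share between runs.
mysqldump --all-databases \
  | gzip \
  | restic backup --stdin --stdin-filename mysql-all.sql.gz
```

If your gzip supports `--rsyncable`, that makes the compressed stream more stable across small input changes and may recover some dedup, but it doesn't fully solve the problem.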

I want to finish this message thanking fd0 and any other person working on Restic. Not just for the software itself, but for the support provided here and in the issue tracker. You really handle it better than a few big name pros!

Can’t add any useful information for you but wanted to mention that this topic was discussed recently and @fd0 gave some nice insights about it.

Depending on your data you might end up with similar numbers for restic. One of my repositories as an example: the raw data is about 390G but restic uses only 355G for 480 snapshots. It all depends on how much your data changes and how well it fits restic's chunk size.

Hey, welcome to the forum!

@thedaveCA would you mind telling us which commands (exactly) you run hourly? The amount of transactions for just running restic backup feels way too high.

I guess that maybe you’re not only running restic backup, but something like restic backup && restic check && restic forget && restic prune or so, is that possible?

Yes, that’s the case. I’m not entirely sure how it is about other backends, but B2 charges for operations we routinely need during check and restore, which aren’t optimized yet. I’ll come to this in a minute.

When we started with restic, we primarily had the local and sftp backend in mind. That was my personal use case which I’m still interested in today. Then somebody contributed the s3 backend, and eventually the b2 backend was added.

So, from the start, doing backend requests was cheap and easy to do, we did not even have a metadata cache. We eventually added it in 0.8.0 at the end of 2017, and optimized the backup operation so that it performs well for most cloud backends (and B2 in particular). Slowly changing restic’s inner workings to better cope with cloud-based backends isn’t an easy thing to do…

But this is just the beginning, the other operations, especially check and prune are not yet optimized. We don’t have the resources to do all at once, but we’ll get there! :slight_smile:

There’s plenty of stuff to do, besides adding compression (which requires changing the repo format), and I’ll finish writing the new archiver code to fix nasty bugs such as #549, then I’ll have a look at how to add compression.

I don’t have much time now, I hope I’ve answered most questions. If not, ask again and I’ll catch up later :wink:

I’m not yet near a computer, but I’m only running backup.

With the ID, key and password removed, I first set:

export B2_ACCOUNT_ID=

export B2_ACCOUNT_KEY=

export RESTIC_REPOSITORY="b2:crusher:restic"

Then I have a loop that determines what to run, the final command looks like:

restic_0.8.3_linux_amd64 backup --option b2.connections=20 --tag $LABEL $BACKUPDIR

This is called 6 times, once for each tag, with one or more directories passed as $BACKUPDIR. About 245K files across all 6 tags.
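For illustration, the loop amounts to something like this (the tag names and directories below are made up; only the restic invocation itself is the real command from above):

```shell
#!/bin/sh
# Hypothetical driver loop: one backup run per tag, each with its own
# directory set, matching the command shown above.
run_backup() {
  LABEL=$1; shift
  restic_0.8.3_linux_amd64 backup --option b2.connections=20 \
    --tag "$LABEL" "$@"
}

run_backup webspace /var/www
run_backup conf /etc
run_backup mysql /var/backups/mysql
# ...three more tags, ~245K files across all six
```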

My intention was to write a check, and if successful then forget for each tag, and a final prune. But that code doesn’t even exist yet, it’s commented out zbackup code.
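Once written, that plan would amount to something like this (the retention numbers and tag list are placeholders; `check`, `forget --tag`, and `prune` are standard restic commands):

```shell
# Only thin out snapshots if the repository verifies cleanly.
if restic check; then
  for LABEL in webspace conf mysql; do   # hypothetical tag list
    restic forget --tag "$LABEL" --keep-hourly 24 --keep-daily 30
  done
  restic prune   # remove data referenced only by forgotten snapshots
fi
```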

I can likely dodge the entire transaction issue by backing up locally and using rclone sync to get the data out to B2, as that is how zbackup works. I have the spare disk space on this little server, but what attracted me to restic (vs zbackup) was to skip this very step for our much larger (and more volatile) internal datasets.
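That would look roughly like this (paths are illustrative, and it assumes an rclone remote named `B2` configured for the same account):

```shell
# Back up to a local repository, then mirror the whole repo to B2 in bulk.
export RESTIC_REPOSITORY=/srv/restic-repo
restic backup --tag webspace /var/www
rclone sync /srv/restic-repo B2:crusher/restic
```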

I appreciate the history, and it completely makes sense how the initial design might not translate to how B2’s billing model is designed. It’s definitely an interesting project.

Take a look at restic-runner. It is a helper script which basically does what you described.