Native/Parallel backup of block devices

Hi guys,

We have been using restic for VM backup for the last year.

Our flow:

  1. create a snapshot of the VM disk in our SAN/SDS
  2. export the snapshot as a disk to one of our restic hosts, where it appears as a read-only block device /dev/xxx
  3. run the backup as: reader process of /dev/xxx | restic backup --stdin --stdin-filename xxx

It works great and we run a lot of backups and restores per day. But the backup itself is slow. Very slow by design: there is no parallelism at all for a single block device.

What do you think about a parallel and native backup (without a pipe and --stdin) of a single block device?

By native I mean that restic would read the block device directly.
By parallel I mean that restic would split the device into X segments (e.g. sectors/bytes 0-10000, 10001-20000, ...) and run one thread per segment.

We have a few developers who could try to implement this and submit a PR upstream.

What do you think about it?

What does “slow” mean? If the disks can keep up and there’s sufficient CPU capacity available, restic should be able to process 300+ MB/s for a single file (local SSD to SSD), unless block devices behave very differently from regular files.

For me, restic backup testfile and restic backup --stdin-from-command cat testfile had similar throughput (both using a regular file, not a block device).

Another option might be something like --read-special from borgbackup, which instructs the backup to also read from device files. Treating those similarly to regular files should be relatively simple to implement.

The much trickier part is splitting large files. Independent of the implementation, it will cause later segments to be chunked differently than before as chunk boundaries are not aligned. So there would have to be some way to remain backwards compatible.

Hi, @MichaelEischer

Yes, 300 MB/s is the maximum we see (usually 250 MB/s). And that’s not bad, of course, but when we are talking about volumes of 4-16 TB it’s not enough.

The first problem with the current behavior is the linear reading process in one thread (like dd if=/dev/vdb | restic …).

Due to the nature of SDS (Ceph, for example), which distributes the data blocks of one logical volume across tens or hundreds of independent physical disks, when we read linearly, byte by byte, we only get the throughput of one or two disks (even with a large read block size and a lot of readahead).

But when we read in parallel from 4-5 different regions of the virtual disk, we get the throughput of 5-10 disks. Yes, latency can be higher, but that’s not a problem for backups.

In other words, if we read from 4-5 different regions in parallel, we can increase the read throughput by 3-4 times in fast environments.
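
To make the read pattern concrete, here is a minimal Go sketch of what I mean (this is not restic code; the device path, region count and read size are just examples): a few goroutines, each doing positional reads inside its own region of the device, so a distributed volume like a Ceph RBD image is served by several disks at once.

package main

import (
	"io"
	"log"
	"os"
	"sync"
)

func main() {
	f, err := os.Open("/dev/vdb") // hypothetical exported snapshot device
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	size, err := f.Seek(0, io.SeekEnd) // size of the block device
	if err != nil {
		log.Fatal(err)
	}

	const regions = 4 // number of parallel readers, chosen arbitrarily
	regionLen := size / regions

	var wg sync.WaitGroup
	for i := int64(0); i < regions; i++ {
		start, end := i*regionLen, (i+1)*regionLen
		if i == regions-1 {
			end = size // the last region also covers the remainder
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			buf := make([]byte, 1<<20) // 1 MiB per read
			for off := start; off < end; {
				if rest := end - off; rest < int64(len(buf)) {
					buf = buf[:rest] // final, shorter read of this region
				}
				n, err := f.ReadAt(buf, off) // pread, safe for concurrent use
				off += int64(n)
				if err != nil {
					return // io.EOF at the very end of the device is expected
				}
			}
		}(start, end)
	}
	wg.Wait()
}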

I’m not sure that restic can currently handle data at this rate, even with unlimited CPU, because a backup of /dev/loop0 (a file on tmpfs) with the repo on tmpfs shows about 300-350 MB/s, while replacing the dd | restic with a simple dd | dd gives ~2.5 GB/s.

Maybe the second problem is restic itself and its internals (which I don’t know well).

Maybe another solution could be to run a few independent restic processes on one virtual disk, like this:

back up different parts of the device in parallel, but as one backup:

restic backup --block-device /dev/vdb --offset 0 --size 50GB --filename volume_backup_1
restic backup --block-device /dev/vdb --offset 50GB --size 50GB --filename volume_backup_1

complete the backup when all parts are done:

restic backup --complete --filename volume_backup_1

Of course, running a single restic process which does the full backup in its own threads is much better than a multi-process scenario, which is hard to orchestrate, but it’s just an idea.

I don’t understand this point. Today we back up one disk and read it from sector 0 to the end, and we do that for every backup.
If we split the disk into 5 segments, we effectively get 5 disks of smaller size.

I mean that we must do consistent splitting. Every time we have a 100 GB disk, we split it into 20 GB segments, and the segment size stays the same for every backup.
It could be specified by something like --segment-size-in-percents 20, so every backup we would have 5 segments with the same offsets. I mean that we shouldn’t have any problem with CDC when we back up static block devices.

I think parallel reading will only be possible with a chunker which is able to read in parallel. For me, constant chunk sizes directly come to mind. I actually don’t know whether content-defined chunkers that read in parallel exist.

What you are proposing, @FATruden, is basically combining a constant chunk size (where the “constant” chunk size depends on the total file size?!?) with an additional content-defined chunking in a second step. For me this seems overly complicated - I would just use constant chunk sizes (something between 1MiB and 8MiB) for your use case…

What you are proposing, @FATruden, is basically combining a constant chunk size (where the “constant” chunk size depends on the total file size?!?) with an additional content-defined chunking in a second step.

Yes, we will have two phases.

The first phase splits the device into chunks/segments depending on the total device length/size and the maximum number of parallel threads specified by the admin, for example 5. At the end of this phase we have something like 5 “virtual devices”.

In the second phase we run 5 independent threads on these five segments. The 5 threads do independent CDC, but each one knows the offset of its segment (0-50000 sectors/bytes/etc.), and this offset is used when storing the data blobs in the repo.
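
As an illustration of these two phases, here is a rough Go sketch built directly on the github.com/restic/chunker package (this is not how restic’s archiver is actually structured; the device path, segment math and blob handling are made up):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
	"sync"

	"github.com/restic/chunker"
)

func main() {
	const parallel = 5 // max parallel threads chosen by the admin

	f, err := os.Open("/dev/vdb") // hypothetical device
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	size, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		log.Fatal(err)
	}

	// Phase 1: split the device into `parallel` segments ("virtual devices").
	segLen := (size + parallel - 1) / parallel

	// In restic the chunker polynomial comes from the repository config;
	// here we just generate one for the sketch.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		log.Fatal(err)
	}

	// Phase 2: one worker per segment, each running its own CDC chunker.
	var wg sync.WaitGroup
	for off := int64(0); off < size; off += segLen {
		length := segLen
		if off+length > size {
			length = size - off
		}
		wg.Add(1)
		go func(segOff, segLen int64) {
			defer wg.Done()
			// io.SectionReader gives this worker a read-only window of the device.
			section := io.NewSectionReader(f, segOff, segLen)
			c := chunker.New(section, pol)
			buf := make([]byte, 8*1024*1024)
			for {
				chunk, err := c.Next(buf)
				if err == io.EOF {
					return
				}
				if err != nil {
					log.Println(err)
					return
				}
				// segOff + chunk.Start is the absolute offset of this blob on the device.
				sum := sha256.Sum256(chunk.Data)
				fmt.Printf("blob %x at offset %d, %d bytes\n",
					sum[:4], segOff+int64(chunk.Start), chunk.Length)
			}
		}(off, length)
	}
	wg.Wait()
}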

For me this seems overly complicated - I would just use constant chunk sizes (something between 1MiB and 8MiB) for your use case…

Mm, good idea. But in my case we have a “static” first phase, where we calculate the segment offsets only at the start of the backup, and that’s all. In your proposal we would have “dynamic” chunk selection: when thread 3 completes its chunk, 8 MB for example, it must pick the next chunk offset which has not been processed yet. I think that’s more complicated. But I don’t know restic internals well, maybe I’m wrong.

So, is it okay if we try to implement something like this?

No, there is nothing dynamic. If we use a constant chunk size of 8MB, then the first chunk starts at offset 0, the second starts at offset 8MB, the third at offset 16MB, the 4th at offset 24 MB, and so on… Every chunk has a size of exactly 8MB except the last one which may be smaller. You know every chunk offset and size directly when starting by just knowing the total file (or block device) size.

Distributing this over n threads is not complicated but a standard problem in parallelization which is trivial to solve (e.g. using channels and distributing the work).
Also note that in the current restic implementation the processing (hashing, compressing,…) of the chunks (once they are read) is already parallelized using this straightforward mechanism.
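
For illustration, a rough Go sketch of this channel-based distribution with a constant 8 MiB chunk size (the device path is made up, and SHA-256 hashing just stands in for restic’s real per-chunk processing): all offsets are computed up front from the total size and handed to the workers over a channel.

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
	"sync"
)

const chunkSize = 8 << 20 // constant chunk size: 8 MiB

type job struct {
	offset, length int64
}

func main() {
	f, err := os.Open("/dev/vdb") // hypothetical device
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	size, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		log.Fatal(err)
	}

	// Every chunk offset is known up front from the total size alone.
	work := make(chan job)
	go func() {
		defer close(work)
		for off := int64(0); off < size; off += chunkSize {
			length := int64(chunkSize)
			if off+length > size {
				length = size - off // only the last chunk may be smaller
			}
			work <- job{offset: off, length: length}
		}
	}()

	// n workers pull chunks from the channel, read and process them.
	const workers = 4
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, chunkSize)
			for j := range work {
				if _, err := f.ReadAt(buf[:j.length], j.offset); err != nil && err != io.EOF {
					log.Println(err)
					return
				}
				// Hashing stands in for restic's real per-chunk processing.
				sum := sha256.Sum256(buf[:j.length])
				fmt.Printf("chunk at %d (%d bytes): %x\n", j.offset, j.length, sum[:4])
			}
		}()
	}
	wg.Wait()
}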

Thanks, @alexweiss, good to know.

Okay, we will try to build a POC and prove the theory that parallel reading can increase the backup speed of block devices.

You’ll have to increase the number of backend connections at some point. But it should scale to a few GB/s (no idea what the exact bottleneck is).

That can’t work: the next splitting point returned by a content-defined chunker always depends on the one right before it. Yes, one can offload the search for splitting points onto multiple threads and then let one thread pick the next one. But that would waste quite a bit of computation compared to the current implementation (I’d expect twice the CPU load from chunking before one sees a benefit).

From what I understand, this requires reading from different locations that are far apart from each other? So just statically cutting a file into 8MB chunks (which wouldn’t need a chunking step at all) might not be enough to get the IO throughput benefits?

My current feeling is that a fixed segment size might be somewhat more versatile. It would allow growing a file by appending data without causing the segment boundaries to move.

Besides that, splitting the file into segments of known size and then reading and chunking each segment in parallel is a reasonable approach.

This only provides a benefit if the bottleneck is the chunker and not reading from disk.

It depends on the SDS. Ceph, for example, stores data in 4 MB objects by default and distributes them randomly across disks, so in general 8 MB = 2 objects on different servers. 5 threads would read from 5 disks instead of the 1 disk we get now.

Agree with you.

But I think the block device is a black box for restic. Every time we do a backup, the content inside the block device can be totally different, or the data inside the device can have been redistributed.
But CDC has very good granularity, and we see that deduplication works great even after data has been redistributed inside the device.
I mean that with CDC being this effective, it doesn’t matter much which method we use to determine the offsets or how the geometry of the device has changed (resize).
Even if we lose some efficiency after the offsets change due to a resize of the device, it’s not a big problem I think. It’s not a frequent event.

My point: we should aim for simplicity in the code, and CDC does the job. Maybe I’m wrong :)

I think (but this is just a guess) that block devices do not profit much from CDC compared to fixed chunks (if aligned to the block size), or that CDC might in that case lead to an alignment on block boundaries anyway. The point is that complete blocks are touched on a block device anyway, so block boundaries are the natural points to split chunks. This is especially true if the data is redistributed on the block device.

Besides this, deduplication (i.e. not saving identical chunks) and compression do the main work, and for most use cases the splitting points actually used don’t have a big (or even measurable) influence. If they had, you would see big differences when backing up the same data with different CDC chunker parameters.

@MichaelEischer Hi!

We have finished our work on parallel and native backup of block devices.

Can you take a look?

Some results, which I think are impressive.

Before the OLD and NEW tests I created a new repo like this:

rm -rf /mnt/repo1
restic init --repository-version latest --repo /mnt/repo1
echo 3 > /proc/sys/vm/drop_caches

/mnt is a tmpfs.

OLD-style backup of the block device:

OLD First backup:

dd if=/dev/rbd0 bs=1M iflag=direct | restic -r /mnt/repo1/ --verbose backup --stdin --stdin-filename vol1 --password-file ./pass --no-cache

open repository
repository 63a8595a opened (version 2, compression level auto)
lock repository
load index files
read data from stdin
start scan on [/vol1]
start backup on [/vol1]
scan finished in 0.207s: 1 files, 0 B
4096+0 records in997 GiB, total 1 files 0 B, 0 errors
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 44.2401 s, 97.1 MB/s

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:   2703 new
Tree Blobs:      1 new
Added to the repository: 4.000 GiB (4.000 GiB stored)

processed 1 files, 4.000 GiB in 0:43
snapshot c6a2aef3 saved

OLD Second backup:

dd if=/dev/rbd0 bs=1M iflag=direct | restic -r /mnt/repo1/ --verbose backup --stdin --stdin-filename vol2 --password-file ./pass --no-cache
open repository
repository 63a8595a opened (version 2, compression level auto)
lock repository
load index files
read data from stdin
start scan on [/vol2]
start backup on [/vol2]
scan finished in 0.221s: 1 files, 0 B
4096+0 records in997 GiB, total 1 files 0 B, 0 errors
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 42.2168 s, 102 MB/s

Files:           1 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     0 unmodified
Data Blobs:      0 new
Tree Blobs:      1 new
Added to the repository: 177.088 KiB (92.399 KiB stored)

processed 1 files, 4.000 GiB in 0:41
snapshot a2e7fbb2 saved

NEW style backup:

NEW First backup:

./restic_new backup /dev/rbd0 --read-special --read-concurrency 2 --block-size 512  --password-file ./pass --no-cache -r /mnt/repo1

repository 6cd0d9ea opened (version 2, compression level auto)
no parent snapshot found, will read all files
[0:00]          0 index files loaded

Files:           0 new,     0 changed,     0 unmodified
Dirs:            1 new,     0 changed,     0 unmodified
Added to the repository: 4.000 GiB (4.000 GiB stored)

processed 0 files, 4.000 GiB in 0:25
snapshot 29ccaaf5 saved

NEW Second backup:

./restic_new backup /dev/rbd0 --read-special --read-concurrency 2 --block-size 512  --password-file ./pass --no-cache -r /mnt/repo1
repository 6cd0d9ea opened (version 2, compression level auto)
using parent snapshot 29ccaaf5
[0:00] 100.00%  1 / 1 index files loaded

Files:           0 new,     0 changed,     0 unmodified
Dirs:            0 new,     0 changed,     1 unmodified
Added to the repository: 0 B   (0 B   stored)

processed 0 files, 4.000 GiB in 0:20
snapshot 55642c51 saved

Total:

OLD FIRST: 43s, SECOND: 41s
NEW FIRST: 25s, SECOND: 20s

Some highlights:
Slow disks (<250 MB/s linear read with iodepth 1): significantly faster first backup. The second backup will be faster than the first, but not significantly.
Fast disks (>250 MB/s linear read with iodepth 1): the first backup may not be faster, but the second backup will be significantly faster than the first!

We think this is because the bottleneck is the “repo writer”. We will try to investigate this.

I’ve added a lot of review comments. It might take some time until I can take a look at feedback to those comments.

Thanks for the tests.

Please move those to the PR. I’d like to keep the discussion in one place.

I’d like to see a measurement that just uses --read-special without --block-size. The throughput numbers are low enough that running the chunker in parallel should barely make a difference. My guess is that the iflag=direct for dd is the culprit here, such that you’re actually comparing direct and buffered file IO.

On my system with NVMe to NVMe and a bunch of large files (5 × 1 GB + 1 × 5 GB), restic takes 35 seconds to process 10GB of new data (using up to 700% CPU in the process). The second run takes 15 seconds, all with default settings. So for me it looks like the parallelized chunker doesn’t help much in that setting.

What is “repo writer”? If you’re referring to assembling and writing pack files, yes that requires a lot of CPU. I’m not sure that much can be done about that.

@MichaelEischer thanks for your comments and the discussion about performance!

We will improve our code and I’ll be back with comments about the benchmarks.