Sparse File Support


#1

I am new to Restic, and I just learned Restic doesn’t specifically support sparse files. My backup was calculating ~180GB larger than df showed, and it turned out to be a sparse file causing the difference. Note that, thanks to deduplication, that much data wasn’t actually stored, but the whole file was still read and processed.

I found the sparse files on my system with this command, which excludes the virtual file systems and reports files using less than 1.0 of their allocated space:

# find / -type f -not \( -wholename '/sys/*' -or -wholename '/proc/*' -or -wholename '/dev/*' -or -wholename '/run/*' \) -printf "%S\t%s\t%p\n" | gawk '$1 < 1.0 {print}'

0.75    16384   /etc/openldap/certs/secmod.db
0.5625  65536   /etc/openldap/certs/cert8.db
0.75    16384   /etc/openldap/certs/key3.db
0.542857        286720  /var/lib/rpm/__db.001
2.88798e-07     198560880672    /var/log/lastlog
0.997403        1576960 /var/cache/man/index.db
0.999721        106075056       /usr/lib/locale/locale-archive
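
The same check can be sketched in Python. This is a minimal sketch, not a replacement for the find command: it assumes a Unix system where st_blocks counts 512-byte units, and the function names and the default exclude list are made up for illustration.

```python
import os
import stat

def sparseness(st_blocks, st_size):
    """Allocated bytes divided by apparent size (st_blocks is in 512-byte units)."""
    return (st_blocks * 512) / st_size

def find_sparse(root, threshold=1.0, exclude=("/sys", "/proc", "/dev", "/run")):
    """Return (ratio, size, path) for regular files under root with ratio < threshold."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) not in exclude]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is not accessible
            if not stat.S_ISREG(st.st_mode) or st.st_size == 0:
                continue  # skip non-regular and empty files
            ratio = sparseness(st.st_blocks, st.st_size)
            if ratio < threshold:
                results.append((ratio, st.st_size, path))
    return results
```

Calling find_sparse("/") as root should list roughly the same files as the command above, although small differences in rounding or in which paths are pruned are possible.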

The file /var/log/lastlog is using about 0.0000289% of its apparent size of ~190GB. I did some research, and found it is normal for that file to be sparse. Mine is large because I have UID ranges in /etc/login.defs that start very high.

After I excluded that file, the backup size was normal. However, I really don’t want to restore a server and lose the history of when users last logged in. But without Restic treating this file as sparse, it seems to me it would try to restore all 190GB as a non-sparse file, and the restore would fail anyway.

What are the implications of restoring sparse files when Restic doesn’t actually support them? Would it result in corruption, if the application that created them intended for them to be sparse? Note that the other sparse files on my system appear to be database formats.


#2

That depends on the application. I suspect most would not care, but others might. Sparse files are mostly a disk usage optimization where the application intends to leave holes of nothing that it can backfill with content at a later time. Having those holes actually be present and contain NUL characters or having them absent (sparse) is not a difference to the application. (When a program reads a hole in a sparse file, the read succeeds and returns just NUL characters.)
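
A small sketch of that behavior, with hypothetical helper names: the file is created by seeking past the start and writing only a short tail, and the unwritten range then reads back as NUL bytes whether or not the filesystem stored it as a hole.

```python
def make_file_with_hole(path, hole_size, tail=b"end"):
    """Create a file whose first hole_size bytes are never written."""
    with open(path, "wb") as f:
        f.seek(hole_size)  # skip ahead without writing anything
        f.write(tail)      # the skipped range becomes a hole where supported

def read_range(path, offset, length):
    """Read length bytes starting at offset, exactly as any application would."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Reading any range inside the skipped region returns only zero bytes, so an application that just reads the file cannot tell a hole from stored zeros.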

However, if the application specifically looks at the size or location of holes in the sparse files, this could alter the application’s behavior.

A more important problem is that your disk may not be large enough to fully restore a sparse file, and so the restore process would simply fail.


#3

This will absolutely happen. These VPS instances don’t have that much space, nor do I want to use it up wastefully even if they did. I suppose I could skip that file when restoring.

@fd0 Would it be difficult to support sparse files in Restic?


#4

You may want to subscribe to this issue:


#5

That issue describes many of my own findings. I am subscribed now. Thanks!


#6

You’ll end up with a file of the apparent size filled with zeroes. Which is fine, most of the time, until you run out of disk space.

It won’t be easy:

  • we’d need to add detecting sparse files for all the OS restic supports so that it can skip over the holes in the files
  • we’d need to add restoring sparse files for all the OS restic supports
  • we need to change the repository format to add information where the holes were. I have a hunch that this is possible without introducing a breaking change in the repo format and I have some ideas.

It’s planned to add that eventually, but nobody volunteered to do this for now, so I guess it’s not important enough. Personally, I don’t have a need to support sparse files, at least not for now.


#7

I actually didn’t think I needed it until I started using Restic for full system backups. I don’t use sparse files, but CentOS does in a few cases (haven’t checked Ubuntu). I’m working around it by not backing up /var/log/lastlog. I would guess anyone doing full system backups hasn’t noticed how sparse files are handled, or hasn’t run out of space on a full system restore. For example, my lastlog on my Ubuntu box is only 282K, which would make no difference to me on restore.


#8

[OT Musings] Personally, I’d be inclined to report that upstream as a bug. It’s ridiculous that a lastlog file, sparse or not, reports its size as 199GB. My brand new LinuxFromScratch reports an 8 byte file as 15KB and my Debian install, about 2 years old, reports a 24 byte file as 18MB. Actual:Reserved 8:15,000 -> 24:18,000,000 -> 57,344:198,560,880,672 (reversed from 2.887e-07).

A database it may be, but if it reserves space like that there is clearly a flaw in the design. I wonder what upstream would report is the difference between actual and reported sizes as the file grows. If it doesn’t eventually flatten out (and 199GB suggests it doesn’t), that is more evidence that the design is flawed. Why is it using a database engine? There is no point to having a database that has very limited expression parsing and search capability, yet the “overhead” is billions of bytes of reported size that causes problems with restore software. We users would be better served with a text file and an internal grep. [end-musing]


#9

As far as I remember the lastlog file is a “database” only in the sense that it uses the sparse file feature of Linux file systems explicitly for indexing purposes. It works roughly like this: If the user with ID 1000 logs in, it writes to /var/log/lastlog at offset $UID*200 (assuming the record size for the entry for each user is 200 bytes). If the file does not contain any data before offset 200000, the file is made sparse and has a hole. When the UID is very large, the hole will be huge, so you’ll end up with an apparently huge file containing just a few bytes of information.
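
A minimal sketch of that indexing scheme, assuming the 200-byte record size used as an example above (the real lastlog record layout may differ) and hypothetical helper names:

```python
import os

RECORD_SIZE = 200  # bytes per user; an assumed figure, not the real layout

def record_offset(uid, record_size=RECORD_SIZE):
    """Byte offset at which the record for this UID is stored."""
    return uid * record_size

def write_record(path, uid, payload, record_size=RECORD_SIZE):
    """Write one fixed-size record at its UID-derived offset."""
    if len(payload) > record_size:
        raise ValueError("payload does not fit into one record")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Everything before the offset that was never written stays a hole
        # on filesystems that support sparse files.
        os.pwrite(fd, payload.ljust(record_size, b"\0"), record_offset(uid, record_size))
    finally:
        os.close(fd)
```

With these assumptions, a single login by UID 1000 lands at offset 200000 and produces a file with an apparent size of 200200 bytes, almost all of which is hole.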

I suppose at the time lastlog was written this was seen as an acceptable trade-off :wink:


#10

@Nick_C, I agree with you. Practically everything else is just a flat file that gets grepped or parsed, including /etc/passwd for UID mappings, so it’s surprising that particular feature needs a “database” interface.

Back before support for 32-bit UID and GID numbers was added, around the year 2000, the maximum allocation would have been ~13MB. We had multi-GB drives then, so you are probably right regarding the trade-off.
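
The ~13MB figure follows from 65,536 possible 16-bit UIDs times the 200-byte record size assumed earlier in the thread. A quick check, with the record size as an assumption rather than the real on-disk layout:

```python
def max_lastlog_size(uid_bits, record_size=200):
    """Apparent file size once the highest possible UID has logged in."""
    return (2 ** uid_bits) * record_size

size_16 = max_lastlog_size(16)  # 13,107,200 bytes, about 13 MB
size_32 = max_lastlog_size(32)  # 858,993,459,200 bytes, about 859 GB
```

So the same design that topped out at a few megabytes with 16-bit UIDs can report hundreds of gigabytes with 32-bit UIDs.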


#11

For sparse files the holes don’t need to be saved or reproduced exactly. We just need any large blocks of zeros to be eliminated so it doesn’t blow up the disk space. I would assume the chunking code would start generating a known set of checksums after a while when given gigabytes of zeros in the middle of the file. Perhaps the extraction code would have a special case for some checksums that are known to be blocks of zeros and just do a seek rather than looking up the data.
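
A rough sketch of the extraction side of this idea. It is not Restic code: the function name is hypothetical, the chunks are passed in as plain byte strings, and zero chunks are detected by inspecting their content rather than by checksum.

```python
import os

def restore_sparse(path, chunks):
    """Write chunks in order, seeking over chunks that contain only zero bytes."""
    with open(path, "wb") as out:
        for chunk in chunks:
            if chunk.count(0) == len(chunk):
                # All zeros: advance the position instead of writing data.
                out.seek(len(chunk), os.SEEK_CUR)
            else:
                out.write(chunk)
        # If the file ended on a skipped chunk, extend it to the full length.
        out.truncate(out.tell())
```

The restored file has the same length and the same bytes as a plain write would produce. Only the on-disk allocation differs, and only on filesystems that support holes.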


#12

That is a very interesting idea! Indeed, a block of zeroes of sufficient size (> 512KiB or so) will always generate the same ID (which is just the SHA256 hash of that number of zeroes), so we could maybe add code to the restorer which (on demand) creates sparse files when it sees a sequence of this ID in a file. That’d even work without modifying the data structures in the repo, so it’s backwards compatible…
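
To make the idea concrete, here is a sketch under simplifying assumptions: a chunk ID is taken to be the hex SHA-256 of the chunk content, a file is described as a list of (id, length) pairs, and all names are hypothetical rather than taken from Restic.

```python
import hashlib

def chunk_id(data):
    """ID of a chunk, assumed here to be the SHA-256 of its content."""
    return hashlib.sha256(data).hexdigest()

def zero_chunk_id(length):
    """ID that any chunk consisting of `length` zero bytes must have."""
    return hashlib.sha256(bytes(length)).hexdigest()

def zero_runs(chunks, zero_ids):
    """Return merged (offset, length) ranges covered by known all-zero chunks.

    chunks is a list of (id, length) pairs in file order; no chunk data is read.
    """
    runs = []
    offset = 0
    for cid, length in chunks:
        if cid in zero_ids:
            if runs and runs[-1][0] + runs[-1][1] == offset:
                prev_offset, prev_length = runs[-1]
                runs[-1] = (prev_offset, prev_length + length)  # extend the run
            else:
                runs.append((offset, length))
        offset += length
    return runs
```

A restorer could precompute zero_chunk_id for the chunk sizes it expects and turn the resulting ranges into seeks, without fetching those chunks at all.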


#13

Then the follow-up optimization is on the reader side. After you have seen the zero hash a couple of times, you can switch to a special reader mode that queries the OS for the sparse regions of the file so they can be skipped quickly.

On Linux it appears there is a special ioctl that works on some filesystems: https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt
Not ideal, but better than using the ext2utils.

This isn’t necessary, but I suspect it might be a lot faster on filesystems that support it. Again no archive format changes required.
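
A sketch of such a reader mode. Note that it does not use the FIEMAP ioctl mentioned above. It swaps in lseek with SEEK_DATA and SEEK_HOLE, which is easier to call from Python and serves the same purpose here. It assumes Linux and a hypothetical function name.

```python
import errno
import os

def iter_data(path, block_size=1 << 20):
    """Yield (offset, bytes) for the regions the kernel reports as data."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole between offset and end of file
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)  # end of file counts as a hole
            if end <= start:
                break  # defensive: the file changed underneath us
            pos = start
            while pos < end:
                block = os.pread(fd, min(block_size, end - pos), pos)
                if not block:
                    return  # file was truncated while reading
                yield pos, block
                pos += len(block)
            offset = end
    finally:
        os.close(fd)
```

On filesystems without hole reporting the whole file is returned as one data region, so the reader falls back to processing everything.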


#14

Would this solution create new sparse files on restore, if non-sparse files with long runs of zeros exist in the backup?


#15

Note that this approach of restoring runs of zeros as holes would break applications that rely on SEEK_HOLE, as the holes in sparse files would not be reliably reproduced. I don’t think this would be a good approach as it would break expectations.

Sparse file support should be either absent, or correct. “Faking it” is IMO an unacceptable approach for backup software, which above all else should reliably restore your files exactly as they were.


#16

Good question, I don’t have an answer for that, it depends on the implementation.

Can you please elaborate on that? I don’t have much experience with sparse files (besides simulating files filled with zeroes).

Are you aware of any applications depending on the exact size and location of the holes in the file?


#17

SEEK_HOLE is a seek target that asks the kernel to seek to the beginning of the next “hole” at or after the given offset (if the offset already lies in the middle of a hole, the file position is simply set to that offset).

http://man7.org/linux/man-pages/man2/lseek.2.html
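
A minimal sketch of how a program would ask for the next hole, assuming Linux and a hypothetical helper name. Per the man page, the end of the file always counts as a hole, so a file without any holes reports its own size.

```python
import os

def next_hole(path, offset):
    """Return the offset of the first hole at or after `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, offset, os.SEEK_HOLE)
    finally:
        os.close(fd)
```

For a file that was restored with its holes filled in, this call returns the file size for every valid offset, which is exactly the information an application relying on hole positions would lose.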

Not specifically, but the presence of SEEK_HOLE would indicate that some programs likely use this functionality.

To elaborate on my “all or nothing” approach, programs are likely written so that they don’t depend on holes being there – otherwise, they would not work on filesystems without sparse file support. (That is, they should be able to function if a file was copied to and from a filesystem that doesn’t support sparse files, or restored from a backup.)

However, it is probably a very reasonable assumption on the part of many developers that if there are holes, the holes are where the application put them, otherwise SEEK_HOLE would serve no purpose. Restic’s chunking approach means that it’s virtually guaranteed that the holes will not be in the same place.

  • Too-small holes won’t get their own chunk and the hole won’t be restored.
  • Chunks on the border of a hole, particularly at the beginning of a hole, are likely to be chunked such that some number of zeros are part of a chunk that isn’t all zeroes, causing the restored hole to start too late in the file.
  • There may be a run of zeros in a sparse file that wasn’t a hole, and making that section a hole could confuse the application.
  • To a lesser degree, putting holes in a file that wasn’t originally sparse could pose some problems. In particular, it can lead to more file fragmentation when that hole is later filled, as well as an optimistic reporting of free space – as soon as those holes are filled with data, additional volume space is consumed. If a sysadmin is not expecting files to be sparse in the first place, this could cause problems down the line following a restore operation where the sysadmin thought they had more free disk space than they really did. (In other words, having a run of allocated zeros not only reduces fragmentation, but reserves the disk space.)
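
The first two points can be illustrated with a toy model. This sketch uses fixed-size chunks instead of Restic’s content-defined chunking (an assumption made for brevity) and a hypothetical function that guesses holes the way a zero-chunk heuristic would.

```python
def guessed_holes(data, chunk_size):
    """Return merged (start, end) ranges of chunks that consist only of zeros."""
    holes = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if chunk.count(0) != len(chunk):
            continue  # a chunk with any non-zero byte is restored as data
        end = start + len(chunk)
        if holes and holes[-1][1] == start:
            holes[-1] = (holes[-1][0], end)  # extend the previous range
        else:
            holes.append((start, end))
    return holes

# 100 data bytes, a 300-byte run of zeros at [100, 400), then 50 data bytes.
large_gap = b"x" * 100 + bytes(300) + b"y" * 50
# Same layout, but the run of zeros is only 32 bytes long.
small_gap = b"x" * 100 + bytes(32) + b"y" * 100
```

With 64-byte chunks, guessed_holes(large_gap, 64) yields [(128, 384)], so the guessed hole starts later and ends earlier than the real run of zeros at [100, 400). guessed_holes(small_gap, 64) yields an empty list because the short run never fills a whole chunk.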

It does not seem like guessing where holes should be put is a good idea. If we want to implement this hack, I would suggest making it opt-in with a flag, and giving ample warnings in the documentation that this flag could break applications.

I submit that it would be a better use of time and resources to work towards a patch that adds real support for sparse files, rather than investing in a hack that has the potential to restore effectively corrupt data (from the perspective of an application).