Does restic write every file to a pack (locally), on every backup?


#1

If I’ve understood the code correctly, restic will (re)create packs locally in a tmp dir, and then
Backend.Save() will generally short-circuit the upload/save procedure if the pack’s hash already exists in
the repo. So it’s examining every byte of every file being backed up, and writing all of it to temp files, every single time, regardless of whether the file has changed or not. Is this correct?

Do other backup programs generally behave this way, or do they ignore files “older” than the previous backup? I guess to be completely sure you have accurately captured the data on disk, you can’t rely on the file create/modify time, since technically they can be altered.

I’m not saying restic is doing anything wrong or unusual, but I wonder how this compares to other programs like Time Machine, rsync, Backblaze, etc. (I understand you’re not the author of those programs, but maybe someone who knows can comment?)

One thing that jumps to mind is the fact that every “incremental” backup writes 100% of your data to your local disk, which over time might have some impact on SSD write cycles (maybe that’s an outdated concern on modern hardware?).

Thanks in advance for any insight you can provide.


#2

Hi, and welcome to the forum!

Did you discover the design document yet?

That’s not the case, you must’ve misunderstood the code.

On the first backup, restic reads all files and splits them into chunks. The chunks and metadata are then bundled together into so-called “pack files”, which are uploaded to the repo. The pack files are first stored locally in a temporary directory, usually /tmp, which on most Linux systems is a tmpfs, so they are not even written to disk.
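The chunk-level deduplication described above can be sketched roughly like this. This is an illustrative toy, not restic's actual code: it uses fixed-size chunks and an in-memory map, whereas restic uses content-defined chunking and a real repository index.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// saveChunks splits data into fixed-size chunks and returns only those
// whose SHA-256 hash is not already in the index (the deduplication step).
func saveChunks(data []byte, index map[[32]byte]bool, chunkSize int) [][]byte {
	var newChunks [][]byte
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		id := sha256.Sum256(chunk)
		if index[id] {
			continue // chunk already stored: skip it
		}
		index[id] = true
		newChunks = append(newChunks, chunk)
	}
	return newChunks
}

func main() {
	index := make(map[[32]byte]bool)
	first := saveChunks([]byte("aaaabbbbcccc"), index, 4)  // all 3 chunks new
	second := saveChunks([]byte("aaaaddddcccc"), index, 4) // only "dddd" new
	fmt.Println(len(first), len(second))
}
```

Even on a forced re-read, only the chunks missing from the index would actually be saved.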

On a subsequent backup, restic tries to find a previous backup. If none is found, it proceeds in the same way as an initial backup. The only exception is that already stored chunks are not saved again (that’s the deduplication at work). When a previous backup can be found, restic will read the metadata from the previous backup and only read files that have been changed since the last backup.

With restic backup --force, restic can be instructed to re-read all files and ignore the metadata from older snapshots. It will still only upload chunks that are not in the repo yet.

So, does this answer your questions?


#3

Hi Alexander,

I have indeed read the design doc (which is very good, BTW), and understand the basic concepts.

Despite the documentation, I managed to confuse myself by trying to trace what happens from the top down, beginning with cmd_backup.go. I find the pervasive use of channels, splitting jobs, piping and re-grouping jobs to be kind of challenging to understand. That’s not a knock on your code; it’s a complicated problem, and you’ve developed a very robust, performant solution.

I think I understand now that in the case of an existing parent snapshot, you’re walking the filepath in a deterministic order, starting with the tree as it is found in the repo. The sorted, deterministic order is what allows you to detect nodes that have been deleted or added.
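That single-pass comparison of two sorted listings can be sketched as follows. This is a simplification with invented names, not restic's actual tree-walking code, but it shows why the deterministic sort order makes added and deleted nodes easy to spot:

```go
package main

import "fmt"

// diffSorted compares two lexically sorted name lists (the tree as
// recorded in the repo vs. the current directory listing) in a single
// merge-style pass, reporting additions and deletions.
func diffSorted(old, cur []string) (added, removed []string) {
	i, j := 0, 0
	for i < len(old) && j < len(cur) {
		switch {
		case old[i] == cur[j]: // present in both: advance both sides
			i++
			j++
		case old[i] < cur[j]: // in the snapshot, gone from disk
			removed = append(removed, old[i])
			i++
		default: // on disk, not in the snapshot
			added = append(added, cur[j])
			j++
		}
	}
	removed = append(removed, old[i:]...)
	added = append(added, cur[j:]...)
	return added, removed
}

func main() {
	added, removed := diffSorted(
		[]string{"a.txt", "b.txt", "d.txt"},
		[]string{"a.txt", "c.txt", "d.txt"},
	)
	fmt.Println(added, removed) // [c.txt] [b.txt]
}
```

Without the sorted order, detecting deletions would require a lookup structure over one whole tree; with it, one linear merge suffices.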

And as for detecting changes to existing files, I think I found the part that tells us when to re-read the bytes on disk, resulting in potential new blobs, re-packing, etc:

from IsNewer():

if node.ModTime != fi.ModTime() ||
	node.ChangeTime != changeTime(extendedStat) ||
	node.Inode != uint64(inode) ||
	node.Size != size {
	debug.Log("node %v is newer: timestamp, size or inode changed", path)
	return true
}

So restic checks whether any of the modification time, change time (ctime), inode, or size differ from what is recorded in the repo. This makes sense, and seems like the right thing to do.

I can imagine a pathological case where bytes have been changed in a file, but the length remains the same, the inode remains the same, and the user has gone to pains to carefully reset ctime & mtime. I think in this case restic (or any reasonable incremental backup program?) would miss it. Probably a good alternative to having to scan every byte on disk for incrementals!

Thank you for your patience in answering questions. In addition to being great backup software, restic as an open-source project is great for learning.


#4

I agree with you. That part is the oldest part of restic, and badly needs refactoring. It shows clear signs of me learning Go, with all the usual beginner mistakes (among others, too much concurrency and too many channels). I’ll come around to rebuilding that part eventually.

Indeed, that’s exactly the point. The code walks two trees in parallel: the one loaded from the repo, and the one from the local directory to save. The idea I had was that this allows restic to efficiently traverse even very large trees, but I think I over-optimized a bit and now the code is hard to read. sigh :slight_smile:

This is a quite extraordinary situation manually created by the user. restic can safely be used in this case, just pass --force to the backup command so that all files are re-read and re-hashed. Only changed blobs will be uploaded. The default situation (trust the filesystem to get the modification, size and inodes right) works for the vast majority of all users. The others think about this and will probably find --force sooner or later :wink:


#5

I agree with you. That part is the oldest part of restic, and badly needs refactoring. It shows
clear signs of me learning Go, with all the usual beginner mistakes (among others, too much
concurrency and too many channels). I’ll come around to rebuilding that part eventually.

Well, like I said, it wasn’t meant as a criticism of your code. Concurrency is (IMHO) the main
challenge to readability in any non-trivial Go program. It’s still head-and-shoulders above trying to
grok all the async callbacks in even a simple Node.js program. :slight_smile:

This is a quite extraordinary situation manually created by the user. restic can safely be used in
this case, just pass --force to the backup command so that all files are re-read and
re-hashed. Only changed blobs will be uploaded. The default situation (trust the filesystem to get
the modification, size and inodes right) works for the vast majority of all users. The others
think about this and will probably find --force sooner or later :wink:

I agree, that’s why I called it “pathological”. I don’t think it’s worth trying to address this case
during standard restic operation, and the --force option is a completely adequate solution for this rare problem.