Backup using cephfs extended recursive attributes

Hi all,

First, I want to congratulate everyone who makes restic possible! (It’s awesome!)

I would like to share my use case so you can advise me. We are evaluating restic for backing up a huge volume of data (hundreds of millions of files, hundreds of TB in size) stored in cephfs, with our S3 service as the backup target.

Ceph folders have recursive attributes that tell you, for example, the timestamp of the last modification of any file or folder under the selected folder. Our idea was to take advantage of this information and skip entire unmodified branches of the directory tree while descending, to speed up the backup process.

The first approach was to develop a tool that traverses the folder and creates a file listing the subfolders and files to back up, and then to pass this file to restic using the --files-from parameter. This worked fine and the backup speed improved dramatically. The problem is that mounting or restoring the snapshot only shows/restores the files/folders backed up in that snapshot, not the incremental view of the folder.
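A minimal sketch of such a traversal tool, not the tool actually used here. It assumes the recursive timestamp is exposed as the CephFS virtual xattr `ceph.dir.rctime` (formatted as seconds, a dot, then a fractional part) and that the time of the previous backup is known in whole seconds. The names `read_rctime` and `changed_paths` are hypothetical, and the timestamp reader is passed in as a parameter so the pruning logic can be tried without a CephFS mount.

```python
import os
from typing import Callable, Iterator


def read_rctime(path: str) -> int:
    """Return the recursive ctime of a directory in whole seconds.

    Assumption: the mount exposes the virtual xattr "ceph.dir.rctime"
    as "<seconds>.<fraction>". os.getxattr is only available on Linux.
    The fractional part is ignored on purpose.
    """
    raw = os.getxattr(path, "ceph.dir.rctime").decode()
    return int(raw.split(".")[0])


def changed_paths(
    root: str,
    last_backup: int,
    rtime: Callable[[str], int] = read_rctime,
) -> Iterator[str]:
    """Yield files changed since last_backup, skipping unchanged branches."""
    if rtime(root) < last_backup:
        return  # nothing below this directory changed since the last backup
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # Descend only if needed; the check happens at the top of the call.
            yield from changed_paths(entry.path, last_backup, rtime)
        elif entry.stat(follow_symlinks=False).st_ctime >= last_backup:
            yield entry.path
```

Writing the yielded paths to a file, one per line, gives a list that can be passed to restic with --files-from. The comparisons use `>=` on whole seconds so that a change in the same second as the last backup is included rather than missed.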

After that, I tried the opposite approach: creating an exclude file and passing it to restic using the --exclude-from parameter, but I have exactly the same problem. This approach is better because when I run restic snapshots, the snapshot path is shown like a normal snapshot, while in the first approach you see all the files in one block, which makes things more confusing. I also tried the --parent option, but it seems that it does not work as I expected.

Is this behavior with --files-from / --exclude-from expected?

Any suggestions for another solution to this specific use case? Could this be a feature? And what exactly is the --parent option for?

Thank you in advance!!

Roberto

It sounds like accommodating this would be quite the design overhaul, and it would have various trade-offs (some good, some bad). Perhaps instead of that, I think this:

might be a better approach to solving the problem. What I mean is that instead of altering restic’s fundamental backup paradigm/architecture, we find a way to optimize traversal of trees that are unmodified. That way you don’t have to fiddle with --files-from being different for each backup and it would be faster.

HOW to do that, I’m not entirely sure – in my experience restic is pretty fast in determining when a file is unmodified; but I don’t use ceph either…

That’s expected, since you gave restic only a tiny subset of your data (the files/folders that have changed).

When you tell restic to save only a specific set of files/dirs, that will be contained in the snapshot. Nothing more :slight_smile: That’s expected.

Yes: the best way forward would be to figure out a way to present the files/folders to restic in a way so that it can determine itself if a file has been changed or not.

First, it needs a previous snapshot of exactly the same set of files/folders, so make sure the paths don’t change in between runs. Sometimes this happens with file-system snapshots: with zfs, for example, you have /mnt/zfs-snapshot-20181009200100 or so, and the path is different for every snapshot. That makes restic think it should back up completely new data, so it will read and hash everything, only to discover later that all data is already in the repo. When the paths are constant, restic automatically finds the latest parent snapshot and will print a line similar to using parent snapshot <id>.

Next, it compares the files/folders to save with the data in the parent snapshot. For each file, it’ll look at:

  • The file type
  • The modification time (mtime)
  • The file size
  • The inode

If any of these attributes have changed, it’ll re-read and re-hash the file, which is expensive and takes a long time. If you can mount cephfs in a way so that it reports unmodified files to restic as unchanged via the file system attributes mentioned above, then restic will not re-read those files and will probably be much faster. You can use the command stat to read all these attributes.
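The comparison described above can be sketched roughly as follows. This only illustrates the idea and is not restic’s actual implementation (restic is written in Go); `FileMeta`, `read_meta` and `needs_reread` are hypothetical names, and the previous metadata is assumed to come from the parent snapshot.

```python
import os
import stat
from typing import NamedTuple, Optional


class FileMeta(NamedTuple):
    """The four attributes compared against the parent snapshot."""
    file_type: int  # file-type bits of st_mode (regular file, dir, symlink, ...)
    mtime_ns: int   # modification time in nanoseconds
    size: int       # size in bytes
    inode: int      # inode number


def read_meta(path: str) -> FileMeta:
    """Collect the attributes with lstat, i.e. without following symlinks."""
    st = os.lstat(path)
    return FileMeta(stat.S_IFMT(st.st_mode), st.st_mtime_ns, st.st_size, st.st_ino)


def needs_reread(previous: Optional[FileMeta], current: FileMeta) -> bool:
    """True if the file is new or any of the four attributes differs."""
    return previous is None or previous != current
```

If a mounted file system reports a different inode or mtime for a file that has not actually changed, `needs_reread` returns True for it on every run, which is the expensive case described above.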

Thank you both for your suggestions!

Yes, it is! :smiley: I also use restic for personal backups and I’m really happy with it. In our use case we have two handicaps: one is the volume of the data, and the other is that it is a mounted file system. This is why we are investigating fancy solutions.

Also with --exclude-from? Because in this case the path is the same (or at least it is shown like that in the snapshot view).

Now I think it works like that: stat is implemented in ceph-fuse, so restic should be doing its job. The advantage of the extended attributes in cephfs is that if a folder is not newer than the last snapshot, it is safe to skip everything under it, because the attribute is updated whenever a file is modified at any level inside.

I understand that this is a very specific use case and hard to integrate into standard procedures, but I really appreciate your answers!

Best,

Yes, also that: In this case you’re telling restic “backup this directory, but leave out these parts” and it won’t look at the files/folders which match exclude patterns.

By the way, you can see if restic decides to re-read files by looking at the stats printed at the end of the backup run (compare “unmodified” vs. “changed”), and if you turn on the really verbose output with -v -v, then it’ll tell you for each file what it did.
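For scripted checks, those end-of-run stats can be pulled out of the captured output. A small sketch, assuming summary lines of the form `Files: N new, N changed, N unmodified` and `Dirs: N new, N changed, N unmodified`; `parse_summary` is a hypothetical helper, not part of restic.

```python
import re

# Matches the "Files:" and "Dirs:" summary lines; extra padding is tolerated.
SUMMARY_RE = re.compile(
    r"^(Files|Dirs):\s+(\d+) new,\s+(\d+) changed,\s+(\d+) unmodified\s*$",
    re.MULTILINE,
)


def parse_summary(output: str) -> dict:
    """Return {"files": {...}, "dirs": {...}} with new/changed/unmodified counts."""
    counts = {}
    for kind, new, changed, unmodified in SUMMARY_RE.findall(output):
        counts[kind.lower()] = {
            "new": int(new),
            "changed": int(changed),
            "unmodified": int(unmodified),
        }
    return counts
```

A large "changed" count for files on a data set that was barely touched would suggest that the attributes reported by the file system differ between runs.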

I did a short test backing up the Linux kernel sources:

  • full backup:

Files: 61436 new, 0 changed, 0 unmodified
Dirs: 2 new, 0 changed, 0 unmodified
Added to the repo: 848.493 MiB

processed 61436 files, 850.107 MiB in 20:52

  • And after modifying one file in the folder:

Files: 1 new, 0 changed, 61436 unmodified
Dirs: 0 new, 2 changed, 0 unmodified
Added to the repo: 2.009 KiB

processed 61437 files, 850.107 MiB in 11:38
snapshot 6eef9f50 saved

So I guess that restic is doing its job well. (By the way, why does it say there are only 2 new dirs when there are ~4k inside the folder?)

Thank you again!!

Because of this bug: