Friendlier --read-data-subset checks

damoclark · October 5, 2024, 11:18am

Hi All,

I’d like to propose a new feature in relation to the --read-data-subset option of the check command. And I would like to seek some input and feedback from the broader restic community before creating a github issue (and potentially submitting a PR after discussion with the dev team).

The current feature is great for checking a repository in increments, particularly a large one. While this is very flexible in defining how to achieve full coverage of the repository, it’s not very friendly to the casual user who just wants to check their repo routinely and aim for full coverage of the repo over a given period of time. This is especially true when coupled with the challenges of aligning regular repository checks with scheduling tools such as cron. It’s just not as easy as it could be.

As an example, if I want a repository to be checked in its entirety every week, and execute a subset check daily, then using cron, I need to do something like the following:


00 0 * * * restic -r <repo> check --read-data-subset=`date +\%u`/7

Note, if you are like me, you waste time forgetting that percent is a meta-character, at least in some cron implementations, and for literal use such as with the date command, it needs to be escaped with a backslash.

For a larger repository, I might resort to daily executions with full coverage over a month, which I might add is imprecise using this method due to months having inconsistent numbers of days. A more elaborate method is necessary for this situation, and others.


00 0 * * * restic -r <repo> check --read-data-subset=`date +\%d`/31

Choosing 28 as the total, to ensure full coverage of the repository each month no matter which month of the year, does not work because on the 29th the value 29/28 is invalid input for restic:


Fatal: check flag --read-data-subset=n/t values must be positive integers, and n <= t, e.g. --read-data-subset=1/2

But using 31 to avoid this issue has the downside of missing 1, 2 or 3 subsets on shorter months with 28, 29 or 30 days.
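To illustrate the kind of "more elaborate method" this needs today, here is a rough sketch that uses the number of days in the current month as the total, so the check on the last day of the month is always subset t of t. The function names are made up for this example and are not part of restic. It assumes that, for a fixed total, subsets 1 to t together cover the whole repository and that the repository does not change during the month.

```shell
# days_in_month YEAR MONTH -> prints the number of days in that month
# (Gregorian calendar; MONTH may have a leading zero, as from `date +%m`)
days_in_month() {
    year=$1
    month=${2#0}
    case "$month" in
        4|6|9|11) echo 30 ;;
        2)
            # Leap year: divisible by 4, except centuries not divisible by 400
            if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
                echo 29
            else
                echo 28
            fi ;;
        *) echo 31 ;;
    esac
}

# monthly_subset YEAR MONTH DAY -> prints "n/t" with n = day, t = days in month
monthly_subset() {
    echo "${3#0}/$(days_in_month "$1" "$2")"
}

# Possible use inside a wrapper script (no cron % escaping needed there):
#   restic -r <repo> check --read-data-subset="$(monthly_subset $(date +'%Y %m %d'))"
```

This is exactly the sort of boilerplate the proposal aims to make unnecessary.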

In the spirit of:

Make the simple things easy, and the complicated possible.

What I propose is to add new syntax to the existing --read-data-subset option that makes it easy for a user to perform routine checks of their repositories. So in addition to the existing syntax of subset-number/total-subsets, augment this model with check-execution-frequency/check-coverage-frequency. Here are some concrete examples of what I mean, scheduled with cron:


00 0 * * * restic -r <repo> check --read-data-subset=`date +\%u`/7
# Becomes:
00 0 * * * restic -r <repo> check --read-data-subset=daily/weekly


00 0 * * * restic -r <repo> check --read-data-subset=`date +\%d`/31
# Becomes:
00 0 * * * restic -r <repo> check --read-data-subset=daily/monthly


00 0 1 * * restic -r <repo> check --read-data-subset=`date +\%m`/12
# Becomes:
00 0 1 * * restic -r <repo> check --read-data-subset=monthly/yearly

Plus other combinations that might be useful but difficult without writing elaborate scripts.


00 0 * * mon restic -r <repo> check --read-data-subset=weekly/monthly

00 0 * * * restic -r <repo> check --read-data-subset=daily/2weekly

00 0 * * mon restic -r <repo> check --read-data-subset=weekly/3monthly

00 0 1 * * restic -r <repo> check --read-data-subset=monthly/6monthly

Can you infer the execution frequency and full coverage frequency for the above examples?

The idea is to unburden users with writing code to calculate the correct combination of x/y according to their preferred schedule, and let Restic figure it out for them, according to simple human-expressible frequencies.
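For the two pairs whose totals never change, the translation is small enough to sketch in a few lines of shell. This is only an illustration of the idea, not a proposed implementation: named_subset is a made-up helper, and anything it does not recognise is passed through unchanged so that restic itself validates it.

```shell
# named_subset SPEC WEEKDAY MONTH -> prints the value for --read-data-subset
#   SPEC    daily/weekly, monthly/yearly, or any existing restic value
#   WEEKDAY ISO weekday 1..7, as printed by `date +%u`
#   MONTH   month 1..12, as printed by `date +%m` (leading zero allowed)
named_subset() {
    case "$1" in
        daily/weekly)   echo "$2/7" ;;        # one subset per weekday
        monthly/yearly) echo "${3#0}/12" ;;   # one subset per calendar month
        *)              echo "$1" ;;          # e.g. 3/7 or 20%: leave untouched
    esac
}

# Possible use inside a wrapper script:
#   restic -r <repo> check \
#     --read-data-subset="$(named_subset daily/weekly "$(date +%u)" "$(date +%m)")"
```

Pairs such as daily/monthly or weekly/3monthly need real calendar arithmetic, which is where doing this inside restic would save users the most effort.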

So my broader questions to the community are:

Have you not implemented regular data check of your repo because of this complexity?
Do you use the random selection method (i.e. --read-data-subset=20%) because it’s too difficult to figure out deterministic checking?
Or otherwise, would this feature be useful to you?

On the assumption that there is value in investigating this, I propose the following implementation. I’m interested to hear what the developers think.

When processing values provided to the --read-data-subset option to the check command, detect the following predefined values, and based on the current date, convert them to their matching x/y counterparts, and continue execution as if the x/y values were provided on the command line.

Frequencies
daily
weekly
monthly
yearly

Where the full coverage frequency can include an optional multiplier, which, when omitted, is assumed to be 1. For example:

7daily = every week

2weekly = every fortnight

3monthly = every 3 months

2yearly = every 2 years

1yearly = yearly = every year
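As a rough sketch of the input parsing involved, the following splits a coverage frequency into its multiplier and unit. It is written in shell purely for illustration (the real thing would live in restic's Go code), and parse_frequency is a hypothetical name. Rejecting a multiplier of zero or one with leading zeros is an assumption of this sketch, not something specified above.

```shell
# parse_frequency SPEC -> prints "MULTIPLIER UNIT", or returns 1 if SPEC is invalid
parse_frequency() {
    spec=$1
    unit=${spec##*[0-9]}     # drop everything up to and including the last digit
    mult=${spec%"$unit"}     # whatever preceded the unit (may be empty)

    case "$unit" in
        daily|weekly|monthly|yearly) ;;
        *) return 1 ;;
    esac

    case "$mult" in
        '')            mult=1 ;;     # omitted multiplier means 1
        0*|*[!0-9]*)   return 1 ;;   # zero, leading zeros, or non-digits
    esac

    echo "$mult $unit"
}

# Examples:
#   parse_frequency 3monthly   prints "3 monthly"
#   parse_frequency yearly     prints "1 yearly"
```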

I am not considering multipliers for the check execution frequency value because most scheduling tools don’t support such granular execution schedules. For example, using cron, it is not easy to execute a command 2weekly (fortnightly).

So this approach has the benefit of being simple (input parsing and translation). I would be prepared to give this a go, despite having no go programming experience (I’m sure there was a potential pun there somewhere). I have an algorithm that I am happy to share in a technical post, for discussion and feedback from developers.

An alternate approach, which is not my preferred option, is to support syntax of the form --read-data-subset=/y, where y is the number of check executions necessary for full coverage, and an execution counter in the repository config file is incremented upon each completed check invocation. So the repository would track the last subset check that was completed, so it knows which subset to check on the next, and each subsequent, execution. The repo is exclusively locked during a check, and so the config file could be written to with the incremented subset upon successful completion of a subset check.

This type of solution might be useful when checks aren’t performed on a fixed routine, but rather at opportunistic times, such as when a repository is detected as quiescent. It would also have the benefit of not missing any subsets, due to execution scheduling failure or interrupted checks. The missed subset would simply be checked on the next invocation of the check.

Still, this approach would be a little more invasive, and only worthwhile if there were sufficient interest.

Thoughts?

fede · October 5, 2024, 1:59pm

Hello! Great post

Yes, I am in that situation. I manually run a check of 1 or 10% of certain backups and manually verify the reports.
I am just trying to create a script that simplifies the check, sending of the report, etc…
Yes. Just to simplify
Yes!

kapitainsky · October 5, 2024, 7:42pm

How would it work when the backup is not run on an exactly daily schedule? What if some backups are missed (computer not in use, for example) or multiple backups run on some days?

Re your questions:

I do not find it complex at all. It is actually very simple and flexible IMO:)
No. I use n/m and persist the n value in an extra config file (incremented after every successful check run) used by my wrapper. Initially I used the unix epoch day number modulo m, but it had the disadvantage of missing some checks when backups were skipped, or running the same check multiple times. I am happy to share my few lines of bash if anybody is interested.
Not immediately. Depends if it solves any problem I have.
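For readers wondering what a wrapper along those lines might look like, here is a minimal sketch of the persisted-counter idea. It is not the actual script mentioned above: the function names and the state file are assumptions for illustration, and the state file is a plain local file kept outside the repository.

```shell
# next_subset STATE_FILE M -> prints the subset after the last recorded one
# (1 if nothing has been recorded yet, wrapping back to 1 after M)
next_subset() {
    last=0
    if [ -r "$1" ]; then
        read -r last < "$1" || true
    fi
    last=${last:-0}
    echo $(( last % $2 + 1 ))
}

# run_check REPO STATE_FILE M
# Runs one subset check and records it only if restic exits successfully,
# so a failed or interrupted run is retried with the same subset next time.
run_check() {
    n=$(next_subset "$2" "$3")
    restic -r "$1" check --read-data-subset="$n/$3" && echo "$n" > "$2"
}

# Possible use:
#   run_check <repo> "$HOME/.restic-check-subset" 7
```

Because the counter advances per successful run rather than per calendar day, skipped days do not leave gaps and extra runs do not repeat a subset.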

damoclark · October 6, 2024, 2:36am

From @kapitainsky and @fede - two diametric responses. And two broad groups of users that my proposed idea seeks to support.

Hi @fede and thank-you for your feedback.

Hi @kapitainsky

Thank-you for taking the time to critique my idea. You raise some very important points, which I would like to respond to.

I agree it is very flexible. Restic’s current functionality supports the second part of my guiding philosophy well:

Make the simple things easy, and the complicated possible.

With the current n/m model, there are endless implementations for repo checking. But for the simple act of deterministically checking your entire repository over a given time period, to have to resort to scripting, I would not classify as “easy”. At least, not for the less tech-savvy users. And for technically capable people like me, at the time I was left wondering, “Why can’t Restic just figure this out for me? It could be easier than this.”

With Restic-integrated solutions that cover the simple cases, it is also cross-platform. Save the script sharing for the complicated cases, supported by those with the technical skills to implement them.

But in service to the above philosophy, my proposal retains the n/m semantics - it doesn’t impede your current practice. What I am aiming for is to:

Make it possible for users without scripting skills to perform simple routine checks of their repos;
Make it easy for users with scripting skills, to perform simple routine checks of their repos, without having to write scripts; and,
Retain the existing powerful and flexible functionality, for more complicated scenarios, and continue supporting current practices such as yours.

I love “what-if” questions. This is a very good point.

damoclark:

An alternate approach, which is not my preferred option, is to support syntax of the form --read-data-subset=/y, where y is the number of check executions necessary for full coverage, and an execution counter in the repository config file is incremented upon each completed check invocation. So the repository would track the last subset check that was completed, so it knows which subset to check on the next, and each subsequent, execution. The repo is exclusively locked during a check, and so the config file could be written to with the incremented subset upon successful completion of a subset check.

This type of solution might be useful when checks aren’t performed on a fixed routine, but rather at opportunistic times, such as when a repository is detected as quiescent. It would also have the benefit of not missing any subsets, due to execution scheduling failure or interrupted checks. The missed subset would simply be checked on the next invocation of the check.

Still, this approach would be a little more invasive, and only worthwhile if there were sufficient interest.

Perhaps there might be sufficient interest in implementing this functionality after all. But not my preferred option - only because, as a complete Go newbie, I wouldn’t attempt a PR myself. Would be more work for the dev team to review than for a more capable Go programmer to do themselves.

But I think you have validated, based on your existing solution, and your “what-if” question, that something like this might be worth considering. @kapitainsky, do you have any suggestions on how this idea could be improved?

This PR I could do as a Go beginner, get done in a reasonably timely manner, and not become too much of a burden on the reviewers. And it might be a “good enough” start for many people, accepting your “what-if” caveats.

Thanks again for engaging in this discussion.

fede · October 6, 2024, 9:57am

Which one are you using?

fede · October 6, 2024, 9:59am

Maybe because I’m more like a newbie?

That is to say, for beginners or people who don’t know much about scripts, your proposal might be more useful.

In any case @kapitainsky provides very good arguments.

kapitainsky · October 6, 2024, 9:59am

My own. It is a simple bash script.

fede · October 6, 2024, 10:02am

Nice!

I am happy to share my few lines of bash if anybody is interested.

I only work on Windows but if you can share the code that would be great.

kapitainsky · October 6, 2024, 10:06am

I will post some examples in recipes category. Let’s keep this thread focused on @damoclark enhancement discussion.

damoclark · October 7, 2024, 12:43am

The feature request is to help groups 1 and 2 below, but not hinder group 3. Win/win/win.

MichaelEischer · October 21, 2024, 7:19pm

There’s currently no place to store such information and I doubt that the config file is the right place for it. We cannot guarantee on all backends that config file updates happen atomically, that is there would be the risk that an interrupted network connection breaks the repository. So, at least with the current repo format, regular changes of the config files are too risky.

MichaelEischer · October 21, 2024, 7:22pm

Well, I once had the misfortune of implementing a wrapper script that checks a different part of the repository each week. That took a while to get right. So having a --read-data-subset=week/3month would have made things a lot easier. As that uses a calendar-based schedule, it’s probably not suitable if the check runs on a user’s device. But for backups of servers, it would definitely work.

alexweiss · October 21, 2024, 11:17pm

Just wanted to mention that I implemented this here and share the conclusions I made for this extension:

For n, convenient names like hourly, daily, weekly, monthly are a good and helpful extension and pretty easy to implement (using standard time functions; remember to apply % m afterwards).
Having a state for n (like counting up to a given number for repeated check calls) is difficult and I agree with @MichaelEischer that the config file is for sure not a good place. So, IMO this is something which should be left outside of restic. Environments, where it is important that no specific run is omitted, usually would anyway use their specific scheduling tool which is able to set an appropriate value for n (like the number of the day of the supposed run instead of the number of the day where the run actually is started).
For m, the value to use obviously depends on the choice of n (e.g. the number of weeks per year differs from the number of days per year). Moreover, in many cases the values are constant (like number of hours per day or number of weeks per year) or nearly constant (number of days per month), so this is just a convenience and could also just be set by the user, e.g. hourly/24 or daily/28. So one could argue that there is not much value in adding friendlier m values.
Using constants for m should IMO also be used by users who want an individual duration, e.g. use weekly/12 if you want to check weekly such that after 3 months (which is a bit more than 12 weeks) a full check cycle is accomplished.
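To make the "named n, constant m" combination concrete, here is a small sketch. It is an assumption about how such a scheme could work, not the actual implementation referred to above: unit_subset is a made-up name, and it simply counts whole units since the Unix epoch and applies % m afterwards.

```shell
# unit_subset UNIT M EPOCH_SECONDS -> prints "n/M"
# n is the number of whole UNITs since the Unix epoch, modulo M, shifted to 1..M.
# Counts are in UTC; weeks roll over on Thursdays, since 1 January 1970 was a Thursday.
unit_subset() {
    case "$1" in
        hourly) secs=3600 ;;
        daily)  secs=86400 ;;
        weekly) secs=604800 ;;
        *) echo "unsupported unit: $1" >&2; return 1 ;;
    esac
    echo "$(( $3 / secs % $2 + 1 ))/$2"
}

# Possible use:
#   restic -r <repo> check --read-data-subset="$(unit_subset weekly 12 "$(date +%s)")"
```

Monthly is left out here because months are not a fixed number of seconds; it would need calendar functions instead.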

damoclark · October 22, 2024, 11:09am

Thank-you Alex for sharing what you learnt.

Am I to understand, based on the commit timestamps, you implemented this feature based on my contribution in this forum post?

If I have understood correctly, authorship was shared with simonsan, but there is no credit to the original source of the idea.

Thanks Michael. I hadn’t considered atomicity limitations of cloud services.

The only time the config file is written to is for repository upgrades which are very infrequent.

The information could be stored in a new directory in the repository, using the same approach of add, then delete files (i.e. like in the key directory). And this concept has merit more broadly - storing a maintenance record within the restic repository. It would, for instance, be useful to record within the repository, when the last prune was executed. If one hasn’t been performed for some duration, the user could be reminded. The same for checks.

We see in the user forums, when these maintenance tasks aren’t performed regularly, it can risk the user’s data. And I have already demonstrated the friction for users to perform routine and systematic full checks of their repositories.

Alexander Neumann said in his interview on changelog.com back in 2021:

The other thing is that Restic must be easy to use. That’s really important, because as I already said, whenever there’s friction; when I have to look something up in the man page and I’m not able to find it - like the command line of tar, for example, is really awful to new users… And whenever you need, for example, while restoring an important file and your boss is on your back and breathing into your neck, and then you have to look up what the tar command line is - that is just not gonna work with backups.
…
So it must be really easy to use, and we’re still using this to improve the workflow whenever we add a feature or correct something, we make sure “Well, how does this feature look like for new users? Are they able to understand it, or is it too complicated?”

If we want to promote good practices with our backup users, we should be looking to reduce friction in performing these practices. Like I said before, routinely checking our repos, at the moment is harder than it needs to be.

Which I guess begs the question: Is such a feature worth the potential disruption of a repo format change?

I have a different perspective.

A wise person once said:

The usual rule of thumb is that a more complicated implementation is ok if that simplifies the interface (all within certain limits obviously).

I agree with this sentiment entirely. Which of course is easy to say when you aren’t responsible for said “complicated implementation”.

And by all accounts, you are a very sophisticated computer user.

The same wrapper script challenge exists for non-24/7 computers, and with a likely greater novice user base.

@fede’s responses to my questions, are exactly what I anticipated for a novice user, and his response, while not empirical, anecdotally validates my argument for these new features.

fede · October 23, 2024, 10:46pm

I’m still creating a script for these tasks, but it’s really like reinventing the wheel, because other users have probably already developed something similar. It would be great to have something more standard.

damoclark · October 23, 2024, 11:33pm

And cross-platform.

MichaelEischer · October 24, 2024, 8:07pm

Yes that data can be stored, but that’s a major change to the repository format. So currently the benefit / cost ratio is rather low from my perspective. We have more pressing issues to attend to.

But we can definitely go for an 80% solution that only requires 10% (or less) of the work (aka not messing with the config or introducing new data structures).

Changing the repository format just to track when the last check was run definitely no longer adheres to “all within certain limits obviously”. I don’t want to spend a full week’s worth of work just to implement a minor feature.

Anything more flexible than the initial --read-data-subset=daily/weekly suggestion would require some way to track state which is too much work for now.

alexweiss · October 24, 2024, 9:22pm

@damoclark You are right that the inspiration for this came from this topic. But read-data-subset was also implemented just a few weeks ago, and at that point I thought about other/better options, which I wanted to add anyway. There was no intent to withhold credit. It is simply that not all sources of inspiration are always cited - which is sometimes right and sometimes wrong.

I actually wanted to add something to the general debate about --read-data-subset=n/m:
I get the impression that people do think if they run this m times with values n=1..m, the full repository is checked. Which is true of course, but only under the condition that the repository does not change during all of those runs. And this condition typically is not satisfied.
So, in theory, this can mean that you always have some inconsistent data in your repository but your check --read-data-subset runs don’t detect any of them.

That said, it’s still a good idea to use it and you’ll find errors with a very high probability (as when using random subsets). It is just not worth over-engineering the n/m thing if at the same time you are trying to check a moving target…

damoclark · October 25, 2024, 11:26am

You know better than anyone else how much your time is worth. I respect that.

Am I right that the overhead in making a change to the repository is quite high (i.e. making change, new tests, user migration tools, documentation etc), and it is not worth all that overhead for what is proposed?

If so, could this be something that is added to a low priority list, that could be rolled into the next repository change for something more substantial?

Just a thought.

damoclark · October 25, 2024, 11:44am

I propose ideas in a public forum for the benefit of the community, just like all the other contributors, including yourself. I’m glad you considered what I proposed and decided to implement it in rustic.

I don’t want to appear precious - it’s not like I am solving world hunger here.

On reflection, my comment likely stems from the fact that: a) I work in academia, and you always cite the work/ideas of others that you build upon; and b) I actually put quite a bit of time, research and thought into what I proposed. The Gregorian calendar is a bugger to work with, as I am sure you probably know.

I appreciate your comments.

D.