Backing up round robin to initially cloned repositories <-> local restic cache

Sina · September 21, 2019, 1:02pm

My idea is as follows:

I initially clone my restic repo located on an external USB HDD to X further external HDDs (cloning by simply copying is much faster here than creating initial snapshots in newly created repositories on each HDD); from now on NO further cloning
create further snapshots round robin to all those HDDs (HDD 1 first, after some time [with changing data] to HDD 2 and so on)

I suspect I will run into trouble when using a single local restic cache (on a local SSD) for all those backups (the cache would contain information that is not always consistent with every HDD in use).
Is this correct?

If yes: Is there a way to solve this without having to deal with destination (HDD) specific restic commands, which might lead to mistakes (and trouble) too easily?
Would putting the cache on the destination HDDs (or using no cache at all) significantly decrease the speed of any (which?) operation?

In case someone has tried this or has a guess based on advanced knowledge of restic's behaviour, please let me know.

Thanks.

moritzdietz · September 21, 2019, 1:27pm

So all of those repositories contain the same data, right?

Sina · September 21, 2019, 1:55pm

Only right after the initial cloning.

After that there will be one repository with the latest snapshots; the second latest “backup session” will have taken place to another repository, and so on. So the data contained in the repositories will start to differ more or less over time. In particular, there are “areas” that are more important to back up than others, so the number of snapshots per “area” might differ significantly from repo to repo. Apart from the initially cloned snapshots, no snapshot will be present in more than one repo.

moritzdietz · September 21, 2019, 2:07pm

Gotcha. I am re-reading your initial thoughts and concerns but am having trouble coming up with reasons why things would break. Meaning, I currently think you won’t be having “issues”.
Maybe I am also not quite understanding the problem you think you will run into.
So if I am missing something rather big, anyone else please jump in.

So as far as I understand your setup, you have N number of repositories which all start out the same.
After a while they drift apart due to different datasets being backed up to those repositories.
If you don’t use a cache, then yes things will be slower. That’s what the cache is for.
You can use different caches for each repository by adding the --cache-dir argument to your restic command and then specifying the location of the cache for the repo you’re working with.

But personally I don’t see why a single restic cache could lead to issues.
The only thing you have to keep in mind is the disk space the cache will take up.
So if the partition your restic cache lives on is very small, you have to make sure you’re not filling it up to 100% usage.
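A minimal sketch of how the --cache-dir suggestion could be wrapped so that the command looks the same for every drive. The helper names (cache_dir_for, restic_for_drive), the RESTIC_CACHE_ROOT and RESTIC_BIN variables and the /mnt/hdd1 path are assumptions made up for illustration; only the --repo and --cache-dir flags come from restic itself.

```shell
#!/bin/sh
# Hypothetical helpers: one cache directory per repository *path*, so that
# cloned repositories sharing one repository ID do not share a cache.

# cache_dir_for REPO_PATH: print the cache directory to use for that path.
cache_dir_for() {
    repo=$1
    # Flatten the path, e.g. /mnt/hdd1/restic-repo -> mnt_hdd1_restic-repo
    name=$(printf '%s' "$repo" | sed -e 's|^/*||' -e 's|/*$||' -e 's|/|_|g')
    printf '%s/%s\n' "${RESTIC_CACHE_ROOT:-$HOME/.cache/restic-per-drive}" "$name"
}

# restic_for_drive REPO_PATH [restic arguments...]
# Runs restic against REPO_PATH with the cache directory derived above.
restic_for_drive() {
    repo=$1
    shift
    "${RESTIC_BIN:-restic}" --repo "$repo" --cache-dir "$(cache_dir_for "$repo")" "$@"
}
```

With this, something like `restic_for_drive /mnt/hdd1/restic-repo backup ~/data` would be the only form to remember, whichever drive is plugged in. Passing --no-cache instead of --cache-dir would disable the cache entirely, at the cost of speed as mentioned above.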

Sina · September 21, 2019, 2:39pm

I am re-reading your initial thoughts and concerns but am having trouble coming up with reasons why things would break. Meaning, I currently think you won’t be having “issues”.

E. g. restic might “think” from the cache there are blobs present in the repo that in fact aren’t in the currently used one, but only in another repo used for a previous backup using the same cache.

If you don’t use a cache, then yes things will be slower. That’s what the cache is for.

I thought the cache was mainly to improve speed when using online storage by caching information locally that then isn’t needed to be transferred again and again over the internet.

moritzdietz · September 21, 2019, 3:34pm

Restic knows how to handle its own cache. I wouldn’t want to worry about that.

The restic cache is not only used for repositories on backends that are not local to the machine; it is always used, unless you tell restic not to use a cache. This speeds things up no matter which backend is used.

Sina · September 21, 2019, 4:19pm

So cache handling is aware of the possibility that information in the cache might be wrong (e.g. because the cache is used for multiple repos with the same ID, as in the use case described above)? From my understanding there must be either some kind of cache check against the repo actually found… or no information in the cache from previous runs is used… I would be interested in how that is implemented (in case you would like to give some more insight into the cache handling details).

I imagine the cache may reduce seek/transfer time on/from the repo location resulting in increased overall speed whenever the cache location gives faster response than the repo location would.

moritzdietz · September 21, 2019, 8:57pm

Just to make sure: Have you given this a look already? References — restic 0.16.3 documentation
Because this section basically answers your question/concern:

Each repository has its own cache sub-directory, consisting of the repository ID which is chosen at init. All cache directories for different repos are independent of each other.

Sina · September 28, 2019, 2:56am

Unfortunately I don’t find a clear answer to my question/concern here - I think it doesn’t match my case (identical repository IDs in my case vs unique repository IDs assumed here). Could you please explain your conclusion? Actually the quoted docs section increases my concern.

userr1 · September 28, 2019, 4:58pm

Hi @moritzdietz,

He is not using init to initialize the repository, but is simply copying the repository at different times (weeks, months?) from the original repository. Doing something like

 $ cp -a <daily-hard-drive>/restic-repo <weekly-hard-drive>/restic-repo

This is why he is asking about the cache.

cdhowie · September 28, 2019, 5:03pm

I believe that the cache basically keeps the index state of the repository. Keep in mind that restic is designed to work with concurrent access, which means the cache is validated on each run.

To put it another way, restic already has to cope with new and missing data from other systems running backup or forget/prune so I would expect restic to be able to handle this same situation just as well; if the cache is keyed by the repository ID, then to restic it would simply look like something else changed the repository and it will react by updating its cache.

We do much the same thing in production – we have a “front repo” that systems back up to, which keeps only a week of backups. This data gets imported into a “back repo” that keeps much more history. This keeps restic’s RAM usage during backup down. Both repos have the same ID and we haven’t witnessed any issues relating to the cache.

Sina · September 29, 2019, 8:48am

userr1:

moritzdietz:

Each repository has its own cache sub-directory, consisting of the repository ID which is chosen at init

Hi @moritzdietz,

He is not using init to initialize the repository, but is simply copying the repository at different times (weeks, months?) from the original repository. Doing something like
 $ cp -a <daily-hard-drive>/restic-repo <weekly-hard-drive>/restic-repo
This is why he is asking about the cache.

Thanks for giving this example. My potential use would be like this:

$ cp -a <uneven_weeks-hard-drive>/restic-repo <even_weeks-hard-drive>/restic-repo

From then on no further cp, but restic with alternating hard-drives.

moritzdietz · September 29, 2019, 2:40pm

Ahh! Ok - yeah I must’ve misunderstood your setup then.

Sina · September 29, 2019, 4:31pm

Thanks for the explanation. Do you think this update will completely rebuild the cache?

Sounds interesting… How do you import the “front repo”'s data into the “back repo”?

cdhowie · September 29, 2019, 4:48pm

I’m not sure whether the cache is totally rebuilt, as in downloading all index files (@fd0 may be able to answer that), but it would have to at least download new index files and remove local copies of missing ones; otherwise very basic functionality would break in any situation where multiple restic clients share access to the same repository.

My point is just that what is being described here would look exactly the same to restic as though the same repository was just modified by another client. Either both would have to work or neither would.

On my system, ~/.cache/restic contains directories that are named after the repository ID, so I do believe that they would share a cache, unless you use --cache-dir when using one of the repositories to maintain a separate cache.
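A rough sketch of how one could check this on one's own system. Both function names are hypothetical, and the parsing assumes that `restic cat config` prints JSON containing an "id" field, which should be verified against the restic version in use.

```shell
#!/bin/sh
# list_cached_repo_ids [CACHE_ROOT]
# Print the names of the sub-directories in the cache root; per the docs
# quoted earlier, each one is named after a repository ID.
list_cached_repo_ids() {
    root=${1:-$HOME/.cache/restic}
    for d in "$root"/*/; do
        [ -d "$d" ] || continue      # the glob stays literal if nothing matches
        basename "$d"
    done
}

# repo_id_from_config
# Read repository config JSON on stdin and print the value of its "id" field.
repo_id_from_config() {
    sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([0-9a-f]*\)".*/\1/p'
}
```

For example, `restic -r <repo> cat config | repo_id_from_config` run against two of the cloned drives should print the same ID, and that ID should show up once in the output of `list_cached_repo_ids`.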

The following basic script is used. I’ll explain each line.

#!/bin/sh

cp -aln /var/restic/front/data /var/restic/back/ && \
cp -aln /var/restic/front/snapshots /var/restic/back/ && \
/usr/local/sbin/restic-front forget --prune --group-by host,tags --keep-within 7d && \
/usr/local/sbin/restic-back rebuild-index

Lines 1 and 2 hard-link all absent data and snapshots from the front repository to the back. Running as two separate commands with the packs processed first ensures the repository remains consistent (otherwise there is a small window of time where snapshots are added but the requisite data is not).

The cp options are as follows:

-a: copy recursively, preserving basically everything (ownership and timestamps are what I care about there)
-l: instead of copying contents, hard-link copied files (this means no additional disk space is used, and it’s safe since repository files are never changed)
-n: do not copy files that already exist in the destination

Line 3 is a standard forget+prune line to remove all snapshots except those created in the last 7 days from the front repository. Note that a significant amount of duplicate data is removed here, and this is expected since we’re copying all pack files from an “outside” perspective; we don’t (and can’t) avoid copying in duplicate blobs. (Edit: My bad here, this applies when pruning the back repository, not the front one. Sorry for the confusion.)

Line 4 rebuilds the index for the back repository, which is required for future actions on the back repository to function correctly; we added new data to the repository but it’s absent from the index.

Note that this script should not be run while any exclusive lock is held on the back repository, especially if that locking operation is prune; throwing new data in the middle of a prune could easily result in some of the new data being incorrectly removed. This can’t be enforced by the script since restic has no way to acquire a lock for use by commands outside of restic.
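As a partial mitigation, a wrapper could at least refuse to run while any lock file exists in the back repository. This is only a sketch under assumptions: it relies on the local repository layout having a locks/ directory, it cannot tell exclusive locks from shared ones (lock files are encrypted), a stale lock will block it, and a lock can still appear right after the check. The function names are hypothetical.

```shell
#!/bin/sh
# repo_has_lock_files REPO
# Succeeds (status 0) if REPO/locks contains at least one file.
repo_has_lock_files() {
    locks_dir=$1/locks
    [ -d "$locks_dir" ] || return 1
    [ -n "$(find "$locks_dir" -type f | head -n 1)" ]
}

# import_if_unlocked BACK_REPO COMMAND [ARGS...]
# Runs COMMAND only if no lock file is present in BACK_REPO; this is a
# conservative check, not a real lock, so a race window remains.
import_if_unlocked() {
    back=$1
    shift
    if repo_has_lock_files "$back"; then
        echo "lock files present in $back/locks; skipping import" >&2
        return 1
    fi
    "$@"
}
```

Usage would be along the lines of `import_if_unlocked /var/restic/back /path/to/import-script`, where the script path is a placeholder for the script shown above.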

Once this feature is implemented, we will alter our scripts to use it instead:

github.com/restic/restic: “Add command to copy all data to another repository” (fd0; opened 10:33AM - 25 Oct 15 UTC; closed 08:18AM - 30 Aug 20 UTC; type: feature suggestion, state: work in progress)

During the discussion in #320 we discovered that functionality may be helpful to… copy all data (data blobs, tree blobs, snapshots) from a repository to a new one, recreating pack files and indexes on the fly. This allows creating a new repository in a different location (e.g. moving from a local repository to an sftp-server) and using that from now on without losing any history and old snapshots. This issue tracks the implementation of this feature and can be closed when it is implemented.

Sina · September 29, 2019, 5:31pm

Could you please explain in more detail? It seems I am missing something that I need in order to understand how “a significant amount of duplicate data is removed here”. Thanks.

cdhowie · September 29, 2019, 6:16pm

Pruning either the front or back repositories will rewrite packs to remove objects, and the new packs will have a new ID. If the old pack still exists in the other repository (which is very likely) then when the front packs are copied to the back, many of the objects contained in those packs will already exist in other, rewritten packs in the back repository that don’t exist in the front repository.

Front repo pack A has objects 1, 2, and 3.
Pack is copied to back repo.
Front repo is pruned.
- Object 2 is found to be unused.
- Pack A is rewritten to be pack B containing objects 1 and 3; pack A is then deleted.
Later, pack B is copied to the back repo.
The back repo still has pack A. Now it also has pack B, which contains an extra copy of objects 1 and 3.
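The steps above can be replayed as a toy model in shell. Packs are just space-separated lists of object numbers here; this models the bookkeeping only and has nothing to do with restic's real pack format.

```shell
#!/bin/sh
# Toy model of the example above: each "pack" is a list of object numbers.
pack_a="1 2 3"   # copied to the back repo before the front repo was pruned
pack_b="1 3"     # rewritten pack created by pruning the front repo

# duplicate_objects PACK...
# Print every object that is stored in more than one of the given packs.
duplicate_objects() {
    printf '%s\n' "$@" | tr ' ' '\n' | sort | uniq -d
}

# The back repo ends up holding both packs, so objects 1 and 3 are stored twice.
duplicate_objects "$pack_a" "$pack_b"
```

Running it prints 1 and 3, the objects a later prune of the back repo would find duplicated.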

The “copy these snapshots to this other repository” command will eliminate this problem, as restic itself will copy only the objects that are needed and don’t already exist in the back repository. We don’t have that level of precision just copying data files – without the password and decryption routines, we can’t even know what’s in them.

Sina · September 29, 2019, 10:46pm

Sounds to me like a prior misunderstanding of the term “remove” here. In case you meant “relocate” or “move” I do understand. In case you meant “remove” like command “rm” in the sense of “vanish” I still didn’t get it.

akrabu · September 30, 2019, 12:00am

I think what he’s doing in a nutshell is copying all the data from the front repo to the back repo, then pruning the front repo (which has grown to several months, perhaps) down to only 7d worth of data. So although all these snapshots got copied to the back repo, they didn’t get MOVED. Pruning is what removes them from the front repo, hence the “a significant amount of duplicate data is removed” line.

I think “duplicate” may have been the word that tripped you up. I don’t think he means duplicate within the front repo alone. He means duplicate as in “existing in both the front and back repos”. I think. Correct me if I’m wrong.

cdhowie · September 30, 2019, 2:56am

Pruning the front repo rewrites many of the front repo’s packs. Copying those packs into the back repo is what causes the duplicates in the back repo (which are discovered and removed when pruning it).

To clarify: it’s expected behavior and nothing is wrong. It’s a side-effect of copying restic’s data files around outside of restic, and I knew while I was writing the scripts that it would happen.

This is correct.

This is close. “Objects that exist in both repos but in different pack IDs” would be more accurate.

The tl;dr is “many of the new packs copied from the front repo to the back repo likely contain some stuff the back repo already has.” Pruning the back repo fixes that.

And, as mentioned before, the new “copy” command being worked on will eliminate this inefficiency as restic will be the one doing the copy and it can just avoid copying the data that’s already in the back repo. We can’t avoid that using the current workaround, but it’s just that: a workaround. I can live with the inefficiency on the backup server itself if it means the production servers don’t have to use as much RAM to run their backups.
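For readers arriving later: the linked issue has since been closed, and restic now has a copy command. A hedged sketch of what the import step might look like with it, assuming restic 0.14 or newer, where the source repository is given via --from-repo and the destination via --repo. The function name and the RESTIC_BIN override are made up for illustration, and credentials for both repositories still have to be supplied in the usual way.

```shell
#!/bin/sh
# copy_front_to_back [FRONT_REPO] [BACK_REPO]
# Assumption: restic >= 0.14 (copy with --from-repo). The default paths are
# the ones used in the script earlier in this thread.
copy_front_to_back() {
    front=${1:-/var/restic/front}
    back=${2:-/var/restic/back}
    # restic copies only the data the destination does not already have.
    "${RESTIC_BIN:-restic}" --repo "$back" copy --from-repo "$front"
}
```

Setting `RESTIC_BIN=echo` before calling the function prints the arguments that would be passed to restic instead of running it, which is a cheap way to check them first.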