How to optimize for deduplication

Hello,
I am planning to back up a GitLab installation.
The strategy for getting a consistent backup in GitLab is (in short) to clone the repositories and pack the clones into a single tar archive. That archive is then included in the restic snapshot.
My concern is that this could hurt deduplication and use a lot of storage even when most of the data has not changed.
An alternative is to skip the tar step and back up the unpacked clones directly. Restic would then see the individual files of each git repository and could identify the unchanged ones, in theory taking better advantage of deduplication.
Note that every file’s mtime changes on each backup, since the repositories are fresh clones.
Is what I’m saying correct?

I understand your concern. But since tar essentially just concatenates the files, and restic deduplicates at the chunk level rather than the file level, I think you don’t have to worry about that.
One more thing: if you are also gzipping the tar, consider using the --rsyncable option. It was introduced a long time ago to help rsync detect changes inside a compressed archive and send only the deltas. I never measured it exactly, but my impression is that it also helps with restic.
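
To illustrate the chunk-level point, here is a toy sketch in Python. It is not restic's actual chunker (restic uses a Rabin fingerprint and much larger chunks); the gear-style rolling hash, the tiny header format, and all names below are assumptions made for the demo. It also ignores real tar headers, which embed each file's mtime. The sketch concatenates 20 files into an "archive", replaces one file with content of a different length, and compares how many bytes are reused under content-defined versus fixed-size chunking.

```python
import hashlib
import random

# Hypothetical gear table for the toy rolling hash (seeded for repeatability).
_gear_rng = random.Random(12345)
GEAR = [_gear_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, avg_bits=8):
    """Content-defined chunking: cut where the top avg_bits of the hash are 0.

    The 32-bit hash only depends on the last 32 bytes, so cut points are
    decided by local content, not by the offset inside the archive.
    """
    chunks, start, h = [], 0, 0
    shift = 32 - avg_bits
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> shift == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=256):
    """Fixed-size chunking, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def make_archive(files):
    """Concatenate (name, content) pairs with a small header, loosely like tar."""
    out = bytearray()
    for name, content in files:
        out += f"{name}:{len(content):08d}\n".encode()
        out += content
    return bytes(out)


def shared_fraction(old, new, chunker):
    """Fraction of the new archive's bytes that sit in chunks already stored."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    reused = sum(len(c) for c in chunker(new)
                 if hashlib.sha256(c).digest() in known)
    return reused / len(new)


def random_bytes(seed, n):
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


files_v1 = [(f"file{i:02d}", random_bytes(100 + i, 8192)) for i in range(20)]
files_v2 = list(files_v1)
files_v2[5] = ("file05", random_bytes(999, 9000))  # one file changed and grew

archive_v1 = make_archive(files_v1)
archive_v2 = make_archive(files_v2)

cdc_share = shared_fraction(archive_v1, archive_v2, cdc_chunks)
fixed_share = shared_fraction(archive_v1, archive_v2, fixed_chunks)
print(f"content-defined chunking reuses {cdc_share:.0%} of the new archive")
print(f"fixed-size chunking reuses      {fixed_share:.0%} of the new archive")
```

With content-defined chunking, the chunk boundaries resynchronize shortly after the changed file, so most of the second archive is already known. With fixed-size chunks, everything after the change shifts and only the chunks before it are reused.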

What problem are you trying to solve by cloning and tarring, instead of just backing up whatever directories you want to back up?

Cloning is necessary for the consistency of the backup, since the repositories are backed up together with an export of the database and the metadata. Cloning also allows the backup to run while people are working.
My question is whether creating the archive makes sense or not.

I’d say it’s not of much use to archive it. It would only be worthwhile if the contents are so compressible that you gain more from compression than you lose in deduplication.

With respect to deduplication, you’re going to have far more problems with git packfiles than you are with making a tarball of the clone.

What I do in my backups is a bare clone (--bare), or alternatively you can just back up the .git folder.

The checked-out files themselves don’t add anything that isn’t already in the repository folder, so you will save some space on your backups :stuck_out_tongue:

For your own peace of mind, make sure you try to rebuild a full repository from a bare one :slight_smile:

Also, don’t re-clone every time you do a backup. Keep your clone and pick up only the changes (git fetch --all). When you clone a repository, git packs the objects into packfiles in a way that is not guaranteed to be the same each time, so the files would probably end up different on every clone. :+1:
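
A minimal sketch of the "keep the clone and only fetch" idea, in Python. The function name `sync_mirror`, the `run` parameter, and the URL and path in the usage comment are all hypothetical. Note that it swaps in `git clone --mirror` for `--bare`: as far as I know, a plain `--bare` clone does not configure a fetch refspec, so a later `git fetch --all` would not update its branches, while a mirror clone does.

```python
import subprocess
from pathlib import Path


def sync_mirror(url, dest, run=subprocess.run):
    """Create a mirror clone on the first call, fetch into it afterwards.

    `run` is injectable so the command can be inspected without calling git.
    Returns the command that was executed.
    """
    dest = Path(dest)
    if (dest / "HEAD").exists():
        # An existing bare/mirror repository has HEAD at its top level:
        # only pick up the changes, and drop refs deleted upstream.
        cmd = ["git", "-C", str(dest), "fetch", "--all", "--prune"]
    else:
        # First run: a mirror clone is bare and tracks all refs.
        cmd = ["git", "clone", "--mirror", url, str(dest)]
    run(cmd, check=True)
    return cmd


# Hypothetical usage, run before each restic backup:
# sync_mirror("https://git.example.invalid/group/project.git",
#             "/srv/backup/mirrors/project.git")
```

Because the existing packfiles stay in place between runs, restic mostly sees new pack data rather than a freshly repacked repository.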

@creativeprojects I believe OP is backing up an entire Git collaboration installation, not just a few repositories that they use.

Exactly, the backup procedure involves all the repositories, metadata and databases of the GitLab suite.

oh right, sorry :roll_eyes: