CERN is testing restic for their backups

Maybe another success story: https://cds.cern.ch/record/2659420

(If this is not the right place for this, please delete this post.)

9 Likes

Awesome, thanks for the hint!

1 Like

@fd0 maybe you can contact them to see what happens at LAAAARGE scale :wink:
However, this is still a WIP, but a promising one.

1 Like

Heh, they have 16k users with (combined) 3PB of data, but they use one repository (in one S3 bucket) per user, so the memory usage will not be such a huge issue :slight_smile: Good trade-off, IMHO.

And it's just at the evaluation stage for now. I'm curious about the result of their evaluation…

2 Likes

I bet you are. :slight_smile: I am too, and I hope they publish their results or a recommendation.

Keep calm, that is their goal :slight_smile: For now they just have 200 users with a total of 5M files.
That is already more than one would have in a personal backup :wink:

1 Like

I would say that surviving a WIP at an organisation like CERN, with this number of files, is already a reason to raise a glass (or two) of your favourite drink.

5 Likes

Hi, I am the person running this project; I've been around for a while bothering you on the forum/GitHub :slight_smile:

As a quick update, this project is progressing fast and I'm very confident about it going into production at some point. Currently we are backing up 370 accounts daily, and we plan to increase that to 1k shortly.

Also, if we make a mess in one repository, only one user is affected :slight_smile: Another reason for this is that we get more flexibility with bucket placement policies, like moving important users to critical areas, adding extra S3-side replication for certain users, etc. The main downside is that we don't get the full power of the deduplication, but as you said, it's a fair trade-off.
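A minimal sketch of what such a per-user repository mapping could look like (the endpoint, bucket naming scheme, placement classes, and key file path here are invented for illustration, not the actual CERN configuration):

```python
import os

# Invented S3 endpoints for two placement classes; "critical" users could
# live in a zone with extra replication.
S3_ENDPOINTS = {
    "default": "https://s3.example.cern.ch",
    "critical": "https://s3-replicated.example.cern.ch",
}

def restic_env(username: str, placement: str = "default") -> dict:
    """Environment for restic to talk to this user's dedicated repo/bucket."""
    env = dict(os.environ)
    env.update({
        "RESTIC_REPOSITORY": f"s3:{S3_ENDPOINTS[placement]}/restic-{username}",
        "RESTIC_PASSWORD_FILE": f"/etc/backup/keys/{username}",  # hypothetical per-user key
    })
    return env
```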

Yes, for sure! Right now the orchestration tools are tightly coupled to our environment, but my idea, if this goes into production, is to make them more generic and share them.

I will keep you updated on any news regarding this project, and feel free to contact me if you have any questions :slight_smile:

5 Likes

@robvalca when you say S3 I assume Ceph, right?

1 Like

@fbarbeira Yes, we are using ceph+radosgw.

1 Like

I'm looking forward to that! :slight_smile:

1 Like

I'm very glad to hear that! It's the same approach we are implementing in our infrastructure. Not as ambitious as yours, but also huge (4k users and 1PB of data).

I will stay tuned for your progress! :smiley:

Dear friends,

Tomorrow at Ceph Day at CERN I will give a short recap of the current status of this project, which is still very promising and growing (14.5k users now, 35M files processed per day, 270T of combined backup repositories). I have a slide just for restic and I will try to spread the word about all its beautifulness!

Cheers!

10 Likes

Thank you very much for keeping us posted :slight_smile:

1 Like

Just out of curiosity… how did the whole thing go? Is restic now used at CERN? I keep adding new users whenever the conversation turns to backing up data :slight_smile:

1 Like

Hi @nicab

The system is performing really well. It's not our main backup system yet, but if everything keeps going like this, we will probably mark it as 'production' during this year. The restore machinery is also there, so users will be able to restore files by themselves.

Now we have 18k users (86% of all users) backed up daily using restic v0.9.6 (upstream version, no custom code), processing around 70M files/11T daily. Currently we are keeping a total of 560T of S3 storage (we apply prune every two days). Next week I'll participate in the 4th EOS Workshop at CERN, where I will give a quick update on the project and, as always, spread the word about restic's beautifulness :blush:
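Presumably the per-user daily run boils down to a plain restic backup against that user's repository. A minimal sketch under that assumption (the bucket naming, key file path, source path layout, and exclude file are invented; only the restic CLI flags and environment variables are real):

```python
import os
import subprocess

def backup_user(username: str) -> bool:
    """Run the daily restic backup for one user's dedicated repository."""
    env = dict(os.environ)
    env.update({
        "RESTIC_REPOSITORY": f"s3:https://s3.example.cern.ch/restic-{username}",  # invented naming
        "RESTIC_PASSWORD_FILE": f"/etc/backup/keys/{username}",                   # invented key path
    })
    source = f"/eos/user/{username[0]}/{username}"  # assumed home directory layout
    result = subprocess.run(
        ["restic", "backup", "--exclude-file", "/etc/backup/excludes.txt", source],
        env=env,
    )
    return result.returncode == 0
```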

5 Likes

Thanks for the update! That sounds very impressive :sweat_smile: I once visited the ATLAS experiment before it was turned on and still remember my astonishment at all the superlatives I heard and saw at CERN. Good luck to you!

1 Like

@robvalca That's a lot of data. Can you tell us something about the following?

  • How have you structured the repos? E.g. is it one per client/server machine, or have you split it up even more than that, or even combined multiple machines into common repos?

  • How large are the various types of repos - how much data do they back up on the machines and how does your deduplication ratio look (just approximately)?

  • How's the prune process - how long does it take for repos of what size, and how have you split up the prune runs (if you have multiple repos I presume you schedule them so they don't all run at the same time)? Maybe if you have some output of a big prune run, that'd be interesting.

It'd be very interesting if you told us you have it all in one large repo and prune that every couple of days :->

And if so, I'd be curious how many TBs of RAM they have in the machine running prune. :wink:

3 Likes

We have one repository per user (yeah, which means we have one S3 bucket per user). I think that is a good trade-off, as each user's backup does not affect the others. Also, operations on the repos are faster. We delete the backups when a user leaves CERN, and deleting data from shared repos would be really expensive.
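With one bucket per user, dropping a departed user's backups amounts to removing their bucket. A minimal sketch, assuming a boto3-based cleanup (the endpoint, bucket naming, and credential handling are invented, not the actual tooling):

```python
import boto3

def delete_user_backups(username: str) -> None:
    """Remove a departed user's entire backup by deleting their dedicated bucket."""
    s3 = boto3.resource(
        "s3",
        endpoint_url="https://s3.example.cern.ch",  # invented endpoint
        aws_access_key_id="ACCESS_KEY",             # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )
    bucket = s3.Bucket(f"restic-{username}")        # invented per-user bucket name
    bucket.objects.all().delete()                   # a bucket must be empty before deletion
    bucket.delete()
```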

It depends; we offer a quota of 1TB/1M files per user, so one repo can range from a couple of GB up to hundreds. We also have some specific users with extended quotas, who can have > 4TB. The amount of data backed up depends on the day, but I would say between 7 and 11 TB. It is also important to say that, thanks to some features of the source filesystem, we can easily check whether a backup is needed at all, which means we don't run restic every day on every account (unless every user changes something that day, which is unlikely); it only runs if there are changes on the filesystem. I haven't measured the deduplication ratio because it's currently difficult, as we don't back up all data from the source; we do some filtering because there are files that are not needed in the backup. Having a different repo per user won't help with file-level dedup, but CDC will still do its job. I hope I can get these numbers soon, as I'm really interested!
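The "is a backup needed?" check relies on features of the source filesystem (EOS, judging by the workshop mentioned earlier) that aren't described here, so the following is only a generic stand-in: it compares a recorded last-backup timestamp against the mtimes under the user's directory. Paths and the timestamp bookkeeping are invented:

```python
import os

def needs_backup(source_dir: str, last_backup_ts: float) -> bool:
    """Generic stand-in for the 'did anything change since the last backup?' check."""
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        # a changed directory mtime also catches file creations/deletions inside it
        if os.path.getmtime(dirpath) > last_backup_ts:
            return True
        for name in filenames:
            try:
                if os.path.getmtime(os.path.join(dirpath, name)) > last_backup_ts:
                    return True
            except FileNotFoundError:
                continue  # file vanished while walking; ignore it
    return False
```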

We have dedicated "restic pruners" which select random backup jobs in "Completed" status and apply the pruning (6m, 5w, 7d). We run at least one prune for each user every two days. Here are some basic prune stats for some big repos (start, finish, stats), followed by a sketch of the corresponding retention call:

2020-01-28 03:51:54 2020-01-28 04:52:27 Total Size: 1.894 TiB, Freed Size: 85.328 MiB
2020-01-29 01:17:38 2020-01-29 03:11:55 Total Size: 1.963 TiB, Freed Size: 1.018 GiB
2020-01-28 00:27:19 2020-01-28 02:11:07 Total Size: 2.039 TiB, Freed Size: 44.401 GiB
2020-01-26 10:01:12 2020-01-26 11:29:25 Total Size: 2.537 TiB, Freed Size: 1.318 GiB
2020-01-27 20:40:22 2020-01-27 23:48:24 Total Size: 2.580 TiB, Freed Size: 188.792 GiB
2020-01-28 04:35:51 2020-01-28 07:22:37 Total Size: 2.715 TiB, Freed Size: 95.134 GiB
2020-01-28 02:13:21 2020-01-28 04:07:02 Total Size: 2.900 TiB, Freed Size: 384.192 MiB
2020-01-26 11:55:10 2020-01-26 14:09:34 Total Size: 3.091 TiB, Freed Size: 89.219 MiB
2020-01-28 18:52:25 2020-01-28 22:03:17 Total Size: 3.307 TiB, Freed Size: 48.862 GiB
2020-01-29 04:56:07 2020-01-29 07:40:32 Total Size: 4.511 TiB, Freed Size: 755.659 MiB
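Reading "6m, 5w, 7d" as 6 monthly, 5 weekly, and 7 daily snapshots kept, the retention step presumably maps onto restic's standard forget/prune flags. A minimal sketch of one pruner run per repo (the wrapper, repository naming, and key file path are invented; only the restic subcommand and flags are real):

```python
import os
import subprocess

def prune_user(username: str) -> bool:
    """Apply the retention policy and prune one user's repository."""
    env = dict(os.environ)
    env.update({
        "RESTIC_REPOSITORY": f"s3:https://s3.example.cern.ch/restic-{username}",  # invented naming
        "RESTIC_PASSWORD_FILE": f"/etc/backup/keys/{username}",                   # invented key path
    })
    result = subprocess.run(
        ["restic", "forget",
         "--keep-daily", "7",
         "--keep-weekly", "5",
         "--keep-monthly", "6",
         "--prune"],  # drop snapshots outside the policy, then prune unreferenced data
        env=env,
    )
    return result.returncode == 0
```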

No, there are two virtual machines with 16 GB of RAM, running 2 prune processes each.

Let me know if you need to know anything else!