Split/hybrid/isolated setup with two disks: backup even though the contents of the data folder have been moved?

tldr: If I move the contents of the subfolder “data” of the restic repo elsewhere, will running “restic backup” still work? I don’t want to run other commands before I move the original contents back. I know that this is not officially supported. But given the current architecture of restic, is there a good chance that this approach works?

A similar topic, Fragmented storage?, was discussed recently. But my case is different because I don’t need the full power of restic - I just want to run “restic backup”.

I have about 5 TB of data to back up. Nowadays the data only grows slowly, maybe 150 GB a year. Storing all of it in the cloud is too expensive for me and the initial upload would take forever. So I rely on 3.5’’ external drives. I have one in my house and store the other off-site (at work).

About my off-site storage: I want to store my data off-site pretty frequently, but at the same time I want to only rarely transport the off-site disk: I mostly commute by train, and 3.5’’ drives are pretty fragile and big. This off-site, secondary HDD is not needed for regular restores - it’s only needed in extreme circumstances like {fire|theft|cryptolocker|power surge|…} that kill my computer and the primary local backup on my first external HDD. The local backup to the first external HDD runs frequently, but it’s enough to run the secondary off-site backup e.g. once a week or month, depending on the amount of new photos I have. In the unlikely case that it’s the only data I have left, I can live with such a data loss.

The primary local backup will be made with different software (so that my backup software is not a single point of failure).

I am thinking about using restic for my off-site backup.

I have about 200 GB of cloud storage with rsync and sftp support. I also have an unused external 250 GB SSD that’s so small that I can even carry it in my pants pocket.

My idea is to use this SSD (or cloud storage) as a temporary helper for the off-site restic-backup:

  • I make an initial backup on the secondary, external 3.5’’ HDD. Then I mirror this repo to the small SSD while excluding the contents of the data folder. Then I store the HDD off-site.
  • Then I regularly run restic backup to the repo on the SSD. When the SSD is nearly full, or every couple of months, I’d fetch the HDD and sync the SSD back to the HDD with rsync -av SSD/ HDD/ (see the sketch after this list). This rsync command wouldn’t delete any file from the HDD (destination) because I don’t include --delete.
  • Then I’d delete everything in the data dir of the SSD repo so that I could start the cycle again.
  • If I ever need to restore something from the off-site HDD: I’d first sync the SSD to the HDD, then I’d restore from the HDD.
  • I would only run “restic backup” against the SSD; all other commands such as “check”, “forget”, or “prune” would only be run against the HDD. After each of these commands I’d sync everything (except for data) back to the SSD.
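
To make this concrete, here is a rough sketch of the commands I have in mind; the mount points (/mnt/hdd, /mnt/ssd) and the source path are just placeholders, and as said, this is not an officially supported way to use a repository:

```
# Initial backup onto the off-site 3.5'' HDD:
restic -r /mnt/hdd/repo init
restic -r /mnt/hdd/repo backup /home/me/data

# Mirror the repo to the SSD, leaving out the pack files in data/:
rsync -av --exclude='data/*' /mnt/hdd/repo/ /mnt/ssd/repo/

# While the HDD is off-site, back up only to the SSD copy:
restic -r /mnt/ssd/repo backup /home/me/data

# When I fetch the HDD: push everything new back (no --delete, so the HDD
# never loses files), then run check/forget/prune against the HDD only:
rsync -av /mnt/ssd/repo/ /mnt/hdd/repo/
restic -r /mnt/hdd/repo check

# Empty data/ on the SSD and re-mirror the metadata to start the next cycle
# (--delete so that files removed by forget/prune disappear from the SSD too,
#  while the exclude keeps rsync away from the emptied data/ dir):
rm -rf /mnt/ssd/repo/data/*
rsync -av --delete --exclude='data/*' /mnt/hdd/repo/ /mnt/ssd/repo/
```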

I guess that this approach is not and will never be officially supported. But given the current architecture of restic, is there a good chance that this approach works (and will keep working for the foreseeable future)? Would it also work with a temporary sftp cloud repo?

I just did a few small test backups and so far it seems to work.

Thanks for your help.

Considering that one takes backups in the knowledge that they can be relied upon, sometimes many years later, I would say that any fragility in using a repository in unsupported ways can only lead to disaster. When you multiply this fragility with a manual process then arguably you don’t have a backup - you have a house of cards which is less reliable than the source data.

It would be my opinion that rather than invest time in making this process appear to work, invest in large enough storage devices to use restic as it’s meant to be used, or find a backup tool that natively supports your use case.

I have to agree with @ProactiveServices. Instead of a quite fragile manual solution that involves regularly carrying hard disks around, you should try to find a fully automatic solution. For the one-time initial backup, carrying or shipping a hard drive is fine, but for the regular backups you should find a friend or someone else who is able to put your hard drive “online” in some way - or use cheap cloud storage.

Besides that, from a technical point of view the backup command only ever adds data and only reads tree blobs from the data/ dir of your storage backend (you could use --force to prevent even this, but then backup is slow), and those tree blobs are also contained in the cache dir that restic creates.
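
For illustration (the repository path and backup source are just placeholders):

```
# With --force restic re-reads all source files instead of diffing against the
# parent snapshot, so it does not need tree blobs from the repository - at the
# cost of a much slower backup run.
restic -r /mnt/ssd/repo backup --force /home/me/data
```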

Technically your issue is quite similar to the “cold storage” issues (which might also solve your cloud problem, as cold storage is much cheaper than usual cloud storage). Here is the answer: restic does not (yet) fully support this.

Thanks for your answers. My computer skills are limited so I’m here to learn.

My motivation was this: I know my idea is far from perfect. I was looking for the least bad alternative.

A cloud backup would be nice, but about 5 TB of cloud storage seems to cost about $50/month, which is $2,400 over four years. If I use cold storage such as AWS Glacier or Backblaze B2, I’d be at roughly $25 per month / $1,200 over four years just for storage. I don’t want to pay that much given my other financial obligations.

I could look for cloud backup services for private users that offer “unlimited” storage for little money, e.g. Backblaze. But usually these are for Mac and Windows only (whereas I use Linux), and often there’s unexpected fine print, e.g. Backblaze deletes backups from external drives if they are not attached for over a month (see here), and their file history is limited to 30 days by default. So I gave up on this idea.

I could try to run a NAS elsewhere to which I back up. This would be my preferred solution. But at the moment I know no one who a) would have my NAS running in their house or office and b) also has a decent bandwidth on their internet connection. In the short run I neither know how to earn more money (given some constraints in my life) nor how to quickly find acquaintances that fulfill conditions a) and b).

I looked for other backup tools. There are tools that support multiple targets, e.g. backup tools from about 15 years ago, when you still had to split a backup across different DVDs. But usually these don’t offer detection of moved or renamed files, which is essential for me. Then there are tools that also support tape backups, but usually this is complicated enterprise software where a server backs up multiple clients. I fear that setting them up would take very long, and when there’s much to configure there’s a good chance I make some error.

Most other backup tools also support just one target, so there’s little reason to prefer them over restic, which I already know. Actually there is one hard-link based single-user backup solution named “storeBackup” that apparently supports a temporary backup target. Unfortunately it has been unchanged since 2014, I’ve never seen any reports about its stability, the official support forum has had just two threads in the last three years, and most recently its official site was actually cryptocurrency advertising. Relying on this doesn’t seem like a good solution, either.

Trying to create a custom solution will definitely lead to disaster in my case.

I could try a different approach:

  • Reorganizing my data into archive folders that I never change and a small active folder whose cloud storage even I can afford. In theory this would be a good solution, but in the past 10 years I haven’t managed to keep a stable archive - instead I sometimes reorganize, etc. So I’d like to avoid this.
  • I think ZFS and btrfs allow sending snapshots of my file system off-site, but as far as I can see I can’t just send the snapshot alone. It might be a quick and small transfer, but I’d still need a lot of storage at the other end. And I’d have to invest a lot of time to learn this, and btrfs has no built-in encryption (afaik), etc.

I don’t mind a manual solution. It’s not too time-consuming for me, at least compared to the time I’d have to invest to learn about cron jobs, bash scripting, etc. Each Friday I take some work home to the desk in my home office, where I also have my computer. Each Sunday night I pack my backpack again.

Every three or six months I could exchange one of my external drives at work. So if I’m unlucky, I’ll lose about three to six months of data. That’s probably less than what most people in my social group would lose - they often have no backup strategy at all.

Maybe I can improve on this: I think I’ll use btrfs on my external drives, and shortly before I take each drive to its off-site storage period at work I’ll make a btrfs snapshot. Then I’ll occasionally try my approach with an empty data folder on my SSD. I won’t write this back to the external drive and will only rely on it in case I lose both my main drive and the external disk I have at home at that moment. Then this can only improve my situation …

Thanks for reading another long post. I’m open to any new ideas. Please point out my mistakes.

Just a side note here, the streams produced by btrfs send can be transformed as long as the transformation is reversible later. For example, you could pipe the output through gpg to encrypt the stream. You would just need to pipe through gpg -d to decrypt the stream when sending it to btrfs receive.
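
A rough sketch of such a pipeline (the snapshot and output paths are placeholders, gpg in symmetric mode is used just for illustration, and the snapshot has to be read-only for btrfs send):

```
# Encrypt the send stream before it leaves the machine:
btrfs send /mnt/pool/.snapshots/2021-01-01 | gpg -c -o /mnt/offsite/2021-01-01.btrfs.gpg

# Later, decrypt the stream and replay it into a btrfs filesystem:
gpg -d /mnt/offsite/2021-01-01.btrfs.gpg | btrfs receive /mnt/restore/
```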

I don’t know where you are from, but you can get a 4-bay QNAP NAS for $400 CDN, fill it with four 4 TB Seagate IronWolf NAS drives (or other NAS-rated drives), and configure it for RAID 5, giving you around 12 TB of storage at a total cost of around $1,200 CDN. Do your initial backup at home, then get a friend to host it for the ongoing backups, which will require much less bandwidth. In exchange you can let them use a portion for their own backups. Should you ever have to do a massive restore, bring the NAS back home and return it afterwards.

@MichSchnei If you are low on budget: I think I even read in this forum about users that use a Raspberry Pi with an attached USB drive. If you find someone with a good-enough internet connection to “host” this Pi plus your existing hard drive, you get the possibility to automate things for the cost of a Pi!

And I think that most internet connections should be fine with around 500 MB of extra inbound traffic per day.
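
Restic could then talk to the Pi over its sftp backend, for example (user, hostname and paths below are made up):

```
# One-time repository setup on the drive attached to the Pi:
restic -r sftp:pi@friends-house.example.org:/mnt/usb/restic-repo init

# Regular, automatable backups over the internet:
restic -r sftp:pi@friends-house.example.org:/mnt/usb/restic-repo backup /home/me/photos
```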

An alternative would be AWS Glacier Deep Archive. This should be around $5 per month for 5 TB (storage costs). But this is not really supported by restic at the moment, so I would suggest just syncing your repo via rclone (this has also been discussed in the forum already).
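
Such a sync could look roughly like this, assuming an rclone remote for the bucket is already configured (the remote and bucket names are placeholders):

```
# Copy the local repository to the remote bucket; only new or changed files are
# transferred, which fits restic's mostly append-only repository layout.
rclone sync /mnt/hdd/repo s3-archive:my-restic-bucket/repo
```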

Anyone using GDA should be aware that there are substantial costs for retrieval. S3 pricing is not simple.

You’d probably set up a rule to transition S3 Standard objects to Glacier Deep Archive (GDA) after 1 day.

  • You upload to S3 Standard, which costs:
    • $0.023 per GB-month (prorated).
    • $0.005 per 1,000 files uploaded.
  • After one day, the uploaded files are transitioned to the GDA tier.
    • Now they cost $0.00099 per GB-month (prorated).
    • There is a $0.05 fee per 1,000 files transitioned.
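
For illustration, such a rule could be created with the AWS CLI roughly like this (the bucket name is a placeholder; double-check the current S3 documentation before relying on it):

```
# Transition every object in the bucket to Glacier Deep Archive one day after upload.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-restic-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "to-deep-archive",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}]
    }]
  }'
```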

So uploading 5,000 8MB files is going to cost you:

  • Approx. $0.03 in S3 Standard storage fees.
  • $0.025 in upload request fees.
  • $0.25 in fees for the transition to GDA.
  • Approx. $0.04 per month stored in GDA.

About $0.35 just for the upload. Not bad… but wait, there’s more.

Transitioning a file to the GDA tier is a 180-day commitment. If you delete a file or transition it to a different storage tier before it has been stored for 180 days, you are immediately charged the full prorated “early deletion” fee for the remainder of the commitment.

Standard retrieval of an object takes 3-5 hours. For this retrieval tier, you are charged the following:

  • $0.10 per 1,000 files retrieved.
  • $0.02 per GB retrieved.

Bulk retrieval of an object takes 5-12 hours. Charges are:

  • $0.025 per 1,000 files retrieved.
  • $0.0025 per GB retrieved.

For both retrieval tiers, this will create a special temporary “retrieved” object in the S3 Standard tier, meaning you will additionally be billed the S3 Standard storage rate of $0.023 per GB-month while that retrieved object exists.

In all cases, you are charged an additional $0.09 per GB of data downloaded from AWS (when you actually access this data with restic).

Let’s say your repository is 1TB. Storage in GDA is costing you $1 a month. Cool, right?

Now let’s say you need to retrieve all of that data. Restic packs are about 8 MB in size, so 1 TB / 8 MB = 125,000 files, give or take. To be generous, let’s assume you can wait up to 12 hours, so you use the bulk retrieval tier and you make the restored objects available for 1 day. Your charges for this retrieval are:

  • $3.125 retrieval fees (by file count).
  • $2.50 retrieval fees (by size).
  • $0.77 S3 Standard storage for the restored objects (1TB for 1 day).
  • $90 in egress traffic fees.
  • $0.05 for the GET requests to download the objects.

So $96.45 for the whole restore operation. This all scales linearly, so if your repository is 5TB you can expect to pay about $500 if you should ever need to restore the whole thing.
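
If you want to check or adapt these numbers, here is a quick back-of-the-envelope calculation using the same assumptions (1 TB treated as 1,000 GB, ~125,000 pack files, the list prices quoted above; not an official cost calculator):

```
awk 'BEGIN {
  files = 125000; gb = 1000
  bulk_requests = files / 1000 * 0.025   # $3.125 bulk retrieval, by file count
  bulk_per_gb   = gb * 0.0025            # $2.50  bulk retrieval, by size
  std_storage   = gb * 0.023 / 30        # ~$0.77 S3 Standard for the restored copy, 1 day
  egress        = gb * 0.09              # $90.00 egress traffic
  gets          = files / 1000 * 0.0004  # $0.05  GET requests
  printf "total: $%.2f\n", bulk_requests + bulk_per_gb + std_storage + egress + gets
}'
# Prints roughly $96.44, in line with the ~$96.45 estimate above.
```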

Now that’s a “simple” example. Real-world cases are a lot more complex with more unpredictable storage costs. If you need the data sooner, you have to pay even more for the GDA standard retrieval tier.

Let’s run Backblaze B2 now.

  • Storage for the 1TB repository costs $5/mo, which is 5x as much as GDA.
  • Retrieval costs $10 in egress fees plus about $0.05 in request fees, so $10.05 for the operation.

In this scenario, the cost to restore from B2 is 10% of the cost to restore from S3 GDA.

So, ultimately, it depends on how often you expect you need to restore. You can restore 10 times with B2 for the same costs as restoring once from GDA.

And then you pile on all of the caveats of using GDA (which are already documented on this forum), one of which is that you can never prune without transitioning everything from S3 GDA back to S3 Standard, which alone is expensive (retrieval fees, plus S3 Standard fees, plus any GDA early deletion fees).

In my opinion, GDA with restic is just not worth messing with. GDA has its uses for other things; restic does not play well with it and there are all sorts of pricing gotchas.

@cdhowie Thanks for clarifying that when using one of the cold storage offerings out there, you really have to calculate all the costs!

I also agree that restic should not be used to access such cold storage directly. An rclone repo-copy solution, where you can see what you are doing, is also recommended and makes the cost calculation easier.

(Just a side remark: we had this discussion already, and I agree that restic is not yet able to handle cold storage smoothly, but I am quite sure that the repository format would allow it. So this is a question of implementing it as a future feature. Your statement about pruning, for instance, is no longer correct for the current beta releases - there are now the options --max-unused=unlimited and --repack-cacheable-only.)
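
For illustration, on a current beta that could look like this (the repository path is a placeholder; check restic prune --help for your version):

```
# Never repack pack files just to reclaim unused space, and only repack packs
# that contain cacheable (tree) data, so the large data packs don't have to be
# read back from (cold) storage.
restic -r /mnt/hdd/repo prune --max-unused unlimited --repack-cacheable-only
```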

Thanks a lot @cdhowie for that summary, that’s very useful as a starting point to anyone who wants to consider the AWS/S3 stuff for their repositories.

Thanks. I need to consolidate all of this information in one post. I also calculated the break-even point in terms of “restores per year” for S3 GDA vs B2:

The summary is that if you perform a full restore approximately every two years, the cost is the same. If you restore less frequently, S3 GDA is cheaper. If you restore more frequently, B2 is cheaper.

I still prefer B2, even though I’ve never had to restore. There are no early deletion fees; all of the storage is “hot”, so you don’t have to wait for a 5+ hour restore operation before you can access your data; you can prune regularly; and you can run check operations regularly if you want to. GDA is not so much cheaper that I am willing to trade all of that.