Is it possible to split the backup of large data into smaller parts?

waldauf · July 15, 2021, 7:28pm

My most important data (mainly photos and my own video productions) is stored on the NAS, but I would like to follow the 3-2-1 rule. So I bought Google Drive (for the curious: because of the rclone mount and the price) and would like to back up (maybe “archive” is more accurate) this data there. I would like to run a scheduled backup script from the Raspberry Pi to Google Drive. But the initial backup means uploading almost 1.5 TB, which is a really huge amount.

This data is structured as follows (example directory sizes are given in square brackets):

/data/foto             [900 GB]
  2000 and older       [500 MB]
  2001                 [100 MB]
  ~
  2021                 [100 GB]
/data/video/           [300 GB]
  dir1                 [10 GB]
  dir2                 [10 GB]
  dirN
/administration        [400 MB]

My first attempt was to back up these three directories together. I measured the first 5 hours, and about 5 GB was uploaded each hour. That means around 13 days of continuous uploading. That is not a good solution, because problems such as network flapping are very likely over such a long run.
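
The estimate above can be reproduced with a few lines of shell arithmetic. This is only a rough sketch: the function name is made up for illustration, and the inputs (about 1500 GB in total, about 5 GB per hour) are the rounded figures from this post.

```shell
# Rough estimate of the initial upload time.
# Arguments: total size in GB, measured upload rate in GB per hour.
estimate_upload_time() {
    total_gb=$1
    rate_gb_per_hour=$2
    # Integer division is precise enough for a rough estimate.
    hours=$(( total_gb / rate_gb_per_hour ))
    echo "${hours} hours = $(( hours / 24 )) days $(( hours % 24 )) hours"
}

# Figures assumed from the post: ~1500 GB at ~5 GB per hour.
estimate_upload_time 1500 5
```

With these inputs the result is 300 hours, i.e. 12 days and 12 hours, which matches the "around 13 days" figure.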

Would this example work (the backup commands below don’t include all the needed flags)?

#1 part
restic backup /data/foto/200*

#2 part
restic backup /data/foto/201*

#3 part
restic backup /data/foto/202*

#4 part
restic backup /data/video/dir1

#5-N parts
restic backup /data/video/dir{2,3..N}
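
The dir{2,3..N} line is shorthand, not real shell syntax. One possible way to write the "one run per directory" idea as a loop is sketched below. The function name is hypothetical, and the sketch assumes the repository and password are already configured through restic's environment variables (for example RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE).

```shell
# Back up every immediate subdirectory of a base directory in its own run.
# backup_each_subdir is a hypothetical helper name used for this sketch.
backup_each_subdir() {
    base_dir=$1
    for dir in "$base_dir"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        # Strip the trailing slash left by the glob.
        dir=${dir%/}
        echo "Backing up $dir"
        # Stop at the first failing run so it can be investigated or repeated.
        restic backup "$dir" || return 1
    done
}

# Example call (not executed here):
#   backup_each_subdir /data/video
```

Each run creates its own snapshot with a single path, so this produces one snapshot per directory rather than one snapshot covering everything.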

After finishing this initial backup, I’ll run the backup regularly every few days:

restic backup /data/foto /administration

Is it possible to split the data into smaller parts like this, without having to restructure the data saved in the restic repository? Or is there a better procedure that I have overlooked?

MichaelEischer · July 15, 2021, 9:08pm

If a backup fails midway then all data uploaded up to now is not lost, but still in the repository. A later backup run will have to read all files from disk again, but won’t have to upload them again.
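
Since data that was already uploaded stays in the repository, a simple way to ride out network problems is to rerun the same command until it succeeds. A minimal sketch, assuming restic is configured via its environment variables; the function name and the RETRY_DELAY variable are made up for this example and are not restic features:

```shell
# Rerun "restic backup" until it succeeds or the attempt limit is reached.
# Arguments: maximum number of attempts, then the paths to back up.
# RETRY_DELAY (seconds between attempts, default 60) is a hypothetical knob.
backup_with_retry() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        echo "Attempt $attempt of $max_attempts"
        if restic backup "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $max_attempts attempts" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "${RETRY_DELAY:-60}"
    done
}

# Example call (not executed here):
#   backup_with_retry 20 /data/foto /data/video /administration
```

Every retry still has to read the files from disk again; only the upload of already stored data is skipped.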

Your idea of splitting up the backups should work. There’s one alternative you could try, which has the benefit that the final backup run doesn’t have to read everything again: you can run the backup command with additional paths and tell restic to still use a previous snapshot as its starting point via the --parent <snapshot-id> flag. If the directories contain millions of files, then your approach might be faster. The alternative could look like the following:

# creates snapshot 12345678
restic backup /data/foto/200*
restic backup --parent 12345678 /data/foto/200* /data/foto/201*
...

And there’s an experimental PR you could try, which should allow a failed backup run to resume from where it failed:

github.com/restic/restic

backup: Add resuming from aborted backups

restic:master ← aawsome:backup-resume

opened by aawsome, 01:10PM - 16 Jan 21 UTC (+422 -128)

What does this PR change? What problem does it solve?

This is a follow-up of #3229. In this PR I made a PoC using the idea of user odin from the forum, see link below. With this PR, restic writes a temporary file (in the cache dir to the selected repo, subdir /resume/) which contains the list of already finished directories and the tree ID where it is saved in the repo. If the backup succeeds, the file is removed. At the start of a backup, restic tries to read the temporary file (if existing) and uses the contained trees as "additional parent trees" for the given directories. This means that w.r.t. already saved trees, this resumed backup is as fast as a follow-up backup using a parent snapshot.

Note that this PR relies on the first commits of #3121 which should be merged first.

Was the change discussed in an issue or in the forum before?

closes #2280
alternative to #2960
https://forum.restic.net/t/quicker-interrupted-backup-resumption/3470/6

Checklist

- [x] I have read the [Contribution Guidelines](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#providing-patches)
- [x] I have enabled [maintainer edits for this PR](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
- [ ] I have added tests for all changes in this PR
- I have not added documentation for the changes (in the manual) - there is also none for parent snapshots
- [x] There's a new file in `changelog/unreleased/` that describes the changes for our users (template [here](https://github.com/restic/restic/blob/master/changelog/TEMPLATE))
- [x] I have run `gofmt` on the code in all commits
- [x] All commit messages are formatted in the same style as [the other commits in the repo](https://github.com/restic/restic/blob/master/CONTRIBUTING.md#git-commits)
- [x] I'm done, this Pull Request is ready for review

waldauf · July 18, 2021, 8:53am

I tried the --parent trick and I’m not sure whether it worked. After creating two backups I still see two snapshots. Is that all right?

Overview of snapshots:
ID        Time                 Host         Tags          Paths
---------------------------------------------------------------------------------------------------
7fc1fe9c  2021-07-16 12:58:24  raspberrypi  rpi,etc,home  /data/foto/--=2002_a_min=--
                                                          /data/foto/--=2003=--
                                                          /data/foto/--=2004=--
                                                          /data/foto/--=2005=--
                                                          /data/foto/--=2006=--
                                                          /data/foto/--=2007=--
                                                          /data/foto/--=2008=--
                                                          /data/foto/--=2009=--

9ff825ff  2021-07-16 22:54:24  raspberrypi  rpi,etc,home  /data/foto/--=2010=--
                                                          /data/foto/--=2011=--

I thought that the new dirs would be added to the parent snapshot: 7fc1fe9c.

rawtaz · July 18, 2021, 9:58pm

@waldauf Please include the complete command and output of restic when you ask about results from your backup runs or similar - they contain the information needed to be able to provide an answer, at least oftentimes. For example, the output is what indicates whether the parent snapshot you referenced was used or not.

Snapshots are points in time. When you make a backup, a new snapshot is created - snapshots are never modified. So no, the additional folders you backed up are not “added” to the previous snapshot.
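
To capture the complete output and see at a glance whether the parent was picked up, something like the following sketch could be used. It assumes that restic's verbose output contains a line with the text "using parent snapshot"; the exact wording may differ between versions, and the function name is made up for this example.

```shell
# Run a backup with an explicit parent, print the full output, and report
# whether restic mentioned the parent snapshot.
# Arguments: parent snapshot ID, then the paths to back up.
# ASSUMPTION: verbose output contains the text "using parent snapshot".
backup_and_check_parent() {
    parent=$1
    shift
    out=$(restic backup --verbose --parent "$parent" "$@" 2>&1)
    status=$?
    # Print everything, so the complete output can be pasted into a post.
    printf '%s\n' "$out"
    if printf '%s\n' "$out" | grep -q "using parent snapshot"; then
        echo "parent snapshot was used"
    else
        echo "no parent snapshot reported"
    fi
    return "$status"
}

# Example call (not executed here):
#   backup_and_check_parent 7fc1fe9c /data/foto/200* /data/foto/201*
```

Either way, the run produces a new snapshot; the parent only serves as a reference for detecting unchanged files.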

torfason · July 19, 2021, 12:02pm

To expand slightly on what @rawtaz said: a snapshot is never modified, and the --parent parameter never has any impact on how the backups and repository look. It only makes the backups faster.

But looking at the output (even without seeing the commands), it does look a bit like you ran the following:

restic backup /data/foto/200*
restic backup /data/foto/201* --parent=<parent>

When you should have been running:

restic backup /data/foto/200*
restic backup /data/foto/200* /data/foto/201* --parent=<parent1>
... <repeat as necessary>
restic backup /data/foto --parent=<parentn>

In other words, to get to a full backup in smaller pieces, you add more and more folders in each backup, while using --parent to speed up the process. You do not create multiple snapshots, each with different folders and then combine them.

(And remember that because of deduplication, you are not wasting any space doing this. Any file present in the first backup will not be duplicated in the second.)
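
The procedure of growing the path list step by step can be automated roughly as follows. This is a sketch under assumptions: it expects restic to end a successful run with a line of the form "snapshot <id> saved" and parses the ID from it, and the function name is hypothetical. If no ID can be parsed, the next stage simply runs without --parent.

```shell
# Staged initial backup: every stage adds one more path and uses the
# snapshot of the previous stage as parent.
# ASSUMPTION: restic prints "snapshot <id> saved" after a successful run.
backup_in_stages() {
    parent=""
    first=1
    # The loop's word list is expanded once, so "$@" can be reused inside
    # the loop to hold the growing list of paths.
    for dir in "$@"; do
        if [ "$first" = 1 ]; then
            set -- "$dir"
            first=0
        else
            set -- "$@" "$dir"
        fi
        if [ -n "$parent" ]; then
            out=$(restic backup --parent "$parent" "$@") || return 1
        else
            out=$(restic backup "$@") || return 1
        fi
        printf '%s\n' "$out"
        # Remember the new snapshot ID as the parent for the next stage.
        parent=$(printf '%s\n' "$out" | sed -n 's/^snapshot \([0-9a-f]*\) saved.*$/\1/p')
        echo "stage finished: $# path(s), snapshot ${parent:-unknown}"
    done
}

# Example call (not executed here); each matched directory becomes one stage:
#   backup_in_stages /data/foto/200* /data/foto/201* /data/foto/202*
```

The ID printed by the last stage can then be passed as --parent to the final run over /data/foto as a whole.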

donisewell · July 20, 2021, 4:30pm

My initial backup to Google Drive was ~13TB. Took me over a month of non-stop uploading with failures (reboots, lost connection, etc.) in between. Restic chugged along after re-connecting/re-starting the backup and I finally got there.

I wouldn’t bother breaking it up.