Prune performance

I am really looking forward to @alexweiss’s fix to prune, but this thread was originally comparing prune on restic to Duplicacy. I see a few reasons why Duplicacy is fast. This is based only on casual observation and may be wrong.

  • It is an index-only operation
    In Duplicacy the remote file names are a hash of the contents, and indexes are cached locally. After doing a full directory listing of the remote side, and possibly downloading some “snapshot” files created by other hosts writing to the same repository, the prune command has everything it needs to determine what to prune. @alexweiss’s new prune is similar.

  • It is more willing to trade off wasted space for performance
    Because of Duplicacy’s chunking model, it doesn’t have packs like restic and doesn’t need to deal with partially populated packs when doing a prune. Instead, the backup operation always creates fully populated chunks, and the “snapshot” equivalent lists the chunks needed for each backup. Chunks that are no longer referenced become stale. So if a backup changes a small file in the middle of an existing chunk, a whole new chunk is uploaded and the old chunk becomes stale. The trade-off on chunk size is internal fragmentation for large chunks versus external fragmentation for small chunks, which also mean more indexing metadata.

  • Backups and prunes don’t need to be locked
    Backups on Duplicacy are lock-free. The only thing a backup command can do is add files to a repository, so parallel backups have no problem running together. A prune runs in two phases. First, it deletes any files that were marked for pruning more than a week ago. Then it finds files that are no longer needed by the current backups and renames them into a fossil directory. Files have to live in that fossil directory for a week before being deleted, and if a backup needs data from the fossil directory, it will use it and move the file back out of that directory. This way, as long as a single backup completes in less than a week, prunes can run without locks. Very nice.
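
To make the two-phase idea concrete, here is a toy in-memory model of how I understand it. It is only a sketch of the description above, not Duplicacy's actual code: the names `Repo`, `backup`, `prune` and `WEEK` are made up for illustration, and the real tool almost certainly has more safety checks than this.

```python
from dataclasses import dataclass, field

# Grace period a chunk spends as a fossil before deletion (assumed: one week,
# as described in the post). Times are plain seconds for simplicity.
WEEK = 7 * 24 * 3600


@dataclass
class Repo:
    """Hypothetical stand-in for the remote storage."""
    chunks: set = field(default_factory=set)       # live chunk names (content hashes)
    fossils: dict = field(default_factory=dict)    # chunk name -> time it became a fossil
    snapshots: dict = field(default_factory=dict)  # snapshot id -> set of chunk names


def backup(repo, snapshot_id, needed):
    """Add-only operation: never deletes, so it needs no lock."""
    for name in needed:
        # If a needed chunk was already fossilized, move it back out.
        repo.fossils.pop(name, None)
        # "Uploading" a chunk that already exists is a no-op for a set.
        repo.chunks.add(name)
    repo.snapshots[snapshot_id] = set(needed)


def prune(repo, now):
    """Two-phase prune; returns (deleted, newly_fossilized)."""
    # Phase 1: permanently delete fossils that have aged past the grace period.
    deleted = {n for n, t in repo.fossils.items() if now - t >= WEEK}
    for name in deleted:
        del repo.fossils[name]

    # Phase 2: index-only check. Anything in the listing that no snapshot
    # references is stale, so rename it into the fossil area.
    referenced = set().union(*repo.snapshots.values())
    stale = repo.chunks - referenced
    for name in stale:
        repo.chunks.remove(name)
        repo.fossils[name] = now
    return deleted, stale
```

Note that `prune` only ever looks at chunk names and snapshot contents, never at chunk data, which is the "index only" point from the first bullet. A chunk is only deleted for real after sitting in `fossils` for `WEEK`, which gives any backup that was running during the prune time to finish and pull the chunk back.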

And for most repository sizes, the approach used by Duplicacy is a nice trade-off.
