Reimplementing restic ls: Trouble with empty directories in snapshot

dsflsdflk · December 20, 2023, 8:24pm

I’m currently trying to write a Python script which is able to read a restic snapshot and list its contents.
When I try to read the blob of the subtree of an empty directory, however, I get data that is not valid JSON:

b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'

I can clearly see that the relevant part ({"nodes": []}) is right at the end of the received data, but I don’t understand the bytes before that part.

Does somebody know where these bytes come from and how I should interpret/handle them?

rawtaz · December 20, 2023, 8:39pm

Naturally, you will want to show the actual code that produces the error, along with the complete error message/trace.

dsflsdflk · December 20, 2023, 8:44pm

Absolutely! Unfortunately it is already quite long …
I was hoping that showing this part would already give sufficient information, but I will try and see if I can give you a minimal working (or rather not working) example without spamming this post with too much irrelevant code.

rawtaz · December 20, 2023, 8:56pm

Maybe we can keep it simple - can you reproduce it with a minimal case where you have the same data in a variable and run the JSON decoding on that, and get the error? Then you can just show that code.

dsflsdflk · December 20, 2023, 10:48pm

This is my attempt at a (kind of) minimal, (kind of) reproducible example:

Longer explanation

The restic test repository was encrypted with the password "abc". As a test snapshot I backed up a single empty directory named "a" with restic.

My script currently gets the snapshot data:

{'time': '2023-12-20T21:05:01.168872983+01:00',
 'tree': '2a2d34b656d01f27573b2978fd592b07cf65fdbf2a6c48897bd2a9d5a8f7f20e',
 'paths': ['/.../test'],
 'hostname': '...',
 'username': '...',
 'uid': 1000,
 'gid': 1000}

then reads the blob specified in the “tree” field of snapshot data and builds a tree object from it:

TreeData(
    nodes=[
        NodeData(
            name='a',
            type='dir',
            ...
            subtree='ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8',
            ...
        )
    ]
)

In order to read the above subtree (ac08…) I check the index and find the following information for ac08…:

{
    "packs":[
        {
            "id":"96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf",
            "blobs":[
                {
                    "id":"ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8",
                    "type":"tree",
                    "offset":0,
                    "length":54,
                    "uncompressed_length":13
                },
                ...
            ]
        }
    ]
}

(To verify, I also ran restic cat index and got the same result.)

I see that the subtree is contained in the pack 96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf with offset 0 and length 54.
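
The lookup described above can be sketched roughly as follows. This is only an illustration, not the original script: `find_blob` and `slice_blob` are hypothetical helper names, the index excerpt contains only the one blob entry shown above (the elided entries are left out), and the pack file contents are assumed to be available as a bytes object.

```python
import json
from typing import Optional, Tuple

# Excerpt of the index shown above; the elided entries ("...") are omitted.
INDEX_JSON = """
{
    "packs": [
        {
            "id": "96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf",
            "blobs": [
                {
                    "id": "ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8",
                    "type": "tree",
                    "offset": 0,
                    "length": 54,
                    "uncompressed_length": 13
                }
            ]
        }
    ]
}
"""

SUBTREE_ID = "ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8"


def find_blob(index: dict, blob_id: str) -> Optional[Tuple[str, dict]]:
    """Return (pack_id, blob_entry) for blob_id, or None if it is not indexed."""
    for pack in index["packs"]:
        for blob in pack["blobs"]:
            if blob["id"] == blob_id:
                return pack["id"], blob
    return None


def slice_blob(pack_bytes: bytes, entry: dict) -> bytes:
    """Cut the still-encrypted blob out of the raw contents of a pack file."""
    start = entry["offset"]
    return pack_bytes[start:start + entry["length"]]


index = json.loads(INDEX_JSON)
pack_id, entry = find_blob(index, SUBTREE_ID)
print(pack_id[:8], entry["offset"], entry["length"])
```

The slice returned by `slice_blob` is still encrypted; decryption and decompression are separate steps.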

The encrypted tree blob looks like this (in base64):

WIKi5myToDDC/nLhW8G3l3Mld8uppLNnD2hpLsZWFldYuSQmmhmfyJVayJWaWnaYXdtnX+fC

I decrypt it using the following master key:

{
  "mac": {
    "k": "2w1lKAZ7BnBJPE7OiWmTeg==",
    "r": "jvqsBGjynwHk0kIBdBltAQ=="
  },
  "encrypt": "Sjwm07HzbUqo+UqBapT5TzGbBnaWyzhZi5zgzsDUmdw="
}
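
For orientation, here is a minimal sketch of how the 54 bytes break down before decryption. It assumes the blob layout described in restic's design document (a 16-byte IV, then the ciphertext, then a 16-byte MAC). `split_blob`, `IV_SIZE` and `MAC_SIZE` are names made up for this illustration, and the sketch performs neither the decryption nor the MAC check.

```python
import base64
from typing import Tuple

IV_SIZE = 16   # assumption: size of the AES-CTR initialisation vector
MAC_SIZE = 16  # assumption: size of the Poly1305-AES tag


def split_blob(blob: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split an encrypted blob into (iv, ciphertext, mac) without decrypting it."""
    if len(blob) < IV_SIZE + MAC_SIZE:
        raise ValueError("blob is too short to contain an IV and a MAC")
    iv = blob[:IV_SIZE]
    ciphertext = blob[IV_SIZE:-MAC_SIZE]
    mac = blob[-MAC_SIZE:]
    return iv, ciphertext, mac


# The encrypted tree blob from this post.
blob = base64.b64decode(
    "WIKi5myToDDC/nLhW8G3l3Mld8uppLNnD2hpLsZWFldYuSQmmhmfyJVayJWaWnaYXdtnX+fC"
)
iv, ciphertext, mac = split_blob(blob)
print(len(blob), len(iv), len(ciphertext), len(mac))
```

If the assumed layout holds, 22 bytes of ciphertext remain. That is also the length of the decrypted bytestring shown in this post: 9 bytes in front of the 13 bytes of JSON.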

After successful decryption and MAC authentication (in Python) I get the following raw data for the subtree:

b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'

Loading this with json results in a UnicodeDecodeError:

import json
data = b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'
json.loads(data)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte

I hope this makes things clearer.

alexweiss · December 21, 2023, 6:47am

You missed the step to decompress the blob…
Update: The string is too short to benefit from compression, so the plaintext is still visible inside the zstd frame. For longer trees, you most likely will no longer be able to identify the content until you decompress it using zstd.
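
A rough sketch of where that step could sit in a reader. It assumes, based on restic's repository format documentation, that only compressed blobs carry `uncompressed_length` in the index. `load_tree` and its `decompress` parameter are hypothetical names, and the actual zstd call is left to whichever library is used.

```python
import json
from typing import Callable

ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528, stored little-endian


def needs_decompression(entry: dict) -> bool:
    # Assumption: "uncompressed_length" is only present for compressed blobs.
    return "uncompressed_length" in entry


def load_tree(plaintext: bytes, entry: dict,
              decompress: Callable[[bytes], bytes]) -> dict:
    """Turn a decrypted tree blob into a dict, decompressing it first if needed."""
    if needs_decompression(entry):
        if not plaintext.startswith(ZSTD_MAGIC):
            raise ValueError("index says compressed, but no zstd magic number found")
        plaintext = decompress(plaintext)
        if len(plaintext) != entry["uncompressed_length"]:
            raise ValueError("unexpected length after decompression")
    return json.loads(plaintext)


def no_decompressor(data: bytes) -> bytes:
    # Placeholder for a real zstd decompression function.
    raise RuntimeError("no decompressor configured")


# An uncompressed blob passes straight through to the JSON parser.
uncompressed_entry = {"type": "tree", "offset": 0, "length": 45}
print(load_tree(b'{"nodes":[]}\n', uncompressed_entry, no_decompressor))
```

Checking the result against `uncompressed_length` gives a cheap sanity check that the right bytes were decompressed.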

alexweiss · December 21, 2023, 7:48am

About the first 4 bytes: 28 B5 2F FD is the magic number of a Zstandard frame (0xFD2FB528, stored little-endian), see https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#zstandard-frames
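
To make the remaining header bytes less mysterious, here is a minimal sketch that walks through the frame by hand, following the frame and block header layout in the linked format description. It is not a zstd decoder: `decode_raw_only_frame` is a made-up helper that only handles frames consisting of raw (stored) blocks, which happens to be the case for this short tree, and it ignores the optional checksum.

```python
ZSTD_MAGIC = 0xFD2FB528


def decode_raw_only_frame(frame: bytes) -> bytes:
    """Decode a zstd frame that contains only raw (uncompressed) blocks."""
    if int.from_bytes(frame[:4], "little") != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    pos = 4

    # Frame_Header_Descriptor: one byte of flags.
    descriptor = frame[pos]
    pos += 1
    fcs_flag = descriptor >> 6             # Frame_Content_Size_flag
    single_segment = (descriptor >> 5) & 1
    dict_id_flag = descriptor & 3

    # Optional header fields, in the order given by the format description.
    if not single_segment:
        pos += 1                           # Window_Descriptor
    pos += (0, 1, 2, 4)[dict_id_flag]      # Dictionary_ID
    if fcs_flag == 0:
        pos += 1 if single_segment else 0  # Frame_Content_Size
    else:
        pos += (0, 2, 4, 8)[fcs_flag]

    out = bytearray()
    while True:
        if pos + 3 > len(frame):
            raise ValueError("truncated block header")
        # Block_Header: 3 bytes, little-endian.
        header = int.from_bytes(frame[pos:pos + 3], "little")
        pos += 3
        last_block = header & 1
        block_type = (header >> 1) & 3
        block_size = header >> 3
        if block_type != 0:
            raise NotImplementedError("only raw blocks are handled in this sketch")
        out += frame[pos:pos + block_size]
        pos += block_size
        if last_block:
            return bytes(out)


data = b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'
print(decode_raw_only_frame(data))
```

For the bytes in question, the descriptor byte is 0, a one-byte window descriptor follows, and the block header `i\x00\x00` (0x000069) marks a last block of raw type with size 13. That matches the `uncompressed_length` of 13 in the index.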

dsflsdflk · December 21, 2023, 12:58pm

Thanks, this is indeed the problem. I was confused at first because zstd decompression failed for these short trees and it looked to me like it was actually plaintext.

But it seems that I am doing something wrong when using the zstandard library in Python.
Currently it looks like this:

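import zstandard

def _decompress(data: bytes) -> bytes:
    # Allow window sizes of up to 2 GiB (2**31 bytes).
    zdec = zstandard.ZstdDecompressor(max_window_size=2147483648)
    # Read the whole decompressed stream into memory.
    with zdec.stream_reader(data) as stream_reader:
        result = stream_reader.read()
    return result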

But trying to decompress the above bytestring throws an error:

_decompress(b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n')

zstd.ZstdError: zstd decompress error: Unknown frame descriptor

This should be solvable though. I will update as soon as I have found the solution.