Reimplementing restic ls: Trouble with empty directories in snapshot

I’m currently trying to write a Python script which is able to read a restic snapshot and list its contents.
When I try to read the blob of the subtree of an empty directory, however, I get data that is not valid JSON:

b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'

I can clearly see that the relevant part ({"nodes": []}) is right at the end of the received data, but I don’t understand the bytes before that part.

Does somebody know where these bytes come from and how I should interpret/handle them?

Naturally you want to show the actual code that is giving the error, and the complete error message/trace :wink:

Absolutely! Unfortunately it is already quite long … :smiley:
I was hoping that showing this part would already give sufficient information, but I will try and see if I can give you a minimal working (or rather not working) example without spamming this post with too much irrelevant code.

Maybe we can keep it simple - can you reproduce it with a minimal case where you have the same data in a variable and run the JSON decoding on that, and get the error? Then you can just show that code.

This is my attempt at a (kind of) minimal (kind of) reproducible example:

Longer explanation: The restic test repository was encrypted with the password "abc". As a test snapshot, I backed up a single empty directory named "a" with restic.

My script currently fetches the snapshot data:

{'time': '2023-12-20T21:05:01.168872983+01:00',
 'tree': '2a2d34b656d01f27573b2978fd592b07cf65fdbf2a6c48897bd2a9d5a8f7f20e',
 'paths': ['/.../test'],
 'hostname': '...',
 'username': '...',
 'uid': 1000,
 'gid': 1000}

then reads the blob specified in the “tree” field of snapshot data and builds a tree object from it:

TreeData(
    nodes=[
        NodeData(
            name='a',
            type='dir',
            ...
            subtree='ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8',
            ...
        )
    ]
)
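For orientation, the listing itself then comes down to a recursive walk over such trees. The following is only a rough sketch under assumptions, not the actual script: trees are plain dicts shaped like restic's tree JSON (`nodes`, `name`, `type`, `subtree`), and `load_tree` is a hypothetical callable standing in for the index lookup, decryption and decoding steps described below.

```python
from typing import Callable, Iterator

# Hypothetical type: maps a tree blob ID to the decoded tree JSON (a dict).
TreeLoader = Callable[[str], dict]


def walk(tree_id: str, load_tree: TreeLoader, prefix: str = "") -> Iterator[str]:
    """Yield the path of every node below tree_id, depth first."""
    for node in load_tree(tree_id).get("nodes") or []:
        path = f"{prefix}/{node['name']}"
        yield path
        # Directories reference their contents via another tree blob.
        if node.get("type") == "dir" and node.get("subtree"):
            yield from walk(node["subtree"], load_tree, path)


# Toy in-memory "repository" mirroring the example: one empty directory "a".
# The keys "root" and "empty" are placeholders for real blob IDs.
trees = {
    "root": {"nodes": [{"name": "a", "type": "dir", "subtree": "empty"}]},
    "empty": {"nodes": []},
}
paths = list(walk("root", trees.__getitem__))
```

An empty directory simply contributes its own path and no children, which is why its subtree blob still has to be readable.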

In order to read the above subtree (ac08…) I check the index and find the following information for ac08…:

{
    "packs":[
        {
            "id":"96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf",
            "blobs":[
                {
                    "id":"ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8",
                    "type":"tree",
                    "offset":0,
                    "length":54,
                    "uncompressed_length":13
                },
                ...
            ]
        }
    ]
}

(To verify I also use restic cat index and get the same result)

I see that the subtree is contained in the pack 96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf with offset 0 and length 54.
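The lookup above can be sketched as a scan over the index JSON. This is a minimal illustration, assuming the index has already been decrypted and parsed into a dict of the shape shown; `find_blob` is a hypothetical helper name.

```python
from typing import Optional


def find_blob(index: dict, blob_id: str) -> Optional[dict]:
    """Return the index entry for blob_id plus its pack ID, or None if absent."""
    for pack in index.get("packs", []):
        for blob in pack.get("blobs", []):
            if blob["id"] == blob_id:
                return {"pack": pack["id"], **blob}
    return None


index = {
    "packs": [
        {
            "id": "96e29d9f2abfc229087e8e2c2baf39eb17bb85b13d8630d0c0c6550626aedecf",
            "blobs": [
                {
                    "id": "ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8",
                    "type": "tree",
                    "offset": 0,
                    "length": 54,
                    "uncompressed_length": 13,
                }
            ],
        }
    ]
}

entry = find_blob(
    index, "ac08ce34ba4f8123618661bef2425f7028ffb9ac740578a3ee88684d2523fee8"
)
# The encrypted blob is then pack_bytes[offset:offset + length].
```

For a repository with many index files it would probably be better to build a dict keyed by blob ID once instead of scanning on every lookup.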

The encrypted tree blob looks like this (in base64):

WIKi5myToDDC/nLhW8G3l3Mld8uppLNnD2hpLsZWFldYuSQmmhmfyJVayJWaWnaYXdtnX+fC
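The 54 bytes break down as follows. According to restic's design document, an encrypted blob is laid out as a 16-byte IV, the AES-256-CTR ciphertext, and a 16-byte Poly1305-AES MAC, so the plaintext here should be 54 - 32 = 22 bytes long. The sketch below only splits the blob and performs no decryption or authentication; `split_blob` is a hypothetical helper name.

```python
import base64
from typing import Tuple

IV_SIZE = 16   # AES-CTR initialisation vector
MAC_SIZE = 16  # Poly1305-AES tag


def split_blob(blob: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split an encrypted restic blob into (iv, ciphertext, mac)."""
    if len(blob) < IV_SIZE + MAC_SIZE:
        raise ValueError("blob too short to contain IV and MAC")
    return blob[:IV_SIZE], blob[IV_SIZE:-MAC_SIZE], blob[-MAC_SIZE:]


blob = base64.b64decode(
    "WIKi5myToDDC/nLhW8G3l3Mld8uppLNnD2hpLsZWFldYuSQmmhmfyJVayJWaWnaYXdtnX+fC"
)
iv, ciphertext, mac = split_blob(blob)
print(len(blob), len(iv), len(ciphertext), len(mac))  # 54 16 22 16
```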

I decrypt it using the following master key:

{
  "mac": {
    "k": "2w1lKAZ7BnBJPE7OiWmTeg==",
    "r": "jvqsBGjynwHk0kIBdBltAQ=="
  },
  "encrypt": "Sjwm07HzbUqo+UqBapT5TzGbBnaWyzhZi5zgzsDUmdw="
}

After successful decryption and MAC authentication (in Python) I get the following raw data for the subtree:

b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'

Loading this with json results in a UnicodeDecodeError:

import json
data = b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'
json.loads(data)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte

I hope this makes things clearer.

You missed the step to decompress the blob…
Update: The string is evidently too short to compress, so zstd stored it verbatim inside the frame and you can still see the plaintext. For longer trees, you most likely will no longer be able to identify the content until you decompress it using zstd.
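One way to decide whether this step is needed is the index entry itself: to my knowledge, in the repository format with compression the `uncompressed_length` field is only present for compressed blobs. A minimal sketch under that assumption; `decode_tree` and its `decompress` parameter are hypothetical names, and the decompressor is passed in so that any zstd binding can be plugged in.

```python
import json
from typing import Callable


def decode_tree(plaintext: bytes, entry: dict,
                decompress: Callable[[bytes], bytes]) -> dict:
    """Decode a decrypted tree blob, decompressing it first if required."""
    # Assumption: the index only carries "uncompressed_length" for
    # blobs that were stored zstd-compressed.
    if "uncompressed_length" in entry:
        plaintext = decompress(plaintext)
        if len(plaintext) != entry["uncompressed_length"]:
            raise ValueError("unexpected length after decompression")
    return json.loads(plaintext)


# Uncompressed blob: the decompressor is never called.
tree = decode_tree(b'{"nodes":[]}\n', {"type": "tree"}, decompress=lambda b: b)
```

Checking the decompressed length against the index is optional, but it catches the case where the wrong bytes were handed to the decompressor.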

About the first 4 bytes: These are obviously the magic number of a Zstandard frame, see https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#zstandard-frames
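To make the remaining bytes concrete, here is a rough sketch that walks the frame by hand, following the layout in that document: magic number, frame header, then blocks with a 3-byte header each. It is only an illustration and not a decompressor: it handles raw (stored) blocks only and ignores any trailing checksum. `parse_raw_frame` is a hypothetical name.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528, stored little-endian


def parse_raw_frame(data: bytes) -> bytes:
    """Extract the content of a zstd frame that contains only raw blocks."""
    if data[:4] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    descriptor = data[4]
    single_segment = (descriptor >> 5) & 1
    fcs_flag = descriptor >> 6
    dict_id_size = (0, 1, 2, 4)[descriptor & 3]
    fcs_size = (single_segment, 2, 4, 8)[fcs_flag]
    # Skip descriptor, optional window descriptor, dictionary ID, content size.
    pos = 5 + (1 - single_segment) + dict_id_size + fcs_size
    out = bytearray()
    while True:
        if pos + 3 > len(data):
            raise ValueError("truncated frame")
        header = int.from_bytes(data[pos:pos + 3], "little")
        last_block = header & 1
        block_type = (header >> 1) & 3
        block_size = header >> 3
        pos += 3
        if block_type != 0:
            raise NotImplementedError("only raw blocks are handled here")
        out += data[pos:pos + block_size]
        pos += block_size
        if last_block:
            return bytes(out)


data = b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n'
print(parse_raw_frame(data))  # b'{"nodes":[]}\n'
```

Read this way, the two bytes after the magic number are the frame header (no content size, smallest window), and `i\x00\x00` (0x000069) is the header of a final raw block of 13 bytes, which is why the JSON is visible in the clear.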

Thanks, this is indeed the problem. I was confused at first because zstd decompression failed for these short trees, and the data looked to me like it was actually plaintext.

But it seems that I am doing something wrong when using the zstandard library in Python.
Currently it looks like this:

import zstandard

def _decompress(data: bytes) -> bytes:
    # Allow window sizes up to 2 GiB (2**31 bytes) and read the whole stream.
    zdec = zstandard.ZstdDecompressor(max_window_size=2147483648)
    with zdec.stream_reader(data) as stream_reader:
        result = stream_reader.read()
    return result

But trying to decompress the above bytestring throws an error:

_decompress(b'(\xb5/\xfd\x00\x00i\x00\x00{"nodes":[]}\n')
zstd.ZstdError: zstd decompress error: Unknown frame descriptor

This should be solvable, though. I will post an update as soon as I have found the solution.