There’d have to be intentional work done to make the “list object” request parallel, because it’s not as simple as “run a bunch of requests on different threads.”
The S3 ListObjects call can only return 1,000 objects at a time. If there are more objects to list, it returns a marker value that you pass to the next ListObjects call to continue where the previous one left off. This means you can't request a page until the previous page has returned, which precludes any concurrency.
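To make the dependency concrete, here's a minimal sketch of the marker loop. The names (`fake_list_objects`, `list_all`) and the in-memory backing list are illustrative stand-ins, not restic or AWS SDK APIs; the real call would go over the network.

```python
# Stand-in for the paginated S3 ListObjects call: returns up to
# PAGE_SIZE keys that sort after `marker`, plus the next marker
# ("" when the listing is complete).
PAGE_SIZE = 1000

def fake_list_objects(keys, prefix, marker=""):
    matching = sorted(k for k in keys if k.startswith(prefix) and k > marker)
    page = matching[:PAGE_SIZE]
    next_marker = page[-1] if len(matching) > PAGE_SIZE else ""
    return page, next_marker

def list_all(keys, prefix):
    """List every key under `prefix`. Each iteration needs the marker
    from the previous one, so this loop is inherently sequential."""
    out, marker = [], ""
    while True:
        page, marker = fake_list_objects(keys, prefix, marker)
        out.extend(page)
        if not marker:
            return out
```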
However, since restic stores packs under a name that's the hex representation of the SHA256 sum of the contents, we can enumerate, say, a byte's worth of prefixes concurrently (`00` through `ff`). On small repositories this can result in many more list operations than necessary, so there would need to be some kind of initial test to see whether it's beneficial. Off the top of my head, a reasonable implementation could be:
- List all objects under `data/`.
- If the result set is incomplete (more list API calls are needed to fetch the full set) and the results contain any pack that doesn't start with `data/00/` (that is, there are fewer than 1,000 packs starting with `00`), then proceed as usual, performing one list operation at a time. There will likely be fewer than 256 API calls needed to finish listing the objects.
- Otherwise, there are likely 1,000 or more packs with each prefix.¹ In this case, start 256 parallel list operations, one for each prefix from `data/00/` through `data/ff/`.
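The "otherwise" branch could be sketched like this. `list_prefix` is a hypothetical callable standing in for the real per-prefix S3 listing; in practice each task would run the sequential marker loop for its own prefix, and the worker count would be tuned rather than hard-coded.

```python
# Fan-out sketch: one listing task per two-hex-digit prefix, run
# concurrently on a thread pool. `list_prefix` is a stand-in for the
# real paginated listing of a single prefix.
from concurrent.futures import ThreadPoolExecutor

def list_packs_parallel(list_prefix, workers=32):
    prefixes = [f"data/{b:02x}/" for b in range(256)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of `prefixes` in its results.
        pages = list(pool.map(list_prefix, prefixes))
    # Flatten; if each per-prefix listing is sorted, the combined
    # result is sorted too, since the prefixes are enumerated in order.
    return [key for page in pages for key in page]
```

Bounding the pool at some number of workers (rather than literally 256 simultaneous requests) also gives a knob for backends that rate-limit aggressively.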
@fd0 Has any work been done in this area (making list object calls parallel)? If not, does this seem like a sane optimization?
¹ This takes advantage of the fact that, as the number of files being hashed increases (with contents that are largely unpredictable), the distribution of the hash function's output approaches uniformity over its range. We can therefore reasonably conclude that if there are N packs with a particular prefix, there are likely to be approximately N packs with every other (non-intersecting) prefix.