We're in the process of migrating from https://github.com/quora/asynq to asyncio using dataloaders. Without getting too far into the details, asynq provides the same general premise of batching and caching as dataloaders do, but it uses Python generators and a custom scheduler rather than the asyncio event loop.
In our fairly large application, switching to asyncio dataloaders adds significant overhead (multiple seconds of end-to-end latency), and this comes down to the cost of asyncio.gather.
Consider the following trivialized example:
```python
import asyncio
import dataclasses
import time

from strawberry.dataloader import DataLoader


@dataclasses.dataclass(frozen=True)
class Result:
    x: int = 0


async def load_fn(keys):
    await asyncio.sleep(0.1)  # Synthesize network latency
    return [Result(i) for i in keys]


async def non_gather_version(size, loader):
    awaitables = []
    for i in range(size):
        awaitables.append(loader.load(i))
    result = []
    for a in awaitables:
        result.append(await a)
    return result


def main():
    r = 100000
    loader = DataLoader(load_fn=load_fn, cache=False)
    loop = asyncio.get_event_loop()

    t0 = time.time()
    loop.run_until_complete(loader.load_many(range(r)))
    print("Asyncio Gather Version:", time.time() - t0)

    t0 = time.time()
    loop.run_until_complete(non_gather_version(r, loader))
    print("Non Gather Version:", time.time() - t0)

    t0 = time.time()
    loop.run_until_complete(load_fn(range(r)))
    print("Direct Version:", time.time() - t0)


if __name__ == "__main__":
    main()
```
On my modest AWS machine I get the following results:

```
Asyncio Gather Version: 1.255368947982788
Non Gather Version: 0.6231565475463867
Direct Version: 0.1942920684814453
```
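To isolate gather from the DataLoader machinery, a micro-benchmark along these lines (all names here are mine, not from our codebase) compares `asyncio.gather` against pre-scheduling the same work as tasks and awaiting them one by one:

```python
import asyncio
import time


async def noop():
    return 1


async def gather_version(n):
    # gather wraps every coroutine in a task and tracks completion
    # of the whole group, which has per-child bookkeeping cost.
    return await asyncio.gather(*(noop() for _ in range(n)))


async def sequential_version(n):
    # Schedule the same work as tasks up front, then await each task
    # in order, skipping gather's group bookkeeping.
    tasks = [asyncio.ensure_future(noop()) for _ in range(n)]
    return [await t for t in tasks]


def bench(coro_factory, n=100_000):
    loop = asyncio.new_event_loop()
    t0 = time.time()
    result = loop.run_until_complete(coro_factory(n))
    loop.close()
    return time.time() - t0, result


if __name__ == "__main__":
    print("gather:", bench(gather_version)[0])
    print("sequential await:", bench(sequential_version)[0])
```

Both versions schedule identical work, so any difference in wall time is attributable to gather itself.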
The overhead from gather is readily apparent, and we have seen significantly worse in our non-trivial workloads. Our codebase is designed around wide and deep call-hierarchy fan-outs, where we use multiple dataloaders at various depths in the call stack. At each fan-out point we consolidate the child awaitables with a call to gather. When this is scaled up to hundreds of thousands of objects (mostly duplicated or cached by the dataloaders), we incur a performance penalty we cannot bear.
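To make the fan-out shape concrete, here is a minimal sketch of the pattern (not our actual code; `load` is a hypothetical stand-in for a DataLoader-backed fetch). Every non-leaf level issues its own gather, so gather calls multiply with the width and depth of the tree:

```python
import asyncio


async def load(key):
    # Stand-in for loader.load(key); yields to the event loop.
    await asyncio.sleep(0)
    return key


async def resolve(key, depth):
    # Each object fans out to several children, and each level
    # consolidates its child awaitables with a gather.
    value = await load(key)
    if depth == 0:
        return value
    children = await asyncio.gather(
        *(resolve(key * 10 + i, depth - 1) for i in range(3))
    )
    return [value, children]


async def main():
    # 3 roots, depth 2, width 3: 39 load calls and 13 gather calls.
    return await asyncio.gather(*(resolve(i, 2) for i in range(3)))
```

At production scale the same shape produces hundreds of thousands of loads, with a gather at every interior node of the tree.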
I don't understand how to use dataloaders in a performant fashion. For them to be useful, they seemingly need to be available at multiple levels of the call stack (otherwise you would just make a single batched call directly yourself), but that also makes them very expensive compared to both calling batches directly and the generator approach of asynq referenced above.