We're in the process of migrating from https://github.com/quora/asynq to asyncio using dataloaders. Without getting too far into the details, asynq provides the same general premise of batching and caching as dataloaders do, but it uses Python generators and a custom scheduler rather than the asyncio event loop.
In our fairly large application, switching to asyncio dataloaders adds significant overhead (multiple seconds of end-to-end latency), and this comes down to the cost of asyncio.gather.
Consider the following trivialized example:
```python
import asyncio
import dataclasses
import time

from strawberry.dataloader import DataLoader


@dataclasses.dataclass(frozen=True)
class Result:
    x: int = 0


async def load_fn(keys):
    await asyncio.sleep(0.1)  # Synthesize network latency
    return [Result(i) for i in keys]


async def non_gather_version(size, loader):
    awaitables = []
    for i in range(size):
        awaitables.append(loader.load(i))
    result = []
    for a in awaitables:
        result.append(await a)
    return result


def main():
    r = 100000
    loader = DataLoader(load_fn=load_fn, cache=False)
    loop = asyncio.get_event_loop()

    t0 = time.time()
    loop.run_until_complete(loader.load_many(range(r)))
    print("Asyncio Gather Version:", time.time() - t0)

    t0 = time.time()
    loop.run_until_complete(non_gather_version(r, loader))
    print("Non Gather Version:", time.time() - t0)

    t0 = time.time()
    loop.run_until_complete(load_fn(range(r)))
    print("Direct Version:", time.time() - t0)


if __name__ == "__main__":
    main()
```
On my modest AWS machine I get the following results:

```
Asyncio Gather Version: 1.255368947982788
Non Gather Version: 0.6231565475463867
Direct Version: 0.1942920684814453
```
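To isolate gather from the DataLoader machinery, a micro-benchmark along these lines (all names here are mine, not from our codebase) compares `asyncio.gather` against pre-scheduling the same work as tasks and awaiting them one by one:

```python
import asyncio
import time


async def noop():
    return 1


async def gather_version(n):
    # gather wraps every coroutine in a task and tracks completion
    # of the whole group, which has per-child bookkeeping cost.
    return await asyncio.gather(*(noop() for _ in range(n)))


async def sequential_version(n):
    # Schedule the same work as tasks up front, then await each task
    # in order, skipping gather's group bookkeeping.
    tasks = [asyncio.ensure_future(noop()) for _ in range(n)]
    return [await t for t in tasks]


def bench(coro_factory, n=100_000):
    loop = asyncio.new_event_loop()
    t0 = time.time()
    result = loop.run_until_complete(coro_factory(n))
    loop.close()
    return time.time() - t0, result


if __name__ == "__main__":
    print("gather:", bench(gather_version)[0])
    print("sequential await:", bench(sequential_version)[0])
```

Both versions schedule identical work, so any difference in wall time is attributable to gather itself.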
The overhead from gather is readily apparent, and we have seen significantly worse in our non-trivial workloads. Our codebase is designed around wide and deep call-hierarchy fan-outs, where we use multiple dataloaders at various depths in the call stack. At each fan-out point we consolidate the child awaitables with a call to gather. When this is scaled up to hundreds of thousands of objects (mostly duplicated or cached by the dataloaders), we incur a performance penalty we cannot bear.
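To make the fan-out shape concrete, here is a minimal sketch of the pattern (not our actual code; `load` is a hypothetical stand-in for a DataLoader-backed fetch). Every non-leaf level issues its own gather, so gather calls multiply with the width and depth of the tree:

```python
import asyncio


async def load(key):
    # Stand-in for loader.load(key); yields to the event loop.
    await asyncio.sleep(0)
    return key


async def resolve(key, depth):
    # Each object fans out to several children, and each level
    # consolidates its child awaitables with a gather.
    value = await load(key)
    if depth == 0:
        return value
    children = await asyncio.gather(
        *(resolve(key * 10 + i, depth - 1) for i in range(3))
    )
    return [value, children]


async def main():
    # 3 roots, depth 2, width 3: 39 load calls and 13 gather calls.
    return await asyncio.gather(*(resolve(i, 2) for i in range(3)))
```

At production scale the same shape produces hundreds of thousands of loads, with a gather at every interior node of the tree.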
I don't understand how to use dataloaders in a performant fashion. For them to be useful, they seemingly need to be available at multiple levels of the call stack (otherwise you would just make a single batched call directly yourself), but that also makes them very expensive compared to both calling batches directly and the generator approach of asynq referenced above.