Also, this might be a good addition to the docs - I believe the newer PyTorch support for M-series chips has made the LLM steps wildly faster than originally experienced on these machines. I didn't bother trying it locally at first, given one of the notes I'd read about CUDA-optimized runs (especially for flag passage reranking, etc.), and instead remoted into a Windows WSL2 instance to run it. Turns out it's about 3x faster on my M3 MacBook (with no memory pinning to contend with there either).
Might be a nice shout-out to that userbase if they're still using workarounds for their eval - something along the lines of the sketch below.
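
A minimal device-selection sketch, assuming plain PyTorch rather than this repo's actual loader code (the model and dataset here are placeholders): pick the MPS backend on Apple Silicon, fall back to CUDA or CPU elsewhere, and only enable `pin_memory` where it applies (CUDA).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Prefer the Apple M-series GPU backend when present, then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# pin_memory is a CUDA host-memory optimization; gate it on the device type
# instead of hard-coding it, so MPS/CPU runs don't need that workaround.
dataset = TensorDataset(torch.randn(8, 16))  # placeholder data, not the project's
loader = DataLoader(dataset, batch_size=4, pin_memory=(device.type == "cuda"))

model = torch.nn.Linear(16, 4).to(device)  # placeholder model
for (batch,) in loader:
    out = model(batch.to(device))
```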