The eval pipeline's score for the SQL tag doesn't look stable across runs. We should look into the cause of the instability: why does the score differ so much from one run to the next?
If all else fails, consider aggregating statistics across several runs to reduce noise.
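As a rough sketch of what aggregating could look like, the snippet below summarizes repeated scores for one tag with the mean, the sample standard deviation, and the standard error of the mean. The function name `summarize_runs` and the score values are hypothetical placeholders, not taken from the pipeline or from any actual run.

```python
import math
import statistics


def summarize_runs(scores):
    """Summarize per-run scores for a single tag (hypothetical helper)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    stderr = stdev / math.sqrt(len(scores))  # standard error of the mean
    return {"n": len(scores), "mean": mean, "stdev": stdev, "stderr": stderr}


# Made-up scores for illustration only; replace with real per-run results.
sql_scores = [0.5, 0.7, 0.6, 0.4, 0.8]
summary = summarize_runs(sql_scores)
print(summary)
```

Reporting the mean together with its standard error would show whether a difference between two pipeline versions is larger than the run-to-run noise. It would not explain where the noise comes from, so it complements the investigation rather than replacing it.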