Hi all,
For the past few weeks, I’ve been working on a set of benchmark test cases that will be used to evaluate and train LLMs to improve their performance on Julia code. I’m particularly interested in test cases where people have tried AI agents and the agents barely failed, i.e. tasks that appear to be just beyond the capability frontier of current leading-edge agents. This’ll all be open source in the medium-term future, but for now I’m keeping the group a little smaller so I can help people get the test cases right, provide API credits for measuring pass rates, etc. If you’re interested in participating, please ping me on Slack.
Thanks!