I know some people in this community are interested in seeing LLMs get better at Julia. But you can’t make progress in machine learning without a good benchmark.
We have started work on a new LLM benchmark that supports Julia. It is very early, but I think it is already much higher quality than prior efforts (including my own prior work on MultiPL-E). Moreover, I think the benchmarking methodology makes it particularly easy to add new problems. The latter is really important, because writing a good benchmark is painful!
There is more information in the repository README, including some preliminary results. If others are interested, I’d be happy to work together. I’m hopeful this will be a useful community resource.