I know some people in this community are interested in seeing LLMs get better at Julia. But you can’t make progress in machine learning without a good benchmark.
We have started work on a new LLM benchmark that supports Julia. It is very early, but I think it is already much higher quality than prior efforts (including my own prior work on MultiPL-E). Moreover, I think the benchmarking methodology makes it particularly easy to add new problems. That matters a lot, because writing a good benchmark is painful!
There is more information in the repository readme, including some preliminary results. If others are interested, I’d be happy to work together. I’m hopeful this will be a useful community resource.
On the idioms angle — what started the original fine-tuning thread was the failures general models make that aren’t wrong, just un-Julian. BigCodeBench-MultiPL is pass@1 functional correctness, so a correct-but-un-idiomatic answer scores the same as an idiomatic one. Is idiom-sensitivity something we’d want in scope?
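For anyone unfamiliar with the metric, here is a minimal sketch of how pass@k is commonly estimated. This is the standard unbiased estimator from the code-generation literature. I am assuming, not confirming, that the benchmark computes its score this way. The `results` list is made-up illustration data, not benchmark output.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed every test, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: any k samples must include a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples passing).
results = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.2f}")
```

The only input is whether each sample passed its tests, so nothing about style enters the score. An idiom-sensitive metric would need a separate signal on top of this.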
Separately, I have an RTX 6000 and I’ve been fine-tuning small models (1–8B), so I’m happy to run candidates against this benchmark and report back. And if enough people want it, I’m open to bootstrapping a from-scratch cloud training run and releasing the weights as open source. I’m gauging interest now before committing the compute. Preferably the model would be something that fits my local setup with 4-bit quantization.
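As a rough sanity check on the 4-bit constraint, here is a back-of-the-envelope sketch of weight memory alone. It assumes every parameter is stored at the stated bit width and ignores quantization metadata, the KV cache, activations, and any optimizer state, so real usage will be higher. The function name is just for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Model sizes in the 1-8B range mentioned above.
for billions in (1, 3, 8):
    n = billions * 1e9
    print(
        f"{billions}B params: "
        f"{weight_memory_gb(n, 4):.1f} GB at 4-bit, "
        f"{weight_memory_gb(n, 16):.1f} GB at 16-bit"
    )
```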
I could commit compute towards this. I do not have time to work on the optimization and design of the system, but if you have the system ready to train I can run the training and give you back the results.
We’re only starting to see benchmarks that test whether code is idiomatic. I think the latest FrontierCode benchmark does that for Python. It would definitely be interesting to do here as well.