[Help Wanted] Help contribute test cases to improve LLM performance on Julia code

Hi all,

For the past few weeks, I’ve been working on creating a set of benchmark test cases that will be used to evaluate and train LLMs to improve their performance on Julia code. I’m particularly interested in test cases that people have tried to use AI agents on where the agents narrowly failed, or that appear to be just beyond the capability frontier of current leading-edge agents. This will all be open source in the medium term, but at the moment I’m keeping the group a little smaller to make sure I can help people get the test cases right, provide API credits to measure pass rates, etc. If you’re interested in participating, please ping me on Slack.

Thanks!


LLMs tend to write slightly outdated code when it comes to Flux.jl, as the package’s API has changed a bit over the last few years. GPT-5 gives me code that declares custom layers with @functor rather than @layer, even though the latter is what recent versions of Flux.jl recommend. Generally speaking, fine-tuning LLMs to write “modern” Julia may be useful.
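For illustration, here’s a minimal sketch of the kind of “modern” layer definition I’d expect, assuming Flux ≥ 0.14 (where @layer supersedes @functor); the layer name and fields are made up:

```julia
using Flux

# Hypothetical custom layer (name and fields invented for illustration).
struct MyAffine{M, V}
    W::M
    b::V
end

MyAffine(in::Int, out::Int) = MyAffine(randn(Float32, out, in), zeros(Float32, out))

# Forward pass: a plain affine transform.
(m::MyAffine)(x) = m.W * x .+ m.b

# Modern Flux (≥ 0.14): @layer registers the struct with Flux's training
# utilities; older tutorials (and LLM output) use Flux.@functor instead.
Flux.@layer MyAffine
```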

Write me a test case for it and that’ll happen :).

It gives bad code for DataFramesMeta.jl; it ends up being some weird mishmash of DataFramesMeta and dplyr.
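As a rough sketch of the target style (assuming current DataFramesMeta.jl; the data and column names are invented), idiomatic code leans on @chain plus the column macros rather than dplyr-style verbs:

```julia
using DataFrames, DataFramesMeta

df = DataFrame(group = ["a", "a", "b"], x = [1, 2, 3])

# Idiomatic DataFramesMeta: @chain with @subset/@transform/@combine,
# referring to columns as :col, not dplyr-style filter/mutate/summarise.
result = @chain df begin
    @subset(:x .> 1)
    @transform(:y = :x .^ 2)
    groupby(:group)
    @combine(:total = sum(:y))
end
```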