[Help Wanted] Help contribute test cases to improve LLM performance on Julia code

Hi all,

For the past few weeks, I’ve been working on creating a set of benchmark test cases that will be used to evaluate and train LLMs to improve their performance on Julia code. I’m particularly interested in test cases where people have tried to use AI agents and the agents only barely failed, i.e. tasks that appear to be just beyond the capability frontier of current leading-edge agents. This will all be open source in the medium term, but at the moment I’m keeping the group a little smaller to make sure I can help people get the test cases right, provide API credits to measure pass rates, etc. If you’re interested in participating, please ping me on Slack.

Thanks!


LLMs tend to write slightly outdated code when it comes to Flux.jl, as the package API changed a bit over the last few years. GPT-5 gives me code that declares custom layers with @functor rather than @layer, but the latter is recommended in recent versions of Flux.jl. Generally speaking, fine-tuning LLMs to write “modern” Julia may be useful.

Write me a test case for it and that’ll happen :).

It gives bad code for DataFramesMeta.jl as well; the output ends up being a weird mishmash of DataFramesMeta and dplyr.
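For reference, a minimal sketch of what idiomatic DataFramesMeta.jl looks like (the column names and data here are made up for illustration); the dplyr-flavored output typically mixes in things like %>%, filter, or mutate, which don’t exist in the Julia package:

using DataFrames, DataFramesMeta

df = DataFrame(name = ["a", "b", "c"], score = [1, 5, 3])

# @chain threads the result of each macro into the next;
# columns are referenced with the : prefix inside the macros
result = @chain df begin
    @subset(:score .> 2)                # row filtering (dplyr's filter)
    @transform(:double = 2 .* :score)   # new column (dplyr's mutate)
    @select(:name, :double)             # column selection (dplyr's select)
end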

Prompt to GPT-5:

Write a minimal example for defining a custom layer in Flux.jl. Only give code not explanations.

The response was (WARNING: invalid code below)

using Flux

struct MyLayer
    W::Matrix{Float32}
    b::Vector{Float32}
end

Flux.@functor MyLayer

MyLayer(in, out) = MyLayer(Flux.glorot_uniform(out, in), zeros(Float32, out))

(m::MyLayer)(x) = m.W * x .+ m.b

m = MyLayer(3, 2)
x = rand(Float32, 3, 5)
y = m(x)

gs = gradient(params(m)) do
    sum(abs2, m(x))
end

The code has a minor error (params should be Flux.params), and running the corrected code still triggers three deprecation warnings from Flux. My attempt at a modernized version is

using Flux

struct MyLayer
    W::Matrix{Float32}
    b::Vector{Float32}
end

Flux.@layer MyLayer

MyLayer(in, out) = MyLayer(Flux.glorot_uniform(out, in), zeros(Float32, out))

(m::MyLayer)(x) = m.W * x .+ m.b

m = MyLayer(3, 2)
x = rand(Float32, 3, 5)
y = m(x)

gs = Flux.withgradient(m) do model
    sum(abs2, model(x))
end
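For completeness, a sketch of how a full training step looks in the modern explicit-gradient style, assuming Flux ≥ 0.14 and the MyLayer definitions above (the Adam learning rate is arbitrary):

# build per-parameter optimizer state for the model
opt_state = Flux.setup(Adam(0.01), m)

# withgradient returns the loss value and the gradients together
loss, grads = Flux.withgradient(m) do model
    sum(abs2, model(x))
end

# grads[1] is the gradient with respect to m; update! mutates m in place
Flux.update!(opt_state, m, grads[1])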

This project is now public at GitHub - JuliaBench/JuliaBench: LLM Benchmark problems for SWE tasks in julia - please feel free to submit PRs, even with WIP problem definitions. I can help get the graders working and the pass rates tuned.
