Hi all,
For the past few weeks, I’ve been working on a set of benchmark test cases that will be used to evaluate and train LLMs to improve their performance on Julia code. I’m particularly interested in test cases where people have tried AI agents and the agents barely failed, i.e. tasks that appear to be just beyond the capability frontier of current leading-edge agents. This’ll all be open source in the medium-term future, but for now I’m keeping the group a little smaller so I can help people get the test cases right, provide API credits for measuring pass rates, etc. If you’re interested in participating, please ping me on Slack.
Thanks!