Poor Impression on Agentic Development in Julia with Codex. Skill issue?

I want to share my recent experience using advanced AI models for software development, and I would be very interested to hear from others in the Julia community to see whether I can improve my workflow.

My experience so far has been poor, despite trying to follow the usual recommendations for making AI tools effective. The project I used as a test case was a small numerical continuation library based on the pseudo-arclength method for computing stationary solutions of differential equations.

The basic library was already implemented. My goal was to make targeted improvements: choosing better algorithms for computing the tangent vector in the predictor step, reducing allocations, implementing bordered solvers, and similar refinements. I had a clear idea of what I wanted to improve, so I thought this would be a good use case for AI assistance.
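To make the task concrete: the predictor-step tangent I wanted improved is a small, well-specified piece of code. A minimal sketch of one common formulation (names are hypothetical, not my actual library; it assumes a system F(u, λ) = 0 with Jacobian blocks F_u and F_λ, and fixes the tangent's orientation via a bordered system using the previous tangent):

```julia
using LinearAlgebra

# Hypothetical sketch of a pseudo-arclength predictor tangent.
# Solves the bordered system
#   [ F_u      F_λ ] [ t ]   [ 0 ]
#   [ t_prev'      ] [   ] = [ 1 ]
# where t_prev (the previous unit tangent in (u, λ) space) supplies the
# normalization row that fixes orientation and scale.
function predictor_tangent(Fu::AbstractMatrix, Fλ::AbstractVector,
                           t_prev::AbstractVector)
    n = size(Fu, 1)
    A = [Fu Fλ; t_prev']      # (n+1)×(n+1) bordered matrix
    b = [zeros(n); 1.0]
    t = A \ b
    return t / norm(t)        # unit tangent in (u, λ) space
end
```

The kind of improvement I was hoping the agent would propose is, e.g., replacing the dense `A \ b` with a bordered solver that reuses a factorization of F_u, rather than just shaving allocations off this direct solve.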

I started by creating an AGENTS.md file for Codex with GPT-5.4, partly with the model's help so it could gain context about the project (the model also has an MCP server available to run Julia code independently). I first asked broadly what improvements it would suggest, and then guided it toward specific areas, such as the predictor step. Even when I explicitly said that alternative numerical methods were welcome, it mostly defaulted to suggesting allocation reductions. Worse, some of the changes it proposed introduced unnecessary buffers that quickly made the code more bloated.

After several iterations, each focused on a single well-defined improvement, I was still very dissatisfied with the results. Even in what I thought was a favorable setup, the output left a lot to be desired.

So my question is: does this workflow improve significantly if I keep correcting the model, expanding the .md context files, and adding more skills and guidance? Even if it does improve, it still seems like a lot of overhead, resulting in possibly more work than just implementing the changes myself.

I have no doubt these tools are very effective for other tasks, such as building websites. But when it comes to implementing and improving numerical algorithms, my impression is that they are not there yet. For repeatable tasks, debugging stack traces, or finding straightforward bugs, they often work well and do save time. But for actual development, I have not found them especially time-saving.

I would love to hear other people’s experiences, especially to know whether what I am seeing is normal or just a skill issue on my part.

5 Likes

Maybe JuliaHub can donate, or sell, high-quality datasets of numerical Julia programs to OpenAI and Anthropic to train GPT-6 and Claude-5. That would be the ultimate solution.

I find that Cursor's Debug mode with automatic model selection already works well for me on data processing tasks, without any skill configuration. Basic algorithms work. It's already a great productivity boost for me. I haven't had a chance to try it on advanced numerical methods like yours, though.

1 Like

Yes, I feel that for data set exploration, even creating some GUI for visualization, these tools can help iterate a lot.

But, after reading some of the “advances” people have been reporting over the last few months, I was expecting that, after spending some time refining the agentic workflow, it would be better at suggesting nice APIs for the library, algorithmic changes toward numerical methods better tailored to the problem, and so on. Even with a lot of guidance, it does not seem capable of that, and, as I said, for performance most of its suggestions revolve around common themes such as reducing allocations.

For debugging, it is true that I have had a great experience, because it is usually very good at reading errors and identifying the cause, which can save time. But I was expecting something more, and I am not sure whether I am the one at fault here.

3 Likes

I find that agentic coding with Julia works well when you focus on boilerplate, glue-type tasks, but these systems struggle a fair amount when reasoning about allocations, automatic differentiation, and up-to-date best practices across many packages.

Using Augment Code with Sonnet 4.6, I put together a training data pipeline that took some of the high-performance image routines I had written by hand, connected them, and provided an HTTP endpoint PyTorch could call to load batches through shared memory. A huge part of this project was making Julia interface with Python, and Claude basically one-shot this aspect of the project. It wrote and compiled ProtoBuf definitions, used Oxygen.jl for all of the service glue code, and did a good job with other aspects such as managing the cache and image processing. Going back and forth a few times with Claude, it reduced the execution time of a key planning algorithm that made everything actually work by around 100x. I'd say this project was a big success for agentic coding, but I also did quite a bit of manual optimization by hand.
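For a flavor of the shared-memory handoff (a simplified sketch, not the actual pipeline; the function name and layout are invented for illustration), the core idea can be done with Julia's stdlib Mmap: write a Float32 batch into a memory-mapped file that the Python side can open zero-copy, e.g. with numpy.memmap:

```julia
using Mmap

# Illustrative sketch (not the real pipeline): write one batch of Float32
# data into a memory-mapped file so a Python reader can pick it up without
# copying through a socket.
function write_batch!(path::AbstractString, batch::Matrix{Float32})
    open(path, "w+") do io
        # mmap with explicit dims grows the backing file as needed
        buf = mmap(io, Matrix{Float32}, size(batch))
        copyto!(buf, batch)
        Mmap.sync!(buf)   # flush the mapped pages to the file
    end
    return path
end
```

One design note: Julia arrays are column-major, so a numpy.memmap reader would need `order='F'` (or a transpose) to see the same layout.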

This week I’ve been working on speeding up NeuralODE training for a specific use case, so I’m working with Lux.jl. I find Claude still uses older conventions when generating this type of code; the structure it generated was suboptimal, lacking the modularity and type stability provided through the use of @compact. Asking it for suggestions about speeding things up or about different sensitivity algorithms and adjoints was basically pointless. It did a really good job taking care of the interface to Python, however: around 200-300 lines of boilerplate interface code loading an HDF5 file that contained all the prepared training data and weights, and also writing back the trained weights when it was done.

As for writing actual high-performance algorithms (nothing novel, just well-established ones), it usually takes a few iterations, and it's better to focus on self-contained pieces whose performance you can independently validate before moving on to the next.

Not all of these problems are easily solvable, but the issue of not using up-to-date best practices or interfaces in some packages can be mitigated through skills that point to the relevant documentation, or by setting up an llms.txt. I tried the llms.txt approach for the package I maintain (gRPCClient.jl), and it seems to work pretty well. When I need to code gRPC client stuff, the agent spends far less time digging through all of the code and documentation and gets right to coding. The issue is that I couldn't find an easy way to make an agent load the llms.txt unless instructed to do so in AGENTS.md / CLAUDE.md, or by putting it in a skill. It would be great if there were some standard way for agents to interact with Julia package documentation.

2 Likes

With AI, I’ve only had success at generating API docs. It has also been useful for coming up with partial solutions or for quickly discarding avenues I was already suspicious of, and it's competent as a better autocomplete.

1 Like

I’ve been using Claude CLI for about a month now, and am now able to successfully implement large changes in codebases with it (e.g. adding database caching plus logic for cache misses, AST evaluation, etc.) by first drafting a PLAN.md (or similar document, the name doesn’t matter) with requirements, etc. Then I ask Claude to refine my plan, develop interfaces, and so on. Often I will go multiple days just iterating on the planning markdown document until I’m satisfied with it, before I ask it to generate any code. The planning process is very helpful for both me and the LLM. I often find things I would not have considered before, and also get the chance to catch plenty of oversights from the LLM. If you’re going to generate code stochastically, best to make the distribution of samples as narrow as possible :slightly_smiling_face:

14 Likes

I’ve used Codex (with GPT-5.4 high) quite successfully recently, but mostly as an interactive rubber duck to help with design ideas. I have used no AGENTS.md or anything like that.
I’ve had the most success by first discussing the design with it; once it has a narrow plan, it is quite good at implementing it.
It also surprised me by debugging a trimming issue that had been bugging me for quite a while and that I could not address easily, because the stack traces were intractable by hand.

I don’t let it run wild. It needs a good frame of what exactly you are trying to achieve, how you think of approaching the issue, and what matters in terms of tradeoffs; then ask it to review, propose, and refine existing ideas. Left to its own devices, it tends to spin out of control and only reaches decent results after many iterations of implement/test.

Another very useful use case is as a very good and advanced completion engine. You give it the idea of a method on one type by half-implementing a stub, then ask it to complete this for a set of similar types, and it works wonders, cutting through hours of work.
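As a toy illustration of that stub pattern (the types and names here are invented, not from any real codebase): you hand-write the method for one type, and the analogous methods for sibling types are exactly the mechanical completions an agent handles well:

```julia
# Toy illustration of the "half-implemented stub" pattern (names invented).
abstract type AbstractShape end

struct Circle <: AbstractShape
    r::Float64
end

struct Square <: AbstractShape
    s::Float64
end

area(c::Circle) = π * c.r^2   # hand-written exemplar
area(sq::Square) = sq.s^2     # the kind of analogous method the agent completes
```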

It is also very useful for checking that the documentation is in line with the current state of the code, or for generating documentation.

Definitely not for vibecoding though…

If you’re not using Kaimon.jl to work on Julia code with agents … well to be honest I think you’re crazy! :wink:

I wouldn’t do it any other way!

3 Likes

Thank you for pointing this out. I will definitely give it a try.

Still, I am not sure this will cover my desired use case, where the agent is smart enough to really recommend better numerical algorithms in different parts of my program and implement them in a way that does not bloat the code.

But the user experience of working with agents seems like it could be very nice with Kaimon.jl.

2 Likes

Agent models often need to iterate on the code to reach good results. Providing the means to do this iteration quickly and efficiently (with respect to both time and tokens) has brought huge benefits in my experience. Also, use the best models (Opus 4.6 with the 1M-token context), and use multiple models. Kaimon provides a way to connect a Julia session to multiple agents at the same time.

This is still a very nascent and developing area in software development. It is not perfect: at times surprisingly effective, at times surprisingly frustrating. However, I think it will likely only get better with time, and Kaimon makes it better now. If you use good models and direct them to use the tools, they will figure out the code structure faster and more efficiently, and be able to perform experiments more quickly. Install the tools for semantic search (ollama and qdrant); this alone saves massive amounts of time and tokens by letting the agent quickly pinpoint what is implemented where.

If you have any questions about setup, please feel free to reach out here on the chat, or provide feedback on the GitHub project.

5 Likes

I noticed this yesterday when I was trying to make my dudt! non-allocating for use with Enzyme.jl. The LLM was completely clueless until it actually got the allocation profiling feedback, at which point it immediately fixed the issue. I'm going to have to try out Kaimon.jl. I'm guessing I could set up a “profile and reduce allocations” skill with it pretty easily.
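That feedback loop is cheap to reproduce by hand. A minimal sketch (a hypothetical dudt!, not my actual code) that gives the agent the one number it needs, the allocation count after a compilation warm-up:

```julia
# Hypothetical sketch of the "profile and reduce allocations" loop: an
# in-place dudt! using fused broadcasting, checked with @allocated after
# a warm-up call so compilation cost is not counted.
function dudt!(du, u, p, t)
    @. du = -p * u        # fused in-place broadcast: no temporaries
    return nothing
end

function measure_allocs(n)
    du = zeros(n)
    u = rand(n)
    p = 0.5
    dudt!(du, u, p, 0.0)                    # warm up (triggers compilation)
    return @allocated dudt!(du, u, p, 0.0)  # should be zero once compiled
end
```

Feeding a nonzero result of `measure_allocs` back to the model, together with the offending lines, is essentially the feedback that unblocked it for me.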

For anything more complex than a single AI chat session or single commit, I use speckit. This makes a huge difference.

1 Like

This is my general approach too (in Claude) and has worked well in my experience, with the added step that I also often have Codex review the plan iterations and code implementations, and I then give its (modified) feedback back to Claude. Codex consistently finds oversights and missing test cases from the plans.

I also find that for higher level numerical algorithm planning and mathematical discussion (i.e. before I’m even at the stage of wanting to implement something in Julia) that I prefer working with Gemini. It seems to work the best at that level. It is also very good at explaining pieces of papers to me (i.e. if I am struggling with a section of a paper that I feel is poorly written or missing too much detail of a method it can often give a much more coherent explanation than the paper itself).

4 Likes


Everything should always be measured in work-precision: how much work did you put in, what accuracy are you expecting, and what was your query? I find that a lot of people who say agents don’t work are just trying the “AI plz fix” kind of query.

I shared some of the queries I use: What Agentic AI "Vibe Coding" In The Hands Of Actual Programmers / Engineers - Stochastic Lifestyle

For example, OrdinaryDiffEq.jl’s FBDF:

OrdinaryDiffEq.jl's FBDF and QNDF currently use the Hermite interpolation fallback for their dense output / interpolation. However, these have a well-defined interpolation on their k values that should be used. For example, FBDF has the Lagrange interpolation already defined and used in its nonlinear solver initial point https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqBDF/src/dae_perform_step.jl#L418. This should be used for its dense output. QNDF has it defined here: https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqBDF/src/bdf_perform_step.jl#L935-L939.

If you look at other stiff ODE solvers that have a specially defined interpolation, like the Rosenbrock methods, you see an interpolations file https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqRosenbrock/src/rosenbrock_interpolants.jl with a summary https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqRosenbrock/src/interp_func.jl that overrides the interpolation. Importantly too, though, the post-solution interpolation saves the integrator.k, which are the values used for the interpolation https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqRosenbrock/src/rosenbrock_perform_step.jl#L1535. If I understand correctly, this is already k in FBDF, but in QNDF it is currently the values named D. The tests for custom interpolations are https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/test/regression/ode_dense_tests.jl; search around for any more Rosenbrock interpolation tests as well. This should make it so that savevalues! always uses the interpolation https://github.com/SciML/OrdinaryDiffEq.jl/blob/4004fc75dff09855bb96333f02d4ce0bb0f8c57c/lib/OrdinaryDiffEqCore/src/integrators/integrator_utils.jl#L122, while if dense=true (i.e. normally when saveat is not specified) the interpolation is then done on sol(t) by using the saved (sol.u[i], sol.t[i], sol.k[i]).

and one for SciMLSensitivity:

The SciMLSensitivity.jl callback differentiation code has an issue with its design. It uses the same vjp calls to `_vecjacobian!`, but its arguments are not the same. You can see this here https://github.com/SciML/SciMLSensitivity.jl/blob/master/src/callback_tracking.jl#L384-L394 where the normal argument order is (dλ, y, λ, p, t, S, isautojacvec, dgrad, dy, W), but the callback one puts p second. This is breaking to some of the deeper changes to the code, since for example Enzyme often wants to do something sophisticated https://github.com/SciML/SciMLSensitivity.jl/blob/master/src/derivative_wrappers.jl#L731-L756, but this fails if y is now supposed to be a p-like object. This is seen as the core issue in 4 open PRs (https://github.com/SciML/SciMLSensitivity.jl/pull/1335, https://github.com/SciML/SciMLSensitivity.jl/pull/1292, https://github.com/SciML/SciMLSensitivity.jl/pull/1260, https://github.com/SciML/SciMLSensitivity.jl/pull/1223), which all want to improve the ability for p to not be a vector (i.e. using the SciMLStructures.jl interface https://docs.sciml.ai/SciMLStructures/stable/interface/ and https://docs.sciml.ai/SciMLStructures/stable/example/), but this fails specifically on the callback tests because the normal spot for p is changed, and so the interface needs to be applied to the other argument. This is simply not a good way to make the code easy to maintain. Instead, the callback code needs to be normalized to have the same argument structure as the other codes.

But this was done for a reason. The reason why p and dy are flipped in the callback code is that it is trying to compute derivatives in terms of p, keeping y constant. The objects being differentiated are https://github.com/SciML/SciMLSensitivity.jl/blob/master/src/callback_tracking.jl#L466-L496. You can see `(ff::CallbackAffectPWrapper)(dp, p, u, t)` flips the normal argument order, but it's also doing something different: it's not `u,p,t` but rather `p,u,t`, because it's calculating `dp`, i.e. this is a function of `p` (keeping u and t constant) which computes the `affect!`'s change given `p`, and this is what we want to differentiate. So it's effectively hijacking the same `vecjacobian!` call in order to differentiate this function w.r.t. p, by taking its code setup for `(du,u,p,t)` and then calling the same derivative on `(dp,p,u,t)`, taking the output of the derivative w.r.t. the second argument.

But this is very difficult to maintain if `p` needs to be treated differently, since it can be some non-vector argument! So we should normalize all of the functions here to use the same ordering, i.e. `(ff::CallbackAffectPWrapper)(dp, u, p, t)`, and then if we need to get a different derivative out of `vecjacobian!`, it should have a boolean switch for what to differentiate by. This would make it so that SciMLStructures code on the `p` argument always works.

Now this derivative does actually exist; the `dgrad` argument is used for the derivative of the output w.r.t. the p argument. But if you look at the callback call again:
  vecjacobian!(
      dgrad, integrator.p, grad, y, integrator.t, fakeSp;
      dgrad = nothing, dy = nothing
  )
it's making dgrad=nothing. The reason it does this is that we only want that derivative, so we effectively want the first argument (the normal derivative accumulation ddu) to be nothing, but `vecjacobian!` calls do not support that? It seems like they do have dλ=nothing branches, so it should work to flip the arguments back to the right ordering and then set up to use the dgrad arguments with a nothing on the dλ, but this should get thoroughly tested. So do this refactor in isolation in order to get all of the callback tests passing with a less hacky structure, and then the SciMLStructures PR should be put on top of that. All 4 of those PRs should be able to be closed if p just supports SciMLStructures (they are all almost the same).

On these kinds of queries, the PR is quite usable the first time.

19 Likes

I agree. With a well-designed prompt, for example using a document where you define all the requisites and constraints of the project, agents become really powerful. A simple “do this, do that” only works for small, specific tasks.

Using skill.md files is also very helpful imho.

Has anyone heard of someone selling their datasets to LLM providers? I thought they just took the data for free.

2 Likes

New York Times and Reddit have content licensing deals with AI companies. If the maker of an LLM cares about Julia enough, it could certainly employ or pay Julia devs to provide feedback and fine-tune the LLM’s code output.

2 Likes