Poor Impression on Agentic Development in Julia with Codex. Skill issue?

I want to share my recent experience using advanced AI models for software development, and I would be very interested to hear from others in the Julia community to see whether I can improve my workflow.

My experience so far has been poor, despite trying to follow the usual recommendations for making AI tools effective. The project I used as a test case was a small numerical continuation library based on the pseudo-arclength method for computing stationary solutions of differential equations.

The basic library was already implemented. My goal was to make targeted improvements: choosing better algorithms for computing the tangent vector in the predictor step, reducing allocations, implementing bordered solvers, and similar refinements. I had a clear idea of what I wanted to improve, so I thought this would be a good use case for AI assistance.
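For readers less familiar with pseudo-arclength continuation, the tangent computation I wanted to improve can be sketched like this: for F(u, λ) = 0, the predictor tangent t solves a bordered linear system whose last row pins the orientation against the previous tangent. This is a minimal illustrative sketch, not code from my library; the names `Fu`, `Fλ`, and `t_prev` are my own.

```julia
using LinearAlgebra

# Illustrative sketch of the predictor tangent for F(u, λ) = 0.
# Solve the bordered system
#   [ Fu       Fλ ] [ t ]   [ 0 ]
#   [ t_prev'     ] [   ] = [ 1 ]
# so the new tangent is roughly aligned with the previous one,
# then normalize it for the predictor step.
function predictor_tangent(Fu::AbstractMatrix, Fλ::AbstractVector,
                           t_prev::AbstractVector)
    n = length(Fλ)
    A = [Fu Fλ; t_prev']          # (n+1)×(n+1) bordered matrix
    rhs = [zeros(n); 1.0]
    t = A \ rhs                   # naive dense solve; a bordered solver
                                  # would exploit the block structure
    return t / norm(t)
end
```

A dedicated bordered solver (block elimination on the `[Fu Fλ; t_prev']` structure) avoids forming and factorizing the full (n+1)×(n+1) matrix, which is exactly the kind of algorithmic improvement I was hoping the model would propose.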

I started by creating an AGENTS.md file for Codex with GPT-5.4, partly with the model’s help so it could gain context about the project (the model also has an MCP server available to run Julia code independently). I first asked broadly what improvements it would suggest, then guided it toward specific areas, such as the predictor step. Even when I explicitly said that alternative numerical methods were welcome, it mostly defaulted to suggesting allocation reductions. Worse, some of the changes it proposed introduced unnecessary buffers that quickly bloated the code.

After several iterations, each focused on a single well-defined improvement, I was still very dissatisfied with the results. Even in what I thought was a favorable setup, the output left a lot to be desired.

So my question is: does this workflow improve significantly if I keep correcting the model, expanding the .md context files, and adding more skills and guidance? Even if it does improve, it still seems like a lot of overhead, possibly resulting in more work than just implementing the changes myself.

I have no doubt these tools are very effective for other tasks, such as building websites. But when it comes to implementing and improving numerical algorithms, my impression is that they are not there yet. For repetitive tasks, debugging stack traces, or finding straightforward bugs, they often work well and do save time. But for actual development, I have not found them especially time-saving.

I would love to hear other people’s experiences, especially to know whether what I am seeing is normal or just a skill issue on my part.

Maybe JuliaHub can donate, or sell, high-quality datasets of numerical Julia programs to OpenAI and Anthropic to train GPT-6 and Claude-5. That would be the ultimate solution.