Policies around registering AI-created packages in General?

Some vibe-coded packages are awesome, others might be less so, so it’s not vibe-coding per se that’s the issue.

Vibe-coding has the potential to create code hallucinations. I think these are less likely to appear in code written entirely by humans. (Humans have plenty of other ways to include bad code!)

A recent example: A few weeks ago, a package was submitted for registration in the General registry. It contained a number of hallucinations. Some non-existent Luxor functions were being called simply because they were “needed” at that point; some existing Luxor functions were being called with incorrectly typed arguments, presumably because the function “ought to” take information in that particular format.

In all cases, the code would have errored if run. It had obviously not been visually checked by a human with sufficient knowledge, or tested for run-time errors.

Luxor’s a pretty simple package, and there’s a reasonable amount of documentation for it, but Claude was still able to hallucinate some sloppy code. The errors were easily spotted by a human familiar with this particular package, but probably wouldn’t have been noticed by the author supervising the agent’s activities. I suppose the human trusts the agent to do the right thing everywhere…

So my take is that new packages coded with an Agent could usefully be marked as such (although it’s often easy to tell), and checked for hallucinations.

But who by? Volunteers who don’t want their own packages to be the apparent cause of poor user experiences down the line might be happy to help review new packages that rely on their code. Otherwise, it’s the community’s job to continue to maintain the ecosystem and monitor proposed additions.

3 Likes

I bet you $1 million that I could write an AI-driven package which would iterate through the >12k Julia packages, many in various states of disrepair, and fix bugs at a rate 10x - 100x faster and more reliably than any human reviewer could. I mean heck, who is to say I haven’t already applied that method to packages maintained by some of our outspoken AI critics on this thread?? :thinking:

1 Like

I mean heck, who is to say I haven’t already applied that method to packages maintained by some of our outspoken AI critics on this thread?? :thinking:

I like to think that everyone contributing to this discussion does so because they genuinely care about the quality of Julia packages in General. I’m not sure if your comment above is intended as such, but I think we should try not to fall into an us-vs-them dynamic depending on how one believes these goals are best achieved.

12 Likes

Sometimes yes, sometimes no. It totally depends on the kind of bugs it is fixing. I asked the best models available to fix the 8 bugs I mentioned. In 5 out of 8 cases, the fix was good; in the other 3, the fix addressed the symptom but not the root cause. You will also need to rule out false positives, which happen.

Not to say this isn’t a good idea in general, if done well, i.e. without bothering people who dislike the approach or don’t want to be inundated with things they are urged to take care of. One could create forks of all the packages and file the issues there, so that nobody gets notified directly, and then make an announcement post about it. It would require burning a lot of tokens, though, so one would probably need some financial support, or could run the experiment on, say, only 100 packages. The key would be asking the models to find issues beyond the ones already reported. Maybe the experiment will work well, maybe it won’t; I expect it would with the necessary guardrails, and it would surely be a nice experiment.

No, it’s mainly just to point out how incorrect some beliefs are.

This is exactly what started the discussion: people commenting in the Tachikoma.jl release post that AI should not be used “that much”.

Yes, this is an example of how AI coding can lead to bad algorithms. On the other hand, we have a huge number of packages that are unmaintained, have critical bugs, implement breaking changes in minor releases, etc. The conclusion is: unless we 100% lock the Registry into a curated set of packages with peer review, I see no point in banning, labelling, or otherwise singling out packages that use AI.

3 Likes

I am also against having an agent combing the Registry looking for bugs and submitting fixes. Again, we are an open-source community driven mostly by our free time. We are not discussing pushes to Julia’s main repository, but to our own packages. In this case, I maintain that each of us has the right to decide. The license is clear: no warranty.

1 Like

I need to remind everyone again that our existing policy makes a huge distinction between AI-assisted coding where a human guides the use of the AI and vibe-coding, where the human has no idea what is going on. The former is encouraged, the latter forbidden for registered packages.

Using AI to find and fix bugs is more than welcome. The SciML organization is doing that extensively, and it is cited as an appropriate use of AI in the guidelines.

12 Likes

Exactamundo! I don’t really mind if people want to discuss these topics. But I do mind when people come into the post I made to announce a cool project that I devoted time to and am giving as a gift to the community, criticize my project having never used it, tell me that I’m not doing software right, that I need to be a better software person by not using that “vibe” stuff, and that my software should carry labels to warn people away because it’s likely ‘slop’ and comparable to genetically modified organisms in food.

2 Likes

Can we start a new thread to discuss whether we should require a warning label on Julia packages that do not incorporate any AI tooling in their construction, as a means of steering people away from projects that are more likely to have serious bugs that humans introduced into them and have never been reviewed by modern LLM agents?

1 Like

Can you please get over thinking this is about you? As I pointed out before, this thread was split off from the discussion of your package for a reason (and at my request). Your use of AI is appropriate and not a concern within the existing guidelines.

This thread is simply a space for the community to discuss their views on LLMs playing an increasingly large role in software development. Clearly enthusiasm must be balanced with responsible use, and where that balance lies is an appropriate point of discussion. People might have differences of opinion and sit at different points on the spectrum from optimism to pessimism, but let’s not lose sight of the big picture here.

23 Likes

The big picture is: no one here is paying anyone for creating the packages that turned Julia into this amazing ecosystem. Let’s always remember this. So everyone must be free to be as enthusiastic as they want until we create a curated Registry. In that case, however, all packages must undergo inspection: good AI vibe-coded ones, bad AI vibe-coded ones, bad human code, good human code, etc.

Excuse me? I was saying that @goerz went into the Tachikoma.jl release post, which is an amazing package, telling @kahliburke:

I would urge you to curb your AI enthusiasm just a little bit, and remember that the fundamental design, the quality, and reliability of your package are still your responsibility,

Why do you disagree with that? Do you think this was a nice thing to do in a community that is so inclusive?

That just isn’t true! The registry is curated. Not as curated as CRAN, but definitely more than, e.g., PyPI. It is managed as a community resource that is not a free-for-all, but one that places responsibilities on the package author. Package naming is heavily steered, there are extensive automated checks on all new registrations, and there is an official three-day waiting period for community review where anyone can and should flag issues with the quality of a package.

The “not reviewed/curated” language in the README is from a liability point of view: nobody should rely on packages in the Registry being reviewed. But we certainly do review them to some extent. LLM slop along the lines of what @cormullion pointed to is low-hanging fruit for human review, but unfortunately is hard to catch with automated tests.

Improper LLM usage is not the only quality issue that is routinely flagged during review: “this should be contributed to an existing package instead” is somewhat common. And a sloppy human-made translation of a Fortran program, without any tests, would also be something that gets flagged.

4 Likes

This is of course crazy, but I don’t think vibe coding was the primary issue here. It was an enabler. Clearly this person did not care for either the Julia ecosystem or the quality of their own packages. Even if you’re vibe coding, it is very easy to set up a proper unit-testing system with code coverage tools. BestieTemplate.jl will set this up for you for free, zero effort.

2 Likes

Vibe coding being an “enabler” is the primary issue here. That user submitted about two dozen vibe-coded packages in an hour. That kind of thing wouldn’t have been possible without LLMs.

5 Likes

At least in Portuguese, “curated” means that an individual or a board analyzes all the items and selects which ones are worthy of being included in the list. I’m pretty sure this is not the case… This should hold even for new releases, by checking the changes and verifying whether they are breaking and follow semantic versioning.

I have seen (many times) packages announced here that got registered and could not even compile. Of course, it was fixed right away, but if the Registry were actually curated in the correct sense of the word (at least in Portuguese), this would never have happened.

The tool is not the problem. It’s how the tool is used. Vibe coding is just a term we made up for heavy LLM assisted coding. What really happened there is some new form of spam.

2 Likes

Neither could the new version of PrettyTables.jl v3… This is life: a new tool will lead to a new set of problems. However, it is not good to prohibit the tool (and I mean even 100% vibe-coded packages) because of some bad actors.

Example: if I have a test suite that thoroughly tests an algorithm that estimates the state of a satellite, I can have Claude write the estimator code without checking a single line. If the package passes all the tests, I am confident it works. Why should this be banned?
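To make the idea concrete, here is a minimal sketch of that workflow: the human writes the test suite first, and the LLM-generated implementation is accepted only if every test passes. The `estimate_state` function below is a hypothetical stand-in for the satellite estimator (a simple exponential moving average), not code from any real package; only the `Test` stdlib is assumed.

```julia
using Test

# Hypothetical estimator standing in for LLM-generated code:
# an exponential moving average over noisy scalar measurements.
function estimate_state(measurements::Vector{Float64}; α::Float64 = 0.5)
    isempty(measurements) && throw(ArgumentError("no measurements"))
    est = measurements[1]
    for m in measurements[2:end]
        est = α * m + (1 - α) * est   # blend new measurement into the estimate
    end
    return est
end

# The human-written contract the generated code must satisfy.
@testset "estimator sanity checks" begin
    @test estimate_state([1.0, 1.0, 1.0]) ≈ 1.0          # constant signal is recovered
    @test 1.0 < estimate_state([1.0, 2.0]) ≤ 2.0         # estimate stays between samples
    @test_throws ArgumentError estimate_state(Float64[]) # empty input is rejected
end
```

The guarantee is of course only as strong as the test suite: the generated code is trusted exactly to the extent that the tests pin down the required behaviour.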