Policies around registering AI-created packages in General?

I did not address anyone in particular with that part of my comment. It was just a way of establishing that, even in an uncurated registry, certain behaviors are still prohibited, so that I could continue making my argument.

I’ll admit I haven’t read every post in this thread, but has this even been suggested? I don’t see that happening (except of course in egregious cases where it effectively is spam).

4 Likes

My feeling is that this thread has been mislabeled.

Registering an AI-created package as per the title? One that AI conceived the need for, implemented and went through the test-debug cycle before submitting it for registration? That would certainly be an interesting package. Maybe a useful one.

Any useful code doing anything of reasonable complexity likely contains bugs. We know that AI hallucinates and produces buggy code, but we also know that humans have off days and write code containing bugs.

Humans have always used tools. Any tool can be misused and any tool can be dangerous, whether used correctly or incorrectly. Some tools require significant investments in time before they can be used effectively, even by experts. AI assistance is just another tool. If it’s a net positive, then I don’t understand why it wouldn’t be used.

$0.02

D.

5 Likes

I have a summary of the discussion to this point on my substack. I had posted the original here in violation of the community guidelines on AI-generated content, which I regret.

I think folks are almost universally agreed that genAI tools have helped some developers accelerate their projects and even produce projects that might otherwise have been outside their reach due to limits on time, etc.

Folks also seem agreed that even without genAI the registry contains a substantial number of unmaintained repos with critical bugs that erode the quality of the Julia ecosystem.

The reason I personally find the case for unrestrained exuberance about genAI tools to be substantially weaker than the case for restrained enthusiasm, or even pessimism, is that genAI changes nothing about the underlying incentives that create the big public-goods problems that plague open-source ecosystems and many others (sorry, I’m a biologist, so this language is hard for me to avoid :slight_smile: ).

For example, the unmaintained and bug-ridden code in the registry is produced by folks who, in the end, lack the resources or motivation to make high-quality contributions to the Julia ecosystem. If you give those folks genAI, they’ll simply multiply their number of contributions/packages by some factor X_low relative to what they would have produced otherwise. Folks who do have the resources and motivation to make high-quality contributions will also accelerate their contributions, by a factor X_high. Some individuals’ X_low will be smaller than other individuals’ X_high, but the lack of motivation and restraint on the part of X_low individuals suggests that X_low > X_high on average. Moreover, genAI lowers the bar to making contributions, so new X_low contributors will outnumber new X_high contributors because their threshold for entry is lower. In the end, just like with every public good, the average quality of the ecosystem will very likely decline unless the ecosystem responds with additional regulation or “curation” (a toy calculation below illustrates the mechanism).
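To make the averaging argument concrete, here is a minimal sketch with entirely made-up numbers; nothing here is empirical, only the mechanism is the point:

```julia
# Toy model: average registry quality before and after genAI adoption.
# All numbers are hypothetical; only the direction of the effect matters.
n_low, n_high = 100, 100      # contributors of low- and high-quality packages
q_low, q_high = 0.3, 0.9      # average quality of their packages (0 to 1)
X_low, X_high = 5.0, 2.0      # genAI output multipliers (assumes X_low > X_high)

avg_before = (n_low * q_low + n_high * q_high) / (n_low + n_high)
avg_after  = (n_low * X_low * q_low + n_high * X_high * q_high) /
             (n_low * X_low + n_high * X_high)

println("average quality before: ", avg_before)  # 0.6
println("average quality after:  ", avg_after)   # ≈ 0.47
```

Even a modest gap between X_low and X_high is enough to drag the average down, which is the whole point about incentives rather than about any individual contributor.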

And in fact the community has already responded with additional regulation! And naturally that additional regulation centers on fully “vibe-coded” contributions, because those seem most likely, for all the reasons just described, to be of lower quality than ones with unmistakable marks of human involvement.

So while I appreciate all the positives that some of our excellent contributors have experienced using genAI, and thus understand their apprehension at restricting these tools, history and economic/biological theory are on the side of some, and likely substantial, restraint.

8 Likes

Assuming those contributions, in the case of new packages, make it past @goerz. So probably not.

Is the quality of an ecosystem determined by an average? Do poor packages weigh down the ecosystem? I don’t think so. People stay in the Julia ecosystem because there are enough high-quality packages that meet their needs. The few poor packages floating around really don’t matter.

We might take inspiration from the new Submission Guidelines at arXiv.org, especially author registration via a process of endorsement, and some type of moderation that we could adapt for our needs.

I didn’t post that PR

Many apologies to all Large Language Models that felt offended by those grievous slights!

In my defense, I have abiding respect for all robots, especially those with shining metal skeletons and red eyes.

But that is subjective and places a higher burden on the registry maintainers to argue with an author about whether they put in sufficient effort. On the other hand, whether an LLM has been used or not is objective and leaves no room for interpretation.

2 Likes

I would argue that it is not a tool in the sense we are used to. Tools behave predictably and tend to be well understood by their users. LLMs share neither of those properties: they are statistical in nature, and they are closed black boxes, not under the control or ownership of their users.

In my experience, moderation by @goerz is pretty light, and he tends to be among the more agreeable OSS project maintainers.

5 Likes

Those comments are directed at people. But as all your responses so far have just been more mockery, I will stop responding now.

4 Likes

One of the community guidelines is “Don’t post generative AI outputs”.

7 Likes

Hi @mbauman. I think we are having a misunderstanding here. Probably I am not expressing my opinion clearly. Notice that all this discussion started because the contributions of @kahliburke were said to be “borderline with respect to” the guidelines. If this kind of work, with amazing quality, is marked as borderline regarding the guidelines, then the current situation amounts to an absolute position: do not register vibe-coded AI packages.

Yes, that’s what I have been trying to defend all along. Respect that people are using their free time to contribute to this community, and that they are free to use any tool they want to achieve that goal. The edge cases of people spawning packages with AI are as unusual as people registering packages with bad human-written code.

From following this discussion, I think it would be appropriate to explicitly define the term “vibe coding” as people seem to have different ideas about its meaning. I subscribe to the Wikipedia definition of Vibe Coding:

Vibe coding involves accepting AI-generated code without reviewing it, instead relying on results and follow-up prompts to guide changes.

Maybe we can all agree that LLM-assisted coding IS NOT EQUAL TO vibe coding? Me trying to create a GUI in Java would be vibe coding, because I would have zero idea what the code does, as I have never looked at Java code in my life.

And based on this definition, I would guess that no one here is actually vibe coding. The packages in question all (I presume) have a human in the loop to ensure some level of code quality.

I would also like to push back a little bit on this argument “buggy LLM code is fine because humans also write horrible code sometimes”. Two wrongs don’t make a right.

Paraphrasing the Wikipedia article on vibe coding: one study cited there found that primarily LLM-generated code on GitHub had 1.7x more major bugs and 2.74x more security vulnerabilities. Also, over three years, as LLMs improved, code security did not. So the claim that LLMs write better code without a human reviewer is just not true.

BUT I repeat, I do not think anyone in this thread falls into this category of true vibe coding. I trust that everyone is doing their due diligence. Also, all the packages in question are already approved and in the registry, so clearly the Registry maintainers think so as well.

2 Likes

Yes! As I’ve said repeatedly:

The terms “vibe-coding” and “slop” are defined in our guidelines!

No, you are not responding to what was actually said, or to what the actual policies are. You’re responding to a perceived “absolute position” that does not exist, with an apparent definition of “vibe-coding” that in no way matches the definition of vibe-coding that exists in the guidelines. Can you look at what the actual guidelines are and what has been said about their intent in this and the previous thread, and stop arguing against strawmen?

No, that is not what was said! What got a raised eyebrow was a comment

That is, merging PRs that were generated by LLMs without any human supervision. That is at least on the road (“borderline”) to vibe-coding, something that will lead to poor software quality if done at a large scale. It does not mean that the actual full Tachikoma package or @kahliburke’s development practices in their entirety are “borderline”.

I would note that the package was registered (which included my review, without being flagged), long before this entire discussion. The way the no-vibe-coding guideline gets applied in practice is that when I or anyone else reviewing registrations notices obvious and egregious slop, we can raise a flag and link to that policy without having to explain over and over again what exactly we expect for registered packages. It is not a blanket policy against LLM usage.

Great, because that’s what the guidelines actually are. So we’re all in agreement, and maybe we can stop arguing in circles?

6 Likes

Sure! I think we are now on the same page :slight_smile:

3 Likes

I don’t think it is as clear-cut. For example:

  • LLMs are used to develop Julia itself (#61313, #61316, or even Pull requests · JuliaLang/julia · GitHub), so banning it completely for packages feels hypocritical.
  • If an LLM finds an existing off-by-one error in the package, should it no longer be allowed? That seems odd. So the discussion is then how much of a package is allowed to be LLM-generated.
  • If an author says that the code is not LLM-generated, but it looks to be, there is still room for interpretation, and it causes a maintenance burden. If LLM-generated code is fully banned, people will likely start lying about it.
4 Likes

My opinion, probably on the “accelerationist” side of the spectrum, is that state-of-the-art coding agents are now better than most humans at most tasks, especially if those tasks are of the more mundane or engineering-y variety and do not require any kind of specialized technical research.

Note that this is a very recent opinion I’ve formed based on rapid advances in model quality; I did not believe this as recently as six months ago.

However, the biggest difference I’ve observed and experienced working with agents (mainly Claude) is that they tend to skew the ratio of time spent writing vs. reviewing code heavily towards the former.

Any author, man or machine alike, is going to make mistakes while writing code; this is 100% unavoidable. But as a human, I tend to find myself writing some code, then going back and rereading what I wrote, re-understanding it, reviewing, building incrementally, redesigning, etc., such that my total write : review time ratio by the end of a project might be 1:10.

Coding agents, on the other hand, stay in hot pursuit of the goal the prompter has set for them, and while I’m sure there are some internal review loops in the RL to try to enforce code quality, their writing time usually vastly exceeds their reviewing time. This is also why I (and, I observe, other people) get good results with manual review loops (“steering the ship”): getting the agent to write some code, prodding it about the design, asking it to audit itself and reconsider, maybe using a second model provider with a different set of weights for another opinion, etc.

In that light, I view the Julia community’s norms and guidelines around AI code as really being about ensuring that the review : write time ratio remains high. Even if an agent can write a better PR than the original package author could, merging it without review would mean that the use of the agent has caused the resulting code to undergo less scrutiny than it would have received during the natural course of human development.

2 Likes

I have not advocated banning any packages, but a simple disclosure of LLM usage.

In my opinion, as long as the bug fix is produced by a programmer, no disclosure is required. Searching for errors is perhaps the only use of LLMs that raises no concerns, provided the findings are verified and corrected by a person.

That’s verifiably false.

My position on AI could be described as Žižekian, on both the technical and political aspects. In other words, I dislike its impact on the environment and its potential to stymie human creativity and independent creators, but I also view it with cautious optimism.

Many of these arguments seem like echoes of the arguments made against the printing press, the opening of universities to non-nobles, the radio, the television, the internet. I’m sure back in the neolithic there were people objecting to writing down writs of trade on clay rather than keeping it all in your head.

As part of this cautious optimism, and for my own curiosity I asked @kahliburke to let loose the process of code review he described earlier on the package I’m developing, PortfolioOptimisers.jl.

I think this is a good test because it is quite a complex codebase. Its documentation is incomplete but still extensive, including examples, and it has almost 90% test coverage[1]. So it seemed like a good idea to put an agent through its paces reading source code, docs, and tests to come up with some sort of overall code review.

I know this is different to code generation, but presumably what an LLM thinks code should look like and the code it actually produces are highly correlated.

I have to say, I’m impressed. It really hasn’t offered any new insights, but it has found pretty much every single problem I have with the codebase. Unfortunately, it doesn’t usually offer better alternatives. It complains about the same things I complain about, but neither it nor I have found better alternatives.

It also suggests some suspect changes. I’m using the concrete struct pattern, and it suggests I either remove types or lump sets of parameters into smaller structs. However, the first loses type information for no benefit, because dispatch does not usually happen at the struct level but inside functions that use only some fields of the struct. The second adds unnecessary friction without enabling reusability of the smaller structs, as they would only ever be part of the larger struct and appear nowhere else.
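For context, a minimal sketch of what I mean by the concrete struct pattern; the RiskModel type and penalty function are hypothetical stand-ins, not the actual PortfolioOptimisers.jl code:

```julia
# Concrete struct pattern: every field's type is a type parameter, so
# instances are fully concrete and field access is type-stable.
struct RiskModel{T<:Real,V<:AbstractVector{<:Real}}
    lambda::T    # risk-aversion parameter
    weights::V   # asset weights
end

# Dispatch usually happens inside functions on individual fields,
# not on the struct type itself, so "removing types" gains nothing.
penalty(m::RiskModel) = m.lambda * sum(abs2, m.weights)

penalty(RiskModel(0.5, [0.2, 0.3, 0.5]))  # 0.19
```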

The one exception is its suggestion of trait-based approaches in some instances, rather than the ad-hoc const-based ones I use. This is something I’ve thought about but haven’t got round to doing, because it requires more indirection in the code, and I’m unsure what advantages it would bring. Currently, if users want to define their own risk measure, they only have to define the high-level public API function. With a trait-based approach, they’d have to implement all the internal trait-based functions. Dispatch can happen either at the public API level or at the internal private level, and I’m not sure which is better.
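Roughly, the trade-off looks like this (a hedged sketch with made-up names like risk_adhoc and risk_trait, not my package’s actual API):

```julia
using Statistics: var

abstract type AbstractRiskMeasure end
struct Variance <: AbstractRiskMeasure end

# Option A (ad-hoc): users extend the public function directly;
# one method is all a new risk measure needs.
risk_adhoc(::Variance, returns) = var(returns)

# Option B (trait-based): the public function routes through a trait,
# so a new measure must define both the trait and the internal method.
abstract type RiskClass end
struct Dispersion <: RiskClass end

riskclass(::Variance) = Dispersion()                          # the trait
risk_trait(m::AbstractRiskMeasure, r) = risk_trait(riskclass(m), m, r)
risk_trait(::Dispersion, ::Variance, r) = var(r)              # internal method

r = randn(100)
risk_adhoc(Variance(), r) == risk_trait(Variance(), r)        # true
```

With Option A, users define one method and are done; with Option B, they must know about and extend the internal trait layer, which is exactly the extra indirection I’m hesitant about.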

For other purposes, such as when type information is not readily available at parse time due to working with vectors or nested vectors, I do use trait-based approaches, and it has mentioned nothing about that.

In all, this is in line with what I’ve found LLMs can do. It’s surprisingly competent at being an extremely thorough, if at times overzealous, copy-editor. When generating code, this overzealousness can make it do stupid things, unnecessarily complicating matters and sometimes shooting itself in the foot.

Fortunately, it’s been my experience that, for anyone with a cursory technical understanding of programming in general, it’s quite easy to spot mistakes or suspect code. I think the big issue is that the volume and rate at which these things can generate code make it incredibly easy for a reviewer to miss things.

Honestly, I don’t know how one would police this other than the honour system (i.e., if you’ve used your vibe-coded package and can vouch for its code), some sort of allowed-contributor list, automated reviews via statistical tools (counting incidences of known LLM-generated patterns), actual reviews by a separate agent, and eventually an actual human review.
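For what it’s worth, the statistical-tools idea could start as simple as the toy sketch below; the pattern list is entirely invented and would need real research to be useful:

```julia
# Toy heuristic: count occurrences of phrases often associated with
# LLM-generated code across a package's source files. Purely illustrative.
const SUSPECT_PATTERNS = [
    r"# This function "i,   # boilerplate narrating comments
    r"As an AI"i,
    r"# Step \d+:"i,
]

function pattern_count(dir)
    n = 0
    for (root, _, files) in walkdir(dir), f in files
        endswith(f, ".jl") || continue
        src = read(joinpath(root, f), String)
        n += sum(length(collect(eachmatch(p, src))) for p in SUSPECT_PATTERNS)
    end
    return n
end

pattern_count("src")  # e.g. flag for human review above some threshold
```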

That and having policies for keeping the official registries in “good” order, to some definition of “good”.


  1. That’s not to say the tests are as thorough as the percentage might indicate, since the nature of the package allows for combinatorially scaling parameter combinations, but it’s the best I can do at the moment. ↩︎

I’m not familiar with any robust or definitive benchmarks to verify the statement one way or the other, so I can only draw my opinion based on my own experiences and observations.

If you are familiar with such reliable benchmarks, I recommend you contact one of the big AI labs as they will probably pay you quite a lot of money for it :wink:

1 Like

The problem is that the field moves so quickly that any benchmarks coming out have to be updated in real time. Papers take so long to publish that by the time they come out, the conclusions are no longer relevant. That makes it very difficult to have an informed discussion about the benefits of AI.

As a recent example of how fast it’s moving, the latest Opus model was able to solve a research problem that Don Knuth was working on: https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf

1 Like