Should General have a guideline or rule preventing registration of vibe-coded packages?

Note: here vibe-coding means “building software with an LLM without reviewing the code it writes”.

Background: General is moderately but not totally permissive

Package registries across programming language communities come in various flavors of permissiveness. Some, like npm/pypi/cargo, automatically & immediately register a package once a few automated checks pass. Others, like R’s CRAN, or TeX’s CTAN, perform a manual review which can take a few days up to weeks. As an example of the stricter flavor, CRAN’s policies say:

CRAN packages should use only the public API. Hence they should not use entry points not declared as API in installed headers nor .Internal() nor .Call() etc calls to base packages. Also, ::: should not be used to access undocumented/internal objects in base packages (nor should other means of access be employed). Such usages can cause packages to break at any time, even in patched versions of R.

General falls in-between these two camps; like npm/pypi/cargo, new packages and new versions can be automatically registered, but unlike those, registration is not immediate. There is a 3 day waiting period for new packages, allowing community comment and review on the registration pull request (15 minutes for new versions). Additionally, anyone can prevent auto-registration by leaving a comment without [noblock], so review has “teeth” in some sense. However, General does not prohibit packages from using internals.

I think the more permissive registries (npm/pypi/cargo) do not necessarily need a policy on vibe-coded packages, as they already tend to serve as pure infrastructure without curation (so they allow them), and the stricter registries already have strict policies and manual review (so they have strong mechanisms to push back or reject). That’s not to say neither group will adopt policies in this area, but I think General is in a bit of a unique place: by being in-between on the permissiveness scale, we need to decide which way to lean in different situations.

Problems with low-effort / unreviewed LLM-written packages

  • They tend to have a bus factor of 0: no human has actually thought through and understands the code.
  • The code tends to look plausible but be difficult to check for actual correctness.
  • They often have a complicated structure, unnecessary and verbose code, and lengthy documentation.
  • Bugs can be more subtle or better hidden than in equally buggy code written by a novice human.
  • They can be produced very quickly, which amplifies all of the other issues.

Problems with writing policy or guidance

  • It is hard to clearly define what counts as “vibe-coded”.
  • Any such rule needs to be checked manually.
  • Ambiguous rules tend to be applied unevenly.
  • It incentivizes authors to hide the markers of vibe-coding without necessarily solving the underlying lack of understanding.

Community feedback

What do folks think? Should General be home to vibecoded packages or should it try to prevent them from being registered? What mechanisms/guidance/policies might be effective?

12 Likes

I don’t think there’s much controversy about whether we want vibe-coded packages in General: we don’t. I personally saw two, and the quality was exactly what I expected. We should keep in mind that nice names are a finite resource that is becoming scarcer every day, and we don’t want to waste them on low-effort packages.

I think having an explicit policy is helpful, even if we can’t detect all cases. First to make it clear to good-faith submitters that vibe-coded packages are not welcome, and secondly to have a clear-cut reason to reject submissions that try to do it anyway but we notice.

13 Likes

We already have both of these points covered by the recommendations/guidelines for naming packages, so I think having policies/guidance for LLM-based code is also just fine.

I also think that packages with too little quality/effort can be a challenge for the community, and I would like to be able to encourage developers to put effort into their code / docs / tests. When I can refer to some policy/guidelines for that, I feel a bit safer commenting on PRs to General.

4 Likes

Yeah, that’s a tough one.
I’ve been vibe coding a couple of simple utility packages for local use. I wouldn’t register them anytime soon, but they certainly added some value for me, so it would be weird if they got rejected once I decide they’re ready for the public.

On the other hand, I find it pretty scary to think that in a couple of years you’ll find a Julia package for anything you search, and they all have lots of tests and documentation and look super polished, only to find out after lots of wasted time that there’s no actual substance to the package: the test cases are made up to pass, internals simply try/catch around bugs, and most things don’t work or are incredibly contrived.

I don’t see how we could have good guidelines for this that can be easily applied without diving into the package and checking whether everything works as described.
Also, as @ericphanson already pointed out, just putting pressure on hiding obvious markers of AI won’t get us very far, and there are definitely cases of simple packages that could be 100% AI-generated and work very well for the community (e.g. Claude Code with Julia execution is pretty good at testing and fixing functionality for a clear and simple problem).

So I guess we’ll need to rely on the gut feeling of the volunteers for the foreseeable future?
To make this easier, maybe we could add an AI-checker pass that gives a percentage of code expected to come directly from AI. Then, if gut feeling says this is just an unusable AI package and the AI score is high, it’s easy to deny.

I think there’s also a social aspect: pointing out to a new member of the community that their package looks like AI slop, just because they created the README with the help of AI, always seems pretty harsh to me if I’m not 100% sure.
So, I guess we should also have guidelines about how to communicate this.
A “neutral” score from a CI pass could help with this.
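
To make that concrete, here’s a rough sketch (in Julia) of what such a scoring pass could look like. The markers, weights, and the `ai_marker_score` name are all made up for illustration, and as discussed below, any check like this would be easy to game:

```julia
# Rough, purely illustrative heuristic for "AI-marker" density in a README.
# The markers and weights are invented; this is not a real detector.
function ai_marker_score(readme_text::AbstractString)
    lines = split(readme_text, '\n')
    nlines = max(length(lines), 1)

    # fraction of lines that are bullet points
    bullet_frac = count(l -> startswith(lstrip(l), r"[-*•] "), lines) / nlines

    # emoji density (very rough: codepoints in the main emoji blocks)
    emoji_frac = count(c -> UInt32(c) in 0x1F300:0x1FAFF, readme_text) / nlines

    # stock phrases that LLM-written READMEs tend to overuse
    phrases = ["comprehensive", "seamlessly", "blazingly fast", "battle-tested"]
    phrase_hits = count(p -> occursin(p, lowercase(readme_text)), phrases)

    # combine into a 0–1 score
    return clamp(0.5 * bullet_frac + 0.3 * emoji_frac + 0.1 * phrase_hits, 0.0, 1.0)
end
```

CI could run something like this on the registration PR’s README and post the score as a neutral comment, keeping the actual decision with a human.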

I think we should also not focus on “is this AI-generated vs. human-generated”, but instead on “is it useful/working or broken/not useful”.
That would put too much work on reviewers, though. But I wonder if we could reverse the problem: accept quite easily, but have an actual process for flagging registered packages as not useful or completely broken.

Finally, just a reminder about the ability to spot AI that I saw recently on Mastodon:

[image from Mastodon]

(You’ll never know about the AI cases you haven’t spotted.)

8 Likes

As the most active “triage member” of the General registry, I find that vibe-coded packages are starting to become a bit of a problem: For most normal packages that I “triage” in the #new-packages-feed on Slack, I spend probably less than 10 seconds to check that there’s some amount of documentation, the package name seems reasonable, and maybe a name-similarity override is needed. With LLM-generated packages, this doesn’t work. They look okay on the surface, but as soon as you read closer, things just don’t make sense. But then I have to really think about whether it just doesn’t make sense to me, and maybe there’s just a different perspective, or whether I can tell for sure that it’s LLM slop. That sort of thing really sucks time, and also leads to awkward conversations. I can probably still tell in 10 seconds if something smells like LLM, so it might be helpful to have a guideline to point to that says “no vibe-coding allowed”, without having to go into details. Then the author can still come back and say “but I really double-checked everything, and I stand by the quality of the package”. I’m certainly not opposed to the appropriate use of LLMs along the lines of Chris’ Discourse post on the topic.

It is also important to note that the review process for the General registry basically doesn’t involve any really hard rules. We will still judge things on a case-by-case basis and always try to reach good-faith consensus. But I think an official “vibe-coded packages are not suitable for registration in the General registry” guideline would be a helpful baseline.

I’d much prefer to do the “gut check” only. As soon as we add something automated, people start “gaming” it in exactly the way you suggest (instructing Claude to avoid emojis). This already happens with the automated name checks: people rename their packages just to “make the bot happy”, even if the new name is worse than the original and they should have just asked for an override.

17 Likes

That’s another important point, @goerz is a precious resource and we don’t want to do a DDoS attack on him.

11 Likes

I don’t see why vibe-coded packages should be singled out. That’s just an exception for the sake of not liking something, specifically because it gives you bad quality in many cases. But there are lots of other things like that: people shouldn’t write their own linear algebra routines in their package, because in almost every case it’s a sign of a bad-quality package (i.e. a 3-line matmul or lu), and probably an even better indicator of bad quality than knowing that a package is purely vibe-coded. But of course there are clear exceptions, where writing a BLAS instead of calling one could just be better.

The rule should be just about quality. As long as it’s documented, well-tested, does what it says, solves a real problem, is something others would use, and is licensed correctly it’s an accept. How you get to that level of quality really doesn’t matter.

But what is different is the speed at which you can pump out low-quality packages. I think what we’d be looking for is more of a rule to throttle new packages, or to block a user who abuses it, i.e. keeps trying to take names and keeps getting caught; then we just ban them. If that is what we’re worried about, then the target should be that behavior, not LLMs in general. I mean, you can even do that by just forking other packages and renaming them, or generating packages from homework materials (and people have done this stuff and have gotten warnings).

Indeed, you can vibe code 90% of the package and fix just 60 lines to get a good package in many cases. Or you could write all of the code, vibe code the first README, and say “ehh, that’s good enough”. That, for example, would trigger my “there are lots of bullets and emojis, therefore it’s Claude-written” reaction just from glancing at the README, but all of the code could be fine.

3 Likes

It’s currently tricky to “ban” users from General. We started to put something into place, but it never made it into production. Besides, people can change for the better, so “bans” shouldn’t be permanent.

The rule should be just about quality.

In principle, yes, but that’s even harder (and more unpleasant) to argue about.

As long as it’s documented, well-tested, does what it says, solves a real problem, is something others would use, and is licensed correctly it’s an accept. How you get to that level of quality really doesn’t matter.

I absolutely agree, and obviously we don’t want to ban LLMs categorically. If someone manages to use an LLM effectively to produce a good package, and has the experience to guide and check the output, that can be totally fine in principle (but that’s not what typically happens). There’s also a (somewhat blurry) line between “LLM-assisted” (fine) and “vibe-coded” (not fine).

Indeed, you can vibe code 90% of the package and fix just 60 lines to get a good package in many cases.

If done judiciously, that’s fine.

That’s very specifically what I would want to reject (wearing my registry triage member hat). The code could be fine, or it could not be. It’s extremely difficult and time-consuming to check (especially if I’m not an expert in the package’s domain). Of course, I can ask the author to “fix it” (trim down / check the LLM-generated parts), but then how can I trust that?

This is exactly a situation we’re currently dealing with, with packages that are pending registration or have recently been registered.

There are also certain ways that an LLM-assisted package can convince me that it’s fine: properly set up (Documenter-based) documentation that isn’t itself overly LLM-vibey, and a test suite that is properly set up in CI and shows decent coverage. Vibe-coded packages almost always lack these… and their absence could also serve as a “rejection criterion” by itself. Although we usually only require “some documentation” (a decent README can be enough) and haven’t really enforced testing, maybe we should start to enforce testing as a hard requirement. That would go a long way toward ensuring correctness.
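
For reference, the kind of test setup meant here is just the standard Julia layout: a `test/runtests.jl` using the `Test` stdlib, run in CI via `Pkg.test()` with coverage collection enabled. A minimal placeholder sketch (package and function names are hypothetical):

```julia
# test/runtests.jl — minimal test suite for a hypothetical MyPackage.jl
using Test
using MyPackage   # placeholder package name

@testset "MyPackage.jl" begin
    @test MyPackage.add_one(1) == 2                   # placeholder function
    @test MyPackage.add_one(0.5) ≈ 1.5
    @test_throws MethodError MyPackage.add_one("one") # only defined for numbers
end
```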

5 Likes

I think I’m echoing @goerz here, but regarding:

“Focus on the results” is an appealing principle but the problem is that the intuitions, experience, and heuristics humans use to assess results often don’t work well for LLM code. Humans make different errors and give different indications of lack of understanding in the code they write than LLMs do. So it becomes extremely burdensome to try to figure out if an unreviewed LLM-generated package is actually doing what it should.

That’s why I think it could make sense to have dedicated guidance around LLM code specifically, if there is something we can do to lower maintenance burden.

Basically the status quo seems kind of unsustainable, where we rely on extensive effort by volunteers to assess case-by-case. One option is to pull back from curation, move towards npm/pypi/cargo, and say anything goes. That has lower maintenance! Another option is to add tools to the maintainer’s toolbelt so they don’t need to do extensive checks. Or something else :slight_smile:. But I think it’s missing the point to say quality is the only thing that matters, neglecting the effort required to assess quality.

edit: also I think rate-limiting, bans, etc. are a separate issue, which is a bit complicated technically but fairly straightforward from a social/community perspective, so I don’t think we need to focus on it here.

5 Likes

We’ve basically already done that. Everyone still remembers Tony, right? We learned all the way back then that having a code czar doesn’t scale. That’s when we changed to General being permissive by default, and it was required in order to make v1.0 Julia work given the growth we had. From time to time someone comes in trying to police it again, and every time that happens they grow tired pretty quickly for the same reason.

I think we just have to be permissive. We should test what’s easy: for example, we should probably require docs and not auto-merge anything without docs or CI. But beyond what is easy to test, the human labor should go towards flagging issues when we can, rolling back / banning when we need to, etc., and we should be willing to accept that there are some things in General which may not match what we call high quality. That’s okay, as long as the truly bad issues (something gets abandoned, name-squatted, a security issue, etc.) are what the human time is saved for rapidly handling.
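
As a sketch of the “test what’s easy” part (a hypothetical helper; the paths just follow common Julia package conventions, and passing it says nothing about whether the docs or tests are any good):

```julia
# Hypothetical pre-merge check on a package checkout: is the basic
# scaffolding (README, tests, docs, CI workflows) present at all?
function has_basic_scaffolding(pkg_dir::AbstractString)
    has_readme = isfile(joinpath(pkg_dir, "README.md"))
    has_tests  = isfile(joinpath(pkg_dir, "test", "runtests.jl"))
    has_docs   = isdir(joinpath(pkg_dir, "docs", "src"))
    has_ci     = isdir(joinpath(pkg_dir, ".github", "workflows"))
    return (; has_readme, has_tests, has_docs, has_ci)
end
```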

2 Likes

^this :100:

I think longer term, an appropriate approach could be a split in the registry, as linux distributions often do:

Have a smaller registry for “curated”, and a larger extended registry for “almost uncurated”.

The extended uncurated registry could be more open to vibe-coded or generally garbage packages, with the main central control of removing spam and malware / backdoors.

The smaller curated registry on the other hand could be much stricter. Especially, it should have a mechanism for taking over abandoned packages (package X is abandoned, but other packages rely on it → X can be taken over and fixed). Also, it could be useful for discovery (the current general registry contains too much abandoned crap to be very useful for that).

Such a registry split doesn’t solve the vibe-coding issues, but it widens the space of possible solutions we can consider.

13 Likes

Not really. I think if anything, we’re moving towards tightening the supervision of the General registry. Certainly, there’s a lot more emphasis on good package naming. We don’t generally do detailed code / quality reviews, but it’s a lot more regulated than PyPI. I feel like the Julia community takes the “collective ownership” of the ecosystem pretty seriously, and that’s a good thing.

6 Likes

No way, it’s still definitely way less supervised than before. You’re not checking every line of code of every patch release like we did in the past? Not allowing any type-piracy? Etc. It used to be way less permissive than it is now. We of course had to drop that because of the human overhead.

Oh, I’m not saying that! (That was before my time). And I’m definitely not in favor of having CRAN-like code reviews. The very large majority of packages (those that get a green check from me on the #new-packages-feed on Slack) just sail through without issues, and I spend about 10 seconds on reviewing them.

Every once in a while, somebody who is interested in a package notices some quality issue, and then things can go deep.

But for the small percentage of packages that doesn’t make it through a glance, it’s pretty frequent that there is some recommendation for a more suitable package name. Also, “is there any chance this could be contributed to the existing package X instead of registering a new package?”. And, of course, the most common issue: “Could you add a little bit of documentation?”

Some of it is also connected to how general the package name is. I was just making the point in one of the GitHub threads that if “Distributions” were registered today, I would put a lot of scrutiny on whether it could live up to the expectations that come with a general name like that (being in an org with multiple maintainers that have some kind of long-term funding).

You might be happy to hear that SciML is pretty much a “flagship” org, so I tend to have pretty high confidence in anything that goes in there and would treat it with a very light touch. So you may not be seeing much of this kind of activity.

I’d be more sympathetic to the “focus on the results” point of view if I had ever seen a vibe-coded package whose quality wasn’t abysmal. The way things are, “focus on the results” is a lot of work for no gain.

Perhaps we should wait for proof that unicorns exist before we go hunting for unicorns?

1 Like

I vibecode many packages. Mostly small experiments, but some may eventually be useful enough to register. Certainly vibecoded packages have lower code quality than manually written ones at the moment, even in skillful hands (and can be abysmal in unskilled hands), but I don’t think that’s grounds for categorical refusal. After all, we don’t have a rule “must have at least 4 months of programming experience before registering a package”, even though similar things could be said.

I think the answer here is just to extend the disclosure requirement from Base to the registry. If the AI use is obvious but not disclosed, ding the author for a rule violation. I think that solves a fair bit of the “vibe-coded not because it’s useful, but because I want to look impressive” style PRs. Then, for packages with the proper disclosure, there can be a case-by-case discussion on code quality.

In general, I’ve always been a strong advocate that the General registry maintainers have plenary authority to manage the namespace for the benefit of the end user. However, as with all such power, it needs to be exercised with caution to avoid the appearance of capriciousness. I’m worried we’ve moved a bit too far in that direction recently anyway (not a criticism of the people doing the work - it’s a hard job). All that to say: let’s try the disclosure rule and see what happens.

4 Likes

But then I have to really think about whether it just doesn’t make sense to me, and maybe there’s just a different perspective, or whether I can tell for sure that it’s LLM slop.

What is the motive for people to create and register slop packages? Is this resume padding? Are the packages there to call hidden malicious code? Is it about “reserving” package names for some future purpose? Or are people curious about the package creation process and just testing it out?

Maybe there should be a secondary general registry for new packages, which would require a flag to Pkg.add. If enough other packages take a package from it as a dependency, then it gets added to General. If not, then after some fixed time it gets dropped.
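
Nothing like that two-tier setup exists today, but Pkg already supports multiple registries, so a “staging” registry could at least be made strictly opt-in. A sketch (the registry URL and package name are placeholders):

```julia
using Pkg

# Add a hypothetical secondary registry alongside General (placeholder URL).
Pkg.Registry.add(Pkg.RegistrySpec(url = "https://github.com/JuliaRegistries/Staging"))

# Packages from any installed registry can then be added as usual.
Pkg.add("SomeNewPackage")   # placeholder package name
```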

Could you give an example of such a submission? Not trying to single anyone out, just want to know what it looks like when it’s so bad you would not accept it.

Hard to say. The most common factor seems to be inexperience coupled with over-enthusiasm. People can get very excited about vibe coding.

People also sometimes overestimate how important it is to register a package in the General registry. Coming from other ecosystems, they sometimes feel they should register as soon as the project is off the ground (or even make “placeholder registrations”, which we prohibit, but which are common e.g. on PyPI).

It’s generally better to make sure a package is actually “ready” for a wide audience before registering it. Before that, there’s always LocalRegistry, and the tooling around working with unregistered packages is also becoming much better, with new Pkg features like [sources] and [workspace].
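
For example, depending on an unregistered package straight from its repository via the [sources] section of a Project.toml looks roughly like this (the package name, UUID, and URL are placeholders, and this needs a recent Julia/Pkg version):

```toml
[deps]
MyUnregisteredDep = "12345678-1234-1234-1234-123456789abc"  # placeholder UUID

[sources]
MyUnregisteredDep = {url = "https://github.com/someuser/MyUnregisteredDep.jl", rev = "main"}
```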

2 Likes

I’m a little reluctant, but you can find it if you dig into the #new-packages feed on Slack and look for recent submissions that don’t have a green checkmark, or if you scroll through the open PRs in the registry. But please, let’s not dump on that particular person.

6 Likes