My position on AI could be described as Zizekian, in both its technical and political aspects. In other words, I dislike its impact on the environment and its potential to stymie human creativity and harm independent creators, but I also view it with optimistic caution.
Many of these arguments seem like echoes of the arguments made against the printing press, the opening of universities to non-nobles, the radio, the television, and the internet. I’m sure that back in the Neolithic there were people objecting to recording writs of trade on clay rather than keeping it all in your head.
As part of this cautious optimism, and out of my own curiosity, I asked @kahliburke to let loose the code-review process he described earlier on the package I’m developing, PortfolioOptimisers.jl.
I think this is a good test because it is quite a complex codebase. Its documentation is incomplete but still extensive, including examples, and it has almost 90% test coverage. So it seemed like a good idea to put an agent through its paces reading source code, docs, and tests to come up with some sort of overall code review.
I know this is different from code generation, but what an LLM thinks code should look like and the code it actually produces should be highly correlated.
I have to say, I’m impressed. It hasn’t really offered any new insights, but it has found pretty much every single problem I have with the codebase. Unfortunately, it doesn’t usually offer better alternatives: it complains about the same things I complain about, but neither of us has found anything better.
It also suggests some suspect changes. I’m using the concrete struct pattern, and it suggests I either remove types or lump sets of parameters into smaller structs. However, the first loses type information for no benefit, because dispatch does not usually happen at the struct level but inside functions using only some fields of the struct. The second adds unnecessary friction without enabling reusability of the smaller structs, as they would only ever appear as part of the larger struct and nowhere else.
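To illustrate, here is a minimal sketch of the pattern in question. The names (`Variance`, `covariance`, `_cov`) are hypothetical, not the package’s actual API; the point is only that dispatch happens on individual *fields* inside functions, so grouping those fields into sub-structs would add a layer of wrapping without enabling any new dispatch.

```julia
# Hypothetical sketch (not actual PortfolioOptimisers.jl code) of the
# concrete struct pattern: every field's type is a type parameter, so
# the struct stays fully concrete.
struct Variance{T1 <: Union{Nothing, AbstractMatrix}, T2 <: Real}
    sigma::T1   # optional covariance override
    scale::T2   # risk scaling factor
end
Variance(; sigma = nothing, scale = 1.0) = Variance(sigma, scale)

# Dispatch happens on individual fields inside functions, not on the
# struct as a whole, so lumping fields into sub-structs buys nothing.
covariance(r::Variance, default::AbstractMatrix) = _cov(r.sigma, default)
_cov(::Nothing, default::AbstractMatrix) = default        # fall back
_cov(sigma::AbstractMatrix, ::AbstractMatrix) = sigma     # use override
```

Stripping the type parameters, as suggested, would make `r.sigma` an abstractly typed field and force a runtime branch where the field-level dispatch above is currently resolved statically.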
The one exception is its suggestion of trait-based approaches in some instances rather than ad hoc const-based ones. This is something I’ve thought about but haven’t got round to doing, because it requires more indirection in the code, and I’m unsure what advantages that would bring. Currently, if users want to define their own risk measure, they only have to define the high-level public API function. With a trait-based approach, they’d have to implement all the internal trait-based functions. Dispatch can happen either at the public API level or at the internal private level, and I’m not sure which is better.
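The trade-off can be sketched as follows. All names here (`expected_risk`, `RiskStyle`, and so on) are illustrative assumptions, not the package’s actual API: in the current style a new measure needs one public method, while in the trait-based style the public entry point routes through a trait and a new measure must also satisfy the internal trait layer.

```julia
# Hypothetical sketch of the two extension styles; not the package's API.
abstract type AbstractRiskMeasure end

# Current style: a user defines a type and one public API method. Done.
struct MyDrawdown <: AbstractRiskMeasure end
expected_risk(::MyDrawdown, w, X) = maximum(cumsum(-(X * w)))  # toy formula

# Trait-based style: the public entry point dispatches on a trait, so a
# new measure must also implement the internal trait method(s).
abstract type RiskStyle end
struct GeneralRisk <: RiskStyle end
RiskStyle(::AbstractRiskMeasure) = GeneralRisk()  # default trait value

expected_risk_traited(r::AbstractRiskMeasure, w, X) =
    _expected_risk(RiskStyle(r), r, w, X)
# The extra method a user would now have to write:
_expected_risk(::GeneralRisk, r::MyDrawdown, w, X) = expected_risk(r, w, X)
```

The trait layer makes it possible to share internal code paths across families of measures, but every extension point a user must implement is extra surface area, which is exactly the friction I’m weighing.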
For other purposes, such as when type information is not readily available at parse time because I’m working with vectors or nested vectors, I do use trait-based approaches, and it has said nothing about that.
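For that case, a minimal sketch of the idea (the names `nestedness` and `flatten_weights` are hypothetical, not functions from the package): when a container’s element types can’t be relied on, a small trait computed at runtime picks the code path, and dispatch takes over from there.

```julia
# Hypothetical sketch: with a loosely typed vector (e.g. Vector{Any}),
# element types can't be trusted, so inspect the elements at runtime
# and return a Val trait that dispatch can act on.
nestedness(x::AbstractVector) =
    all(v -> v isa AbstractVector, x) ? Val(:nested) : Val(:flat)

flatten_weights(x) = flatten_weights(nestedness(x), x)
flatten_weights(::Val{:flat}, x) = x                    # already flat
flatten_weights(::Val{:nested}, x) = reduce(vcat, x)    # concatenate
```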
In all, this is in line with what I’ve found LLMs can do. It’s surprisingly competent at being an extremely thorough, if at times overzealous, copy-editor. When generating code, this overzealousness can lead it to do stupid things: unnecessarily complicating matters and sometimes shooting itself in the foot.
Fortunately, in my experience, anyone with a cursory technical understanding of programming can quite easily spot mistakes or sus code. I think the big issue is that the volume and rate at which these things can generate code make it incredibly easy for a reviewer to miss things.
Honestly, I don’t know how one would police this other than the honour system (i.e. if you’ve used your vibe-coded package and can vouch for its code), some sort of allowed-contributor list, automated reviews via statistical tools (counting incidences of known LLM-generated patterns), actual reviews by a separate agent, and eventually an actual human review.
That, and having policies for keeping the official registries in “good” order, for some definition of “good”.