An “AI” that cannot correctly answer how many letters “r” are in the word “strawberry,”
and has similar gaping failures, is not “better than most humans at most tasks.”
Coding agents constantly confabulate and make up false claims, vandalize
repositories and so forth, all this is well-known and frequently discussed.
It’s not necessary to search far and wide for examples – a PR recently linked here by an overenthusiastic LLM proponent has one, for instance.
I suspect you overlooked a pretty key qualifier in my original statement:
“state of the art coding agents”
Surely you and I both agree that many models and many agents are sloppy, hallucinate, and are too error-prone to be useful. But that is not the case for Opus 4.6, Codex 5.4, or similar flagship tools from frontier labs! It is hard to overstate how much improvement has transpired over the past year.
Especially if you haven’t tried any of these tools in a while, or only use the free-tier models, I would recommend leaving some wiggle room in your opinions until you have seen what the latest releases can do.
I know you’re both using that pull request to grind your own axes, but let’s put the axes away. We don’t wield cudgels like that here, or on GitHub either. Our community standards of respect and civility apply in both places, and this exchange meets neither.
We’ve had a largely productive conversation here about General’s policies, but we’re now moving away from that. I’ll bring this to a close.