This is an interesting paper on the efficiency of using LLM tools for coding. A takeaway is that, in the study, developers perceived a productivity gain of about 24%, while in reality they were about 19% slower. In Table 3 on page 19, the authors compare their results to other papers whose outcomes differ.
Developers then work on their assigned issues in their preferred order—they are allowed to flexibly complete their work as they normally would, and sometimes work on multiple issues at a time. After completing an issue to their satisfaction, they submit a pull request (PR) to their repository, which is typically reviewed by another developer. They make any changes suggested by the PR reviewer, and merge their completed PR into the repository. As the repositories included in the study have very high quality and review standards, merged PRs rarely contain mistakes or flaws. Finally, they self-report how long they spend working on each issue before and after PR review.
This pretty directly conflicts with the productive way to use LLMs. Basically, if they fail on a problem, it’s not even worth fighting them. Just dump it, scrap the PR, and move on. If you take that approach, it’s effectively impossible for the LLM to take up any meaningful amount of your time. Yes, they only solve a small percentage of problems, probably around 30% of issues, but that’s the point. Point them only at easy issues, dump the attempt if the solution is wrong, and take the free wins when you get them, nothing else.
If you’re sitting there like “the LLM can’t figure this one out, let me try again”, then you’re just stuck in a doom loop that isn’t worth anyone’s time. So I’m not surprised the study found it less productive: this is not a productive way to use them.
Thanks for your reply! I think you make a very important point, and it is possible that some developers worked that way. However, they were not required to, as stated on page 6:
Developers working on issues for which AI is allowed can use any AI tools of their choosing, or no AI tools if they prefer.
But it’s really important to use LLMs in a smart and non-naive way.
I’m not sure this is correct. I learned about GPT on Julia Discourse and have had an account since day one or two. At the beginning, I was using it to explain complex and abstract mathematical concepts related to quantum computing that were far beyond my math skills, and to gain a general understanding of them. It was hallucinating a lot, but for this particular subject it was still much better for me than reading the research papers directly. Later, I started using it in a web browser to get coding advice. It was still hallucinating, though a bit less over time.
I almost never code with GitHub, so I can’t fully comment on solving PRs automatically through a CLI. I usually chat with the model via Cursor, Windsurf, Trae, or Kiro. For about a year now, I’ve been coding exclusively with the help of AI. At first it was about 60% AI and 40% me, still with plenty of hallucinations. Now I usually just change a few lines by hand. I currently use Anthropic models almost exclusively, and occasionally GPT. At this point, I wouldn’t say it’s hallucinating anymore. I’d describe the issue as more of an interface problem between my brain and the model. Of course it still makes mistakes, but I think the main challenge is that the model often struggles to infer my true intentions. The interface is also slow, and inference could definitely be faster.
As for productivity: it takes time, no question about that. Sometimes more than expected. In addition to Julia, I also code in C and q, which are languages with quite steep learning curves. There’s no doubt I wouldn’t have been able to learn those languages on my own in the same amount of time, certainly not to the level where I can solve problems the way I can with the help of these models. However, in the long term, I’m not sure what strategy to choose. I think, in general, it’s better to invest in self-development … but … this time, making friends with these comp neuro models. My guess is that we’re currently seeing only a small part of their full potential. From the paper:
However, normal usage of Cursor’s AI tools does not typically involve sampling more than a few thousand tokens from models. Recent literature shows that model performance can improve significantly with respect to the number of tokens sampled at inference time [40], so it’s natural to wonder if the lack of speedup is driven by limited token spend.
However, we can imagine alternative elicitation strategies that effectively use much higher token spend, like sampling many trajectories in parallel from agents and using an LLM judge (or e.g. self-consistency [41]) to filter to the output most likely to be useful for the human. We do not provide evidence about these elicitation strategies, as developers in our study typically use Cursor and web LLMs like chatGPT, so it remains unclear how much effect these strategies would have on developer productivity in the wild.
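For what it’s worth, the self-consistency idea [41] mentioned in that passage is easy to sketch: sample several independent answers from the model and keep the one that appears most often. Below is a minimal sketch in Julia, purely for illustration and not from the paper; `sample_answer` and `my_llm_call` are hypothetical stand-ins for a single LLM call.

```julia
# Minimal self-consistency sketch (illustrative only, not from the paper).
# `sample_answer` is any zero-argument function that queries an LLM once
# and returns its final answer as a String.
function self_consistency(sample_answer::Function; n::Int = 16)
    counts = Dict{String,Int}()
    for _ in 1:n
        a = sample_answer()                 # one independent trajectory
        counts[a] = get(counts, a, 0) + 1   # tally identical final answers
    end
    return argmax(counts)                   # answer with the most votes
end

# Usage with a hypothetical LLM wrapper:
# best = self_consistency(() -> my_llm_call("fix the failing test …"); n = 32)
```

An LLM-judge variant would replace the vote with a second model call that picks the best of the n trajectories; either way, the extra token spend the paper talks about goes into the parallel samples.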
P.S. 1: To sum up, I guess the whole situation comes down to an analogy: is it better to write a paper with a typewriter, or with a computer and a search engine?
P.S. 2: BTW, I know that it’s prohibited here on Julia Discourse to ping people who aren’t participating in a topic (that’s what I read a few weeks ago). However, I think that — @mbauman — might be a late adopter. :- )
Also worth a look is Table 2, which details several ways in which the study doesn’t generalize, contradicting many blog posts and pop-science articles about it. I doubt anybody can make blanket statements; there are always teams working on completely different things, and it’s not surprising that an LLM can do much better in one application than in another.
It’s not prohibited, but it is discouraged because it’s most frequently done in highly disrespectful ways.
It looks a bit lame.