This is an interesting paper on the efficiency of using LLM tools for coding. A takeaway is that, in the study, developers perceived a productivity gain of about 24%, while in reality they were about 19% slower. In Table 3 on page 19, the authors compare their results to other papers whose outcomes differ.
Developers then work on their assigned issues in their preferred order—they are allowed to flexibly complete their work as they normally would, and sometimes work on multiple issues at a time. After completing an issue to their satisfaction, they submit a pull request (PR) to their repository, which is typically reviewed by another developer. They make any changes suggested by the PR reviewer, and merge their completed PR into the repository. As the repositories included in the study have very high quality and review standards, merged PRs rarely contain mistakes or flaws. Finally, they self-report how long they spend working on each issue before and after PR review.
This pretty directly conflicts with the productive way to use LLMs. Basically, if they fail on a problem, it’s not even worth fighting them. Just dump it, scrap the PR, and move on. If you take that approach, it’s effectively impossible for the LLM to take up any meaningful amount of your time. Yes, they only solve a small percentage of problems, probably around 30% of issues, but that’s the point. Point them only at easy issues, dump the attempt if the solution is wrong, and take the free wins when you get them, nothing else.
If you’re sitting there like “the LLM can’t figure this one out, let me try again”, then you’re just stuck in a doom loop that isn’t worth anyone’s time. So I’m not surprised the study found it less productive: this is not a productive way to use them.
Thanks for your reply! I think you make a very important point, and it is possible that some developers worked that way. However, they were not required to, as stated on page 6:
Developers working on issues for which AI is allowed can use any AI tools of their choosing, or no AI tools if they prefer.
But it’s really important to use LLMs in a smart and non-naive way.
I’m not sure this is correct. I learned about GPT on Julia Discourse and have had an account since day one or two. At the beginning, I was using it to explain complex and abstract mathematical concepts related to quantum computing that were far beyond my math skills, and to gain a general understanding of them. It was hallucinating a lot, but for this particular subject it was still much better for me than reading the research papers directly. Later, I started using it in a web browser to get coding advice. It was still hallucinating, though a bit less over time.
I almost never code with GitHub, so I can’t fully comment on solving PRs automatically through a CLI. I usually chat with the model via Cursor, Windsurf, Trae, or Kiro. For about a year now, I’ve been coding exclusively with the help of AI. At first it was about 60% AI and 40% me, still with plenty of hallucinations. Now I usually just change a few lines by hand. I currently use Anthropic models almost exclusively, and occasionally GPT. At this point, I wouldn’t say it’s hallucinating anymore. I’d describe the issue as more of an interface problem between my brain and the model. Of course it still makes mistakes, but I think the main challenge is that the model often struggles to infer my true intentions. The interface is also slow, and inference could definitely be faster.
As for productivity: it takes time, no question about that. Sometimes more than expected. In addition to Julia, I also code in C and q, which are languages with quite steep learning curves. There’s no doubt I wouldn’t have been able to learn those languages on my own in the same amount of time, certainly not to the level where I can solve problems the way I can with the help of these models. However, in the long term, I’m not sure what strategy to choose. I think, in general, it’s better to invest in self-development … but … this time, making friends with these comp neuro models. My guess is that we’re currently seeing only a small part of their full potential. From the paper:
However, normal usage of Cursor’s AI tools does not typically involve sampling more than a few thousand tokens from models. Recent literature shows that model performance can improve significantly with respect to the number of tokens sampled at inference time [40], so it’s natural to wonder if the lack of speedup is driven by limited token spend.
However, we can imagine alternative elicitation strategies that effectively use much higher token spend, like sampling many trajectories in parallel from agents and using an LLM judge (or e.g. self-consistency [41]) to filter to the output most likely to be useful for the human. We do not provide evidence about these elicitation strategies, as developers in our study typically use Cursor and web LLMs like chatGPT, so it remains unclear how much effect these strategies would have on developer productivity in the wild.
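For what it’s worth, the self-consistency idea [41] mentioned in that passage is easy to sketch: sample several independent answers from the model and keep the one that appears most often. Below is a minimal sketch in Julia, purely for illustration and not from the paper; `sample_answer` and `my_llm_call` are hypothetical stand-ins for a single LLM call.

```julia
# Minimal self-consistency sketch (illustrative only, not from the paper).
# `sample_answer` is any zero-argument function that queries an LLM once
# and returns its final answer as a String.
function self_consistency(sample_answer::Function; n::Int = 16)
    counts = Dict{String,Int}()
    for _ in 1:n
        a = sample_answer()                 # one independent trajectory
        counts[a] = get(counts, a, 0) + 1   # tally identical final answers
    end
    return argmax(counts)                   # answer with the most votes
end

# Usage with a hypothetical LLM wrapper:
# best = self_consistency(() -> my_llm_call("fix the failing test …"); n = 32)
```

An LLM-judge variant would replace the vote with a second model call that picks the best of the n trajectories; either way, the extra token spend the paper talks about goes into the parallel samples.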
P.S. 1: To sum up, I guess the whole situation comes down to an analogy: is it better to write a paper with a typewriter, or with a computer and a search engine?
P.S. 2: BTW, I know that it’s prohibited here on Julia Discourse to ping people who aren’t participating in a topic (that’s what I read a few weeks ago). However, I think that — @mbauman — might be a late adopter. :- )
Also worth a look is Table 2, which details several ways in which the study doesn’t generalize, contradicting many blog posts and pop-science articles about it. I doubt anybody can make blanket statements; there are always teams working on completely different things, and it’s not surprising that an LLM can do much better in one application than in another.
It’s not prohibited, but it is discouraged because it’s most frequently done in highly disrespectful ways.
It looks a bit lame.