Not So Fast: AI Coding Tools Can Actually Reduce Productivity
Study Shows That Even Experienced Developers Dramatically Overestimate Gains
UPDATE: on Wednesday July 16, I’ll be holding a fireside chat in SF with the primary authors of the paper – register here to attend!
The buzz about AI coding tools is unrelenting. To listen to the reports, startups are launching with tiny engineering teams, non-programmers are “vibe-coding” entire apps, and the job market for entry-level programmers is crashing. But according to a METR experiment conducted in the spring of 2025, there’s at least one cohort that AI tools still aren’t serving.
METR performed a rigorous study (blog post, full paper) to measure the productivity gain provided by AI tools for experienced developers working on mature projects. The results are surprising everyone: a 19 percent decrease in productivity. Even the study participants themselves were surprised: they estimated that AI had increased their productivity by 20 percent. If you take away just one thing from this study, it should probably be this: when people report that AI has accelerated their work, they might be wrong!
This result seems “too bad to be true” – so astonishing that it almost has to be spurious. However, the study was carefully designed, and I believe the findings are real. At the same time, I believe that at least some of the anecdotal reports of huge productivity boosts are real. This study doesn’t expose AI coding tools as a fraud, but it does remind us that they have important limitations (for now, at least) – confirming some things my colleague Taren wrote about in a previous post, First, They Came for the Software Engineers….
To begin with, I’ll explain how the study was done, and why I believe its results.
Finally, A Proper Scientific Trial of AI Coding Productivity
The study was carried out in pretty much the most rigorous fashion possible: an honest-to-goodness randomized controlled trial under real-world conditions. The subjects were experienced developers carrying out their everyday work.
The methodology was as follows:
METR recruited 16 developers from major open-source projects.
Each developer selected a list of coding tasks from their todo list, breaking up large projects into tasks that they could complete in an hour or two. In all, 246 tasks were included in the study.
The developers estimated how long it would take them to complete each task (a) under normal conditions, and (b) without using any AI tools. The percentage difference between these figures yields the predicted speedup – the degree to which the developer expected that AI tools would boost their productivity.
Each task was randomly assigned to one of two categories: “AI Allowed” (the developer can use any tools they like) or “AI Disallowed” (the developer cannot use AI coding tools or features).
The developers went about their work, while recording their screens for later analysis. After each task, they reported the time spent[1]. For AI Allowed tasks, they also estimated how much time AI tools had saved them – the retrodicted speedup.
To compute the actual speedup – or, rather, slowdown! – provided by AI tools, the researchers compared the developers’ predictions of how long each task would take to the measured completion time. They found that the difference between predicted and actual times was 19% larger for AI Allowed tasks than for AI Disallowed tasks[2]. Remember that when the developers estimate the task time, they don’t yet know whether they’ll be using AI for that task, so their estimates are unbiased.
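To make the arithmetic concrete, here is a minimal sketch in Python of how figures of this kind could be computed from per-task records. The task records, field names, and numbers are invented for illustration; the sketch follows the ratio comparison described in footnote [2] but is not meant to reproduce the paper’s full statistical analysis.

```python
# Minimal illustrative sketch (not the authors' actual estimator).
# All task data below is made up.

def mean(xs):
    return sum(xs) / len(xs)

# Each task: estimated hours without AI (made before random assignment),
# estimated hours with AI, actual hours, and the assigned condition.
tasks = [
    {"est_no_ai": 2.0, "est_with_ai": 1.6, "actual": 2.5, "ai_allowed": True},
    {"est_no_ai": 1.5, "est_with_ai": 1.2, "actual": 1.9, "ai_allowed": True},
    {"est_no_ai": 2.0, "est_with_ai": 1.7, "actual": 2.1, "ai_allowed": False},
    {"est_no_ai": 1.0, "est_with_ai": 0.8, "actual": 1.0, "ai_allowed": False},
]

# Predicted speedup: how much faster developers *expected* to be with AI.
predicted_speedup = mean(
    [t["est_no_ai"] / t["est_with_ai"] - 1 for t in tasks]
)

# Observed effect: actual time relative to the no-AI estimate, compared
# across the two randomly assigned conditions (cf. footnote [2]).
ratio_ai = mean(
    [t["actual"] / t["est_no_ai"] for t in tasks if t["ai_allowed"]]
)
ratio_no_ai = mean(
    [t["actual"] / t["est_no_ai"] for t in tasks if not t["ai_allowed"]]
)
slowdown = ratio_ai / ratio_no_ai - 1  # positive means AI Allowed took longer

print(f"predicted speedup: {predicted_speedup:+.0%}")
print(f"observed slowdown on AI Allowed tasks: {slowdown:+.0%}")
```

Because the no-AI estimate is made before the task is assigned to a condition, any systematic optimism in the estimates affects both groups equally and cancels out of the comparison.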

[Image caption: Sure, AI writes code much faster than a human, but that doesn’t mean you’ll be the first to finish!]
The only significant way in which the study design falls short of the scientific gold standard is that it was not blinded: once work began, both the participants and the researchers knew whether AI tools were being used. This is of course unavoidable; there is no reasonable way of providing a “placebo” coding assistant. However, the researchers have specifically looked for, and found evidence against, a long list of alternative explanations – including the possibility of bias due to the non-blinded nature of the study. It really does seem that for the type of work carried out in this study, allowing developers to use AI tools slowed them down.
Addressing Every Objection You Thought Of, And Some You Didn’t
As I read through the study, I thought of half a dozen ways the results could have been confounded or otherwise invalid. It turns out that I’m not very imaginative; the authors address many more possible explanations.
The John Henry Effect: perhaps the developers were motivated to “beat the machine”, working extra-hard on AI Disallowed tasks. If this were the case, you might expect to see the effect taper off over the course of the study, as the excitement and novelty wear off – recall that subjects performed an average of 15 tasks of 1-2 hours each. No such tapering was observed.
Underuse of AI. Perhaps developers weren’t using AI tools even when allowed? However, this could only explain a lack of productivity gain; it can’t explain a loss. And exit interviews and analysis of screen recordings both showed substantial use of AI (84% of screen recordings for AI Allowed tasks showed at least some use of AI).
Cheating. Perhaps developers used AI tools for AI Disallowed tasks? But again, this could at most explain a neutral result, not a productivity loss. And exit interviews and screen recordings found only rare instances of cheating.
Overly optimistic time estimates. Developers are notorious for underestimating the amount of time it will take them to complete a task. However, because developers estimate time for a task before knowing whether they will have access to AI tools, any misestimation effects should cancel out.
Skewed task definitions. If developers were defining AI Allowed tasks differently than AI Disallowed tasks, that could throw off the results. However, this is ruled out, because developers specified their tasks and broke them down into chunks requiring less than two hours before the tasks were randomly assigned to the AI Allowed or AI Disallowed categories.
Dropout effects. Suppose that some AI Disallowed tasks were annoying to complete without AI, and so developers didn’t finish these tasks during the study and their results were not recorded. This could throw off the results. However, only 13 tasks were left uncompleted, against a total of 246 tasks completed, and the incomplete tasks were evenly divided between AI Allowed and AI Disallowed.
Out-of-date tools. AI capabilities are progressing rapidly; perhaps the results suffered from use of out-of-date tools? But developers primarily used Cursor Pro with Claude 3.5/3.7 Sonnet, which were considered state of the art during the time period of the study (February – June 2025).
Inaccurate time reporting. Statistics are based on developers’ reports of how long it took them to complete each task, both before and after PR review. However, an analysis using only before-review times, or using the lengths of screen recordings, yields similar results.
None of the issues mentioned so far seem to have thrown off the results. Here are a few factors that may have contributed to the observed decrease in productivity.
Overuse of AI. The paper states that “developers were instructed to use AI to whatever degree they thought would make them most productive”. However, some subjects seem to have gotten carried away, and this may have contributed to the observed slowdown. In exit interviews, subjects were roughly evenly divided in describing their AI usage as “using AI as normal” (70 tasks), “experimenting with AI” (119 tasks) or “using as much AI as possible” (57 tasks). Those who self-reported as being in the first category were less slowed down by AI (though they still were not sped up).
Lack of experience with AI tools. If developers weren’t experienced with AI tools, they might make poor use of them. As it turned out, 93% of participants reported past experience with tools like ChatGPT, and 44% had experience with Cursor (currently the most popular dedicated AI coding tool). This makes it sound like there was a range of experience levels, with a significant number of participants having relatively little experience with AI tools, so it might be that developers would see more benefit from these tools once they have more experience. However, all developers were gaining experience with AI tools over the course of the study, and this did not result in observable improvements from the beginning of the study to the end. (Also, all participants received “live basic training” for Cursor at the outset of the study.)
A related potential issue stems from the fact that some study participants switched from their normal development environment to use Cursor for AI Allowed tasks during the study. However, “broadly, developers reported that they were not significantly inconvenienced or affected by these differences compared to their normal workflows”.
It doesn’t seem that inexperience was a major problem, but it may be that developers with more expertise in making the best use of AI tools would see better results.
Difference in thoroughness. Perhaps developers using AI tools expanded the scope of the task: for instance, writing code to handle more edge cases, adding additional features, or testing or documenting code more thoroughly. As potential evidence in this direction, developers added 47% more lines of code (per forecasted task size[3]) for AI Allowed tasks than AI Disallowed tasks. However, the study authors believe that this is at best weak evidence for scope creep. In private communication, they cited a number of reasons for this belief:
Line counts vary wildly from task to task (sometimes due to large auto-generated files), so there is a lot of “noise” in this measurement. Dividing two noisy numbers (lines of code in AI Allowed tasks vs. AI Disallowed tasks) yields a very noisy result. The observed difference is “not statistically significant”[4].
The study authors examined many different metrics; it’s to be expected that at least one will show a spurious result (this XKCD comic hits the nail on the head).
In manual review, the authors saw little evidence of material difference in the nature of the AI Allowed work; they noticed a slight tendency toward more tests and comments, which could again have been spurious.
Perhaps the strongest evidence that scope creep did not contribute to slowdown is that increased time for AI Allowed tasks was greater on tasks where developers did not report scope creep:

This is the opposite of what you would expect to see if scope creep were a significant cause of the extra time spent on AI Allowed tasks.
Even if the difference in line counts is a real effect, there are potential bad explanations (bloated code, more duplication, unnecessary bells and whistles) as well as good ones.
More time might not mean more effort. Even if developers are spending more time when using AI tools, they might be expending less energy. Reviewing / correcting code is often (though not always!) easier than writing from scratch, and time spent waiting for AI can be used to relax or do other things.
Overall, it seems possible that the impact of AI tools was not quite as bad as it seems; some of the measured 19% productivity decrease could be paying off in more thorough work and reduced energy drain on developers, and some could be explained by overuse of AI by subjects who were overly focused on their participation in the study. But these tools don’t seem to be helping much, and they might really be making productivity worse. How could that be?
Some Kind of Help is the Kind of Help We All Can Do Without
Based on exit interviews and analysis of screen recordings, the study authors identified several key sources of reduced productivity. The biggest issue is that the code generated by AI tools was generally not up to the high standards of these open-source projects. Developers spent substantial amounts of time reviewing the AI’s output, which often led to multiple rounds of prompting the AI, waiting for it to generate code, reviewing the code, discarding it as fatally flawed, and prompting the AI again. (The paper notes that only 39% of code generations from Cursor[5] were accepted; bear in mind that developers might have to rework even code that they “accept”.) In many cases, the developers would eventually throw up their hands and write the code themselves.
Based on the screen recordings, here is where developers spent their time:
Note that this graph reflects percentages of overall task completion time, and AI Allowed tasks took longer on average to complete, so a green bar of a given height represents more time than a purple bar of that same height[6]. You can see that for AI Allowed tasks, developers spent less time researching and writing code (though, due to the scale issues, the difference was less than visually apparent). Adjusting for scale, they spent roughly the same amount of time on “testing & debugging” and “git & environment”, and considerably more time idle – perhaps because waiting for AI tools causes people to lose flow. In any case, the moderate savings on researching and writing code was more than overcome by the time spent prompting the AI, waiting for it to generate code, and then reviewing its output.
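As a concrete illustration of that scale adjustment, the sketch below converts percentage-of-time bars into absolute minutes using the average completion time for each condition. The durations and category percentages are hypothetical placeholders, not values reported in the study.

```python
# Minimal sketch of the scale adjustment described above. The average task
# durations and percentage breakdowns are invented placeholders.

avg_minutes = {"AI Allowed": 120, "AI Disallowed": 100}  # hypothetical

share_of_time = {  # hypothetical percentage breakdowns per condition
    "AI Allowed":    {"writing code": 25, "reviewing AI output": 10,
                      "prompting/waiting": 12, "idle": 8},
    "AI Disallowed": {"writing code": 35, "reviewing AI output": 0,
                      "prompting/waiting": 0, "idle": 4},
}

# Convert each percentage into absolute minutes, so a bar of a given height
# in one condition can be compared fairly with the other condition.
for condition, shares in share_of_time.items():
    total = avg_minutes[condition]
    minutes = {activity: total * pct / 100 for activity, pct in shares.items()}
    print(condition, {a: round(m, 1) for a, m in minutes.items()})
```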
How can we reconcile these results with the constant reports of AI coding tools working miracles?
This Is The Unevenly Distributed Future I Was Telling You About
Back in December, I wrote about how current AI tools are very good at some things, very bad at others, and the dividing line is extremely jagged. That jagged dividing line meanders right through the constellation of work that we call “software development”.
In First, They Came for the Software Engineers…, Taren wrote:
Typically, large productivity boosts occur for small, well-defined, greenfield projects, or when an engineer is first learning a new language or API. For other work, gains from using current AI tools are often far more modest – and potentially entirely offset by increased time needed for review, debugging, integration, and managing AI quirks.
Several aspects of the study play to the weaknesses of current tools. First, it was conducted on mature projects with extensive codebases. The average project in the study is over 10 years old and contains over 1 million lines of code – the opposite of “greenfield”. Carrying out a task may require understanding large portions of the codebase, something that current AI tools struggle with. (This may be less a fundamental weakness of AI models, and more a design choice in some coding tools to limit the amount of “context” sent to the model, in order to control costs and get quicker responses.) It also involved editing large files, which may be “out of distribution” for most AI models (i.e. they may not get much training on large files). The paper includes some anecdotal reports from developers which support this idea:
In software development, developers often rely on their own undocumented knowledge of the codebase to assist design and implementation decisions. In our study, developers often note that AIs lack this tacit codebase knowledge, resulting in less useful AI outputs. One developer notes that AI often acts like a new contributor to the repository, and that “AI doesn’t pick the right location to make the edits.” Another developer notes that while “we [..] know the data that will interact with the code, but the model doesn’t know the data. It doesn’t know we need to take care of this weird case of backwards compatibility and [thus] keep this specific line. And this is very hard to give as [context to the model].”.
We hypothesize that the size and maturity of the included repositories increases the amount of tacit knowledge that experienced developers rely on when completing their work—because AI systems may have less access to this knowledge, it may be more difficult for them to assist experienced developers on these issues.
Second, most of these open-source projects have strict style guidelines. The experienced developers in the study were accustomed to coding according to their project’s guidelines, but the AI tools are not – thus requiring developers to review and fix the AI’s output.
Third, the developers in the study had years of experience working on their projects, meaning that they were able to work very efficiently – posing a high standard for AI to compete with.
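To make the context-limit point above more concrete, here is a purely illustrative sketch of budget-driven context selection. It is not how Cursor or any particular tool actually works; the file names, relevance heuristic, and token budget are all hypothetical. The point is simply that whatever a budget-constrained selector fails to rank highly, such as an undocumented backwards-compatibility constraint living in another file, never reaches the model at all.

```python
# Purely illustrative sketch of budget-driven context selection -- not the
# behavior of Cursor or any specific tool. File names, scores, and the
# token budget are hypothetical.

def select_context(files, query_terms, token_budget):
    """Greedily pick files by a naive relevance score until the budget is hit."""
    def score(f):
        text = f["text"].lower()
        return sum(text.count(term) for term in query_terms)

    chosen, used = [], 0
    for f in sorted(files, key=score, reverse=True):
        if used + f["tokens"] > token_budget:
            continue  # skip files that would blow the budget
        chosen.append(f["path"])
        used += f["tokens"]
    return chosen

repo = [
    {"path": "parser.py", "tokens": 4000, "text": "def parse(tokens): ..."},
    {"path": "legacy_compat.py", "tokens": 9000,
     "text": "# keep this line for backwards compatibility with v1 data"},
    {"path": "utils.py", "tokens": 1500, "text": "misc helpers"},
]

# A query about the parser never surfaces legacy_compat.py, so the model
# edits the "right-looking" place without knowing about the weird v1 case.
print(select_context(repo, query_terms=["parse"], token_budget=6000))
# -> ['parser.py', 'utils.py']
```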
There have been other studies on the productivity impact of AI tools in real-world settings. One 2024 study found a 26% “increase in the number of completed tasks”, even though the subjects were using older tools (AI tools have improved dramatically in the last year). The methodology was less rigorous[7], but perhaps more important is that this study involved less-experienced developers working on a wider range of projects. The study notes that “less experienced developers showed higher adoption rates and greater productivity gains”, consistent with the idea that current AI tools are less useful for the experienced developers in the new METR study.
Should We Judge LLMs by Competitive Programming?
The observed 19% productivity decrease stands in particularly sharp contrast to AI scores on coding benchmarks, where models are often found to rank at a human-elite level in coding competitions (though a recent study has called this into question). Coding competition problems are exceptionally small, well-defined, isolated, and greenfield (starting from scratch rather than working in an existing codebase), thus playing directly to AI’s strengths.
Enthusiastic anecdotes about time savings from AI tools often originate from within the big AI labs themselves. This may partly reflect misplaced enthusiasm – remember that the participants in this study believed that AI tools were speeding them up, even as they slowed them down. But it’s also the case that some of the work that goes on in those labs is better suited to AI tools, from coding up a small training experiment to adding a UI element to a chatbot. (It’s worth noting that not all lab insiders report significant value from AI coding tools; this might reflect the range of work.) And it’s possible that engineers at the AI labs are more experienced at using their own tools and benefit from sharing tips with their coworkers.
Here are a few references to recent papers with additional data points; I have not read the papers:
Zvi Mowshowitz (source):
New paper introduces LiveCodeBench Pro, which suggests that AIs are not as good at competitive programming as we have been led to believe. Some top models look like they weren’t tested, but these scores for the same model are lower across the board and all were 0% on hard problems, so the extrapolation is clear. [This benchmark is noteworthy because some of the problems are not published to the Internet, thus avoiding “memorization” issues which plague many coding benchmarks.]
Derek Thompson (source, including a link to the paper):
New paper: When a company goes from 0--> 30% of its code written by AI, a key measure of productivity only increases by 2.4%
Ethan Mollick (source):
Individuals keep self-reporting huge gains in productivity from AI & controlled experiments in many industries keep finding these boosts are real, yet most firms are not seeing big effects. Why? Because gaining from AI requires organizational innovation.
What To Take Away
This study was conducted in early to mid 2025. AI models are only going to get better from here. The coding applications built on those models, like Cursor, are going to keep improving to make better use of the models[8]. And developers are going to get better at using those applications efficiently and effectively – posing the right kinds of tasks, and providing enough context for the tool to do what they want. Things might change rapidly; the modern wave of LLM-based coding tools is only a couple of years old!
AI tools are also going to expand to address more of the software developer’s job, including reviewing code written by other developers, testing, and even reviewing and testing code written by other AIs.
The study’s finding of a 19% performance decrease may seem discouraging at first glance, but it applies to a difficult scenario for AI tools (experienced developers working in complex codebases with high quality standards), and may be partially explained by developers choosing a more relaxed pace to conserve energy, or leveraging AI to do a more thorough job. And of course results will improve over time. The paper should not be read as “debunking” the idea of an AI 2027-style software explosion, but it may indicate that significant feedback loops in AI progress may be farther away than anticipated – even if some aspects of AI research involve small throwaway projects that may be a better fit for AI coding tools. Meanwhile, it remains to be seen whether AI is generating bloated or otherwise problematic code that will cause compounding problems as more and more code is written by AI.
But perhaps the most important takeaway is that even as developers were completing tasks 19% more slowly when using AI, they thought they were going 20% faster[9]. Many assessments of AI impact are based on surveys or anecdotal reports, and here we have hard data showing that such results can be remarkably misleading.
Thanks to Nate Rush and Joel Becker for providing us with early access to the study, answering my incessant questions, and providing feedback on this post, and to Daniel Kokotajlo, Joyce Er, and Taren Stinebrickner-Kauffman for additional perspective and feedback.
1. Time measurements were self-reported by the developers. However, later analysis showed that alternate time measurements, such as the length of the video recording or the wall clock time until a pull request is created, yield similar results.
2. That is, the ratio of (actual time taken) to (up-front estimated time if no AI tools were allowed) was 19% larger for AI Allowed tasks than for AI Disallowed tasks.
3. That is, the number of lines added divided by the number of hours the developer predicted a task would take to complete, was 47% more for AI Allowed tasks.
4. The p-value used for statistical significance is not stated in the paper, but an earlier draft made reference to p=0.05.
5. Note that this includes agent mode as well as autocomplete.
6. The authors note that this graph represents only some of the tasks completed in the study, because it was too much work to review and label all of the screen recordings. Many of the tasks where the largest slowdown occurred are not reflected in the graph, “to some extent because it's cheaper to pay labelers to [review] shorter videos”.
7. For one thing, in the 2024 study, developers defined tasks to complete after learning whether they would be using AI tools or not.
8. “Agentic” coding tools like Claude Code and OpenAI Codex have been getting strong reviews, but are so new that none of the developers in this study had adopted them.
9. This despite the fact that, as the paper notes, these developers may be better calibrated than average, since the structure of the study encouraged them to pay attention to how they were spending their time.