World's 'first AI software engineer' fails 85% of its assigned tasks

A service touted as the 'first AI software engineer' has been debunked and found to have only successfully completed 15% of its tasks.

Jak Connor

Tech and Science Editor

Published Jan 23, 2025 12:31 PM CST


TL;DR: Devin, an AI tool by Cognition AI, claims to autonomously build and fix code but has limitations. Criticized for errors and inefficiency, it completed only a few tasks successfully. Despite its potential, reliability issues persist.

In the midst of the 'AI Revolution', there's been plenty of speculation about AI taking away jobs, and no sector has been dealing with those fears more than the software engineering industry.

However, programmers can rest assured that one of the latest tools touted as a fully autonomous AI software engineer reportedly has its limitations. Devin is an AI programming tool originally released by Cognition AI in March of 2024. The tool, hailed as the "first AI software engineer," raised job-security fears among programmers, particularly given that its claimed abilities included being able to "build and deploy apps end to end" and "autonomously find and fix bugs in codebases."

Following its release, Cognition uploaded a video entitled "Devin's Upwork Side Hustle", which essentially claimed that the tool could make money by completing Upwork tasks. In April 2024, veteran software developer Carl Brown of the YouTube channel Internet of Bugs quickly took to YouTube to debunk some of the tool's claims, citing criticisms such as:

"Devin didn't complete the advertised task. Instead, it generated errors in its own code and then fixed them"

Shortly after, the original poster of the Upwork ad released a video supporting Brown's claim.

Devin was rolled out to the general public in December 2024 at a price of $500 per month. Since then, the feedback has been somewhat similar. Three data scientists from Answer.AI, a reputable AI research and development lab, tested Devin and found that it completed only three out of 20 tasks (15%) successfully.

Another analysis conducted by engineers Hamel Husain, Isaac Flath, and Johno Whitaker followed a similar pattern, stating that "tasks that seemed straightforward often took days rather than hours" and that the tool had a concerning tendency to "press forward with tasks that weren't actually possible". The researchers did credit the tool, noting that it was impressive when it worked. However, they concluded: "that's the problem - it rarely worked."