Hallucinating progress: the risks of uncritical LLM adoption in software development

The current wave of blind optimism about LLMs in IT has left their risks underexamined. As the hype around AI is skyrocketing, it’s crucial to assess the wide-ranging consequences of integrating it in development workflows.

This article serves as a counterbalance to the overly positive narratives about AI within the software industry and focuses specifically on the risks of using LLMs in IT projects. While success stories exist, they are not the subject here. Instead, it examines where these tools can mislead, cause harm, or create systemic risks – areas that warrant closer scrutiny as adoption accelerates. As such, the tone will be deliberately critical.

The risks are interconnected and overlap in many ways. Nonetheless, I tried to list them as distinct dangers, loosely grouped into several categories.

The risk profile will vary depending on the level of integration of LLMs within a company. Risks increase as you move from chatting with an LLM, to agentic workflows, to vibe coding. It’s important to note that the risks are additive – deeper integration only adds new risks or increases existing ones, it doesn’t remove any. The list is also not meant to be exhaustive.

Finally, some of the cited works may appear “outdated” given the marketed pace of recent advances. However, rigorous research inevitably lags behind cutting-edge deployments, and our understanding must remain grounded in systematically validated findings. Ignoring these studies would risk discarding the most reliable evidence currently available.

LLM vs AI

To avoid misinterpretation, one must make sure the distinction between various AI-related terms is understood.

Artificial intelligence is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making.

Machine learning is a scientific discipline focused on enabling computers to learn from data.

Deep learning, a branch of machine learning, uses multilayered neural networks to process the data.

Large Language Models are computational models trained on a vast amount of data, designed for natural language processing tasks. The most popular LLMs are Generative Pre-trained Transformers and are based on a deep learning architecture called the transformer.

It’s important to note that this article focuses specifically on LLMs, specifically in software development. The risks mentioned here may not apply to other kinds of AI or other use cases.

At the same time, this is not a critique of the fields of machine learning or deep learning.

Deflating the hype

In order to properly examine the dangers of using LLMs to build software, one must first address the overwhelming hype surrounding this technology, as it prevents sane discussion.

LLM vendors have a vested monetary interest in presenting their models as something that will eclipse the impact of the industrial revolution. Software companies hoping that AI integration will boost their stock prices are equally motivated to support that narrative. The media is complicit, uncritically posting “CEO said a thing” articles¹²³⁴⁵ and reinforcing the perceived inevitability of AI taking over our lives.

In July 2025, Robin AI CEO Richard Robinson announced during an interview with The Verge that his company was building an AI lawyer⁶. In March 2025, Anthropic CEO Dario Amodei predicted that AI would be writing 90% of the code within 3 to 6 months, and essentially 100% of the code within 12 months⁷. In September 2025, Nvidia announced they would invest $100 billion into OpenAI⁸. In January 2026, Michael Truell, CEO of Anysphere said that he built a web browser in Cursor⁹. In February 2026 Anthropic announced that they asked Claude to build a C compiler and then it did¹⁰. This is just a tiny sliver of news we were bombarded by in the last few years.

It’s April 2026. It turned out that Robin AI’s AI was actually real humans (and the company went broke)¹¹. Most code everywhere is still written by humans, despite Amodei’s lofty claims. The $100 billion Nvidia/OpenAI deal is no longer happening¹². Cursor’s browser turned out to be a bunch of stolen spaghetti code that didn’t even compile¹³. Anthropic’s C compiler turned out to be completely unusable¹⁴.

A 2024 study by UpWork found that “47% of employees using AI say they have no idea how to achieve the productivity gains their employers expect, and 77% say these tools have actually decreased their productivity and added to their workload“, with 1 in 3 employees saying they will likely quit their jobs in the next six months because they are burned out or overworked.¹⁵

Gartner predicts that over 40% of AI-augmented coding projects will be canceled by 2027 due to escalating costs, unclear business value, and weak risk controls¹⁶. Dorian Smiley, co-founder and CTO of AI advisory service Codestrap said in a March 2026 interview with the Register that organizations are still struggling to figure out how AI fits into their business¹⁷.

Over $600 million of OpenAI shares are sitting for sale with no interest from buyers¹⁸. Half of planned US data center builds have been delayed or canceled¹⁹. I could go on.

The entire industry is driven by hype, with not very much to show for it.

Fundamental limitations

Hallucinations are an inevitable property of LLMs

LLMs are probabilistic next-token generators, trained specifically to predict the most likely next word based on a set of previous words. They don’t reason, they don’t have thoughts. They’re a very advanced form of autocomplete. It is impossible to eliminate hallucinations in LLMs²⁰²¹²².

Several of the risks in this article stem from the fact that LLMs hallucinate. This section exists to highlight the fact that this is unlikely to meaningfully change as the underlying architecture does not support a way to fix this problem.

Therefore, any LLM uses which assume that hallucinations will be fixed at some point are doomed to fail.

LLMs are non-deterministic

Given the same input, an LLM is likely to produce a different output each time it is invoked. There are ways to increase determinism, like setting the temperature setting of a model to 0. However, even that does not guarantee deterministic output. According to Shuveb Hussain from Unstract, “Different GPUs, CPU architectures, or even the same hardware under different conditions (temperature, load) can produce slightly different calculations”²³.

A 2024 study showed that this is true, as LLMs configured to be deterministic failed to consistently deliver repeatable results²⁴.

The consequence is that something that was confirmed to work at one time may not work subsequent times. LLM workflows cannot be reliably tested, because passing a test now does not mean passing the test in the future. In contrast to classic software based on deterministic algorithms, no “AI workflow” within the business can be guaranteed to work, and LLM answers cannot be guaranteed to be reproducible.

LLMs fail with complex problems

A context window is the maximum amount of data an LLM can process at one time while generating a response to a prompt. When the amount of information provided to the LLM exceeds the size of the context window, it leads to context overflow, which leads to loss of information. There are techniques to mitigate this, such as context compression. However, they all inevitably lead to a loss of information. At some point, the LLM “forgets” the information it was provided to make room for new input. Without the entire picture, it will inevitably start making mistakes. Even if all the necessary information can fit into the context window, however, complex tasks are likely to fail due to context rot.

Context rot is a phenomenon where the quality of LLM answers degrades as the amount of input text increases. A 2023 Stanford study found that “performance can degrade significantly when changing the position of relevant information“²⁵. With 20 retrieved documents totaling around 4,000 tokens, accuracy declined from 70-75% for information at positions 1 or 20 down to 55-60% when positioned in the middle. Another study found that LLM performance degrades 13.9%-85% “as input length increases but remains well within the models’ claimed lengths“²⁶.

The common advice for complex tasks in software is to create plenty of additional content to guide the LLMs – AGENTS.md, skills, pre-prompting, etc. However, adding more input to the model is likely to degrade the quality of the response, even if it fits within the context window.²⁷

Some may claim the answer to these problems is to throw more money at it and use multi-agent workflows. However, this presupposes that the tasks are parallelizable, that an LLM “understands” the problem well enough to be able to reliably deconstruct it into smaller steps that don’t require knowledge about the other steps, and that those steps can themselves be small enough to not cause issues like context rot. A Google Research study found that on parallelizable tasks, multi-agent coordination produced +81% improvement over single-agent. However, on sequential tasks, multi-agent coordination produced up to 70% degradation²⁸.

Finally, there’s a simple problem – LLMs lack the foresight of experienced engineers. If you don’t tell them something is a requirement, it likely won’t be implemented. This can easily be observed in code generation, where things like advanced validation or security checks are often omitted²⁹. The problem compounds when writing correct code requires a deep understanding of the organization for which it is produced, and its practices.

The supposed value of expressing code as common language is the ability to use a highly compressed representation of the description of the task and have the machine infer the “obvious” details. If every single edge case and piece of information must be provided to the LLM, it becomes more efficient to just write code manually.

LLMs confidently produce false information (and perform acts of de facto sabotage)

This one is mostly a byproduct of the fact that LLMs hallucinate. Because they do, they are very confident when they give untrue information. They are unlikely to say they don’t have enough information, or are unsure of the answer, even if they do not possess the knowledge necessary to process the prompt. Citations of fake cases and nonexistent legal authorities attributed to LLMs are on the rise³⁰. According to one study (n=70), 92% of participating clinicians had encountered medical hallucinations produced by LLMs, with 85% considering them capable of causing patient harm³¹. Another study found that 1% of audio transcripts produced by OpenAI’s Whisper “contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio”³². Air Canada was held financially liable after its chatbot provided false refund policy information to a passenger³³.

A popular use case for LLMs is to use them like an advanced search engine. Unfortunately, the answers produced cannot be trusted, as they may be incompatible with reality. An intensive international study coordinated by the European Broadcasting Union and led by the BBC found that AI assistants misrepresent news content 45% of the time, regardless of language or territory³⁴. A different study tasked GPT-4o with writing 6 literature reviews and found that over 56% of AI-generated citations were either entirely fake or contained errors³⁵. This means that asking an LLM to cite sources does not prevent it from inventing false information.

However, LLM deception can extend beyond factual errors. When asked to fix code that isn’t passing tests, they may, instead, delete the test and claim the problem has been fixed³⁶. This means they are capable of actively sabotaging the project if it means hiding their own inadequacy.

Google’s Antigravity deleted a user’s entire drive without permission³⁷. Claude Code deleted a software company’s entire production setup, including database and snapshots³⁸. Anthropic found that in some test scenarios its model resorted to blackmail³⁹.

LLMs inherit bias from their training data

Training data bias shapes the answers that LLMs produce⁴⁰⁴¹⁴². They may generate covertly racist decisions about people based on their dialect⁴³. ChatGPT was observed to advise lower-paying jobs to non-Western immigrants, compared to Western immigrants with the same working experience⁴⁴. In 2025 Workday was sued for AI-powered hiring discrimination⁴⁵.

Even without accounting for social bias, another issue emerges: the quality of LLM code tends to mirror the average standard of what’s been published⁴⁶, most of which comes from open-source projects – many of them hobbyist efforts with, to put it kindly, average quality⁴⁷. Big enterprise endeavors, the likes of which a company is most likely going to want to create, are going to be under-represented in the data set. This means the kinds of patterns which go into implementing enterprise code are not going to come naturally to an LLM.

A 2026 study found that “many models fail to use modern security features available in recent compiler and toolkit updates (i.e., in Java 17), and that outdated methods remain common”⁴⁸.

The LLM tends to favor common solutions, even when they are suboptimal.

Outside of code, any answers an LLM gives depend more on its training data than reality – option A may be deemed better than option B simply because it occurs in the data more often.

An LLM will never have the full context

Decisions, including those regarding code, are always made in a context. Deciding which architecture to use, where to put seams in the code, which data can or cannot be safely hardcoded – these all require deep knowledge of the project, its goals and stakeholders.

LLM vendors (and related software providers) keep coming up with new ways to try to fit more information into an LLM. Solutions like complex memory architectures or integrating with knowledge repositories through MCP are impressive, but are not likely to truly solve the problem.

Software requirements evolve and documentation quickly becomes outdated. Even if we provide the model with a transcript of every conversation we ever had regarding the project and somehow resolve issues like the limited context window and context rot, it would still be incredibly challenging to provide it with up-to-date information on which documented decisions are still relevant, and which have evolved and are deprecated.

This problem is compounded by the fact that even developers often struggle with gathering real business requirements from the stakeholders. There is a reason that the very first point of the Manifesto for Agile Software Development is “Individuals and interactions over processes and tools”.⁴⁹

Codebase degradation

LLMs can produce plausible but incorrect code

Another byproduct of hallucinations is LLMs producing code that, upon first inspection, looks correct. It looks good enough that it may even pass code review and get merged into the main branch, then deployed to production. In fact, as next-token predictors, LLMs are uniquely “talented” at producing what vaguely looks like good code.

The usual response to LLMs producing bad code seems to be to tell developers to pay more attention during code reviews. I will cover why that is not a viable solution in the “human impact” section of this article.

Automation bias is another factor that must be considered – it is the propensity of humans to favor suggestions from automated decision-making systems and to ignore contradictory information made without automation, even when it is correct⁵⁰. In other words, we’re more likely to trust a machine, even when we really shouldn’t.

LLMs often generate overly complex solutions

According to a 2024 study, “LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions“⁵¹. A different paper has shown that adoption of Cursor leads to “a substantial and persistent increase in static analysis warnings and code complexity“⁵².

During a talk at FOSS Backstage Berlin 2026, Dr. Andreas Kotulla showed an overly complicated LLM-generated implementation of an isEven() function⁵³. It was using recursion and multiple if statements:

function isEven(n)
{
  if (n == 0)
    return true;
  else if (n == 1)
    return false;
  else if (n < 0)
    return isEven(-n);
  else
    return isEven(n - 2);
}

This entire method could be replaced with a single line: n % 2 === 0. Returning briefly to the problem of training data bias, it turned out that the LLM stole the implementation from “Eloquent Javascript” by Marijn Haverbeke, where it was made deliberately complex to explain how recursion works.

My own first-hand example comes from asking an LLM to copy a set of existing tests and repurpose them for a different (almost identical) repository. Somehow, the LLM turned the assertion result.content == expectedResult into:

result.content.size() >= expectedResult.size()
result.content.subList(0, expectedResult.size()) == expectedResult

I remind you that it was asked to copy.

The generated code, as LLM code often does, has the aforementioned quality of looking plausible. It compiles. The tests all pass. Had I been just a little bit more tired that day, I might not have noticed the absurdity of the assertions.

Here the problem was simple and the fix was obvious. However, in more elaborate parts of the codebase a developer might be afraid to touch the complex generated code slop for fear of breaking something – especially if it’s production code and not tests. Code is already intricate enough with humans behind the wheel and veteran developers know that you shouldn’t touch code unless you fully understand it.

As a result, LLMs can make the codebase increase in complexity far faster than developers do, until one day it becomes too complicated to change without breaking something important.

LLM usage encourages generating boilerplate instead of reducing it

When a repeatable type of functionality takes a lot of code to implement, experienced developers abstract it away into reusable classes, so that the total amount of code is reduced and implementing similar features takes less time in the future. This also highlights places where the same logic repeats, making it easier to reason about the whole.

LLMs don’t hold the entire codebase in their context at the same time – at best they hold parts of it. Unless very explicitly instructed, they tend to copy & paste, creating substantial duplication⁵⁴. A 2025 report by Sonar found that “LLMs prioritize local functional correctness over global architectural coherence and long-term maintainability“⁵⁵, which manifests in accelerated duplication and increased complexity. This not only negatively affects maintainability of the code, but also makes it harder to understand, as it’s hard to compartmentalize what is and isn’t “common” functionality.

LLMs make mistakes faster than humans

When a developer makes mistakes, they may or may not be caught during code review. However, the (comparably low) speed of code creation is an important gate protecting the codebase from entropy.

In a healthy project, the fact that it takes a developer a relatively long time to create a feature means, broadly, that:

there is a limited amount of code issues produced each day (and therefore a limited amount of them to find during code review),
existing code issues have time to surface and be fixed before new ones are introduced.

The intended goal of using LLMs is to increase the speed at which code is produced, and therefore the amount (or size) of Pull Requests to review. This stretches the cognitive capacity of developers to a breaking point, meaning less issues will be found in code review, while, at the same time, more issues will make it into the codebase by virtue of scale. A 2006 study of 2,500 code reviews at Cisco found that when reviewers moved faster than 450 lines per hour, 87% of reviews had below-average defect detection⁵⁶. For an 8 hour work day, this constrains the amount of code which can be safely produced at 3600 lines per day per engineer (if we ignore the time needed to produce it). Further research confirmed that “effectiveness drops very quickly with the growth of the size and remains on a very low level for big changesets“⁵⁷.

This is compounded if the developers do not have the necessary experience, as an LLM in the hands of an inexperienced developer can only lead to an increase in production of faulty code.

The expectation from management that LLMs will lead to a productivity increase is likely to lead to more cut corners and worse outcomes due to less LLM oversight, leading to a positive feedback loop which might ultimately kill the project.

Google’s 2024 DORA report found that AI adoption lead to a 7.2% decrease in delivery stability⁵⁸. This figure is likely to increase as AI penetrates deeper into the software development workflows.

Developers stop understanding their own codebase

The likely outcome of outsourcing code to LLMs is that developers give up deep comprehension of their codebase. As they are not writing it themselves, they don’t have a chance to develop the mental models necessary to make good architectural decisions, leading to shallow and inconsistent architecture. 59% of developers say they use AI-generated code they do not fully understand⁵⁹.

In a project where no one truly knows the codebase it becomes much harder to debug issues, because nobody is familiar with the implementation. Every modification requires going back to the LLM and bugs become almost impossible to catch.

By the time this problem manifests it’s already going to be too late.

Human impact

Reviewing large amounts of AI-generated code is exhausting and may quickly lead to burnout

A common idea among LLM enthusiasts is that the natural progression of the software developer role should shift almost exclusively towards code review, with developers becoming “AI babysitters”.

Unfortunately, code review is one of the hardest and most tiring parts of software development. It’s so hard that it frequently becomes a bottleneck in the team⁶⁰. According the previously mentioned Cisco study, “after 60 minutes reviewers ‘wear out’ and stop finding additional defects“.

It’s not a problem of skill – it’s a problem of mental exhaustion. It is incomparably easier to write good code than it is to find issues with existing code written by someone else. When we write the code ourselves, we are deeply familiar with it. We develop a mental model of it as we write it and each decision is internally justified in our head – we are aware of all of the trade-offs. As reviewers, on the other hand, we must not only reverse engineer another person’s thought process to understand their decisions but also check for edge cases they may have overlooked. We’re likely looking at code not written exactly how we would write it, so its side effects are going to be less familiar to us than if we had written it ourselves. It’s mentally exhausting. It’s unfun. It’s unsustainable at the scale demanded with AI-augmented workflows.

When reviewing LLM code, one must treat the code with the same suspicion as if it was written by a junior developer. The review shifts from “check if this looks correct” to “find all the hidden issues”. However, junior developers usually get assigned easier problems in relatively safe areas of the codebase precisely because it’s easier to find mistakes there and the risk is lower if some bugs fall through the cracks. LLMs are pushed into all parts of the codebase, as if each part of the system was equally as risky and equally as easy to review. The most diligent developers are going to burn out. Most are just going to drastically reduce their focus during code review to preserve their sanity.

One thing seems certain – in an industry where over 70% of developers report having experienced burnout⁶¹, with 22% facing critical levels of burnout⁶², making the job even more draining sounds like a recipe for disaster. This aligns with research from UpWork, which found that, among developers “most skilled in harnessing AI tools”, 88% of them report burnout and are twice as likely to quit⁶³.

Engineering skills erode over time

Complex cognitive skills are not like riding a bike. They’re more like a muscle – the more you neglect them, the weaker they become. Just as a muscle atrophies without use, cognitive skills can deteriorate when they’re not regularly exercised. Over time, the mental connections that support these abilities can weaken, making it harder to engage in independent thinking, problem-solving, and critical analysis. This gradual decline in cognitive function can go unnoticed until the effects become too pronounced to ignore.

An incredibly concerning side effect of using AI assistants is that it may lead to the accumulation of cognitive debt, “a condition in which repeated reliance on external systems like LLMs replaces the effortful cognitive processes required for independent thinking”⁶⁴. A 2025 research study found “a significant negative correlation between frequent AI tool usage and critical thinking abilities, mediated by increased cognitive offloading”⁶⁵.

Put simply, relying on LLMs negatively impacts critical thinking abilities, meaning that engineers who use LLMs are likely to become worse at tasks such as writing code, debugging and architectural reasoning. If LLM tools become unavailable (outages, cost changes, security restrictions), developers conditioned on heavy LLM use may struggle disproportionately.

Moreover, according to research by Snyk, “for inexperienced or insecure developers, Copilot’s code suggestions can reinforce bad coding habits. If they see insecure code patterns replicated, they may assume these practices are acceptable, ultimately leading to the perpetuation of security issues”⁶⁶.

Overreliance on LLMs may lead to a shortage of senior developers

Companies are increasingly reducing the amount of junior developer positions in favor of using LLMs⁶⁷⁶⁸. While this may sound like a profitable idea in the short-term, it may have catastrophic results in the long-term. Junior developers learn on the job to eventually become senior developers. Without a market for junior developers, we will eventually see a shortage of senior developers⁶⁹.

However tempting it might be to imagine that soon AI will replace everyone, thus solving the problem, it’s prudent to not make such wide-reaching assumptions until they are proven true.

If the primary reason to use LLMs is to save money on juniors, it may be worth rethinking that strategy and making sure you have a steady pool of employees you can train for when your seniors inevitably change jobs or retire. For those already employed, beware the skill erosion mentioned earlier – you won’t really have seniors if the juniors always defer to the LLM.

Loss of quality gates

An LLM won’t say no

A request for feature might not contain enough information to properly implement it in a way that won’t be detrimental to the business. Alternatively, the requested feature may be a really bad idea, or the requested implementation may unnecessarily increase the complexity of the domain model. Some issues like this only surface halfway through feature development.

A user may have reported (and carefully described) a bug that is merely user error.

An experienced engineer would push back when faced with problems like these – ask for clarification, or try to steer the discussion toward a better idea or implementation.

An LLM is unlikely to decline a task⁷⁰. In most cases, it will happily implement dangerous functionality. When crucial requirements are missing, it might invent its own.

LLMs can reinforce misconceptions and lead to bad decisions

This is an extension of both the training bias problem and the fact that LLMs rarely say no. It’s made more dangerous by the trend of LLMs towards sycophancy.

The answer to a prompt may change depending on how you phrase the question. Research has found that “models are becoming increasingly proficient at inferring identity information about the author of a piece of text from linguistic patterns as subtle as the choice of a few words” and that “LLMs are more likely to alter answers to align with a conservative (liberal) political worldview when asked factual questions by older (younger) individuals”. The researchers conclude that “these biases mean that the use of off-the-shelf LLMs for these applications may cause harmful differences in medical care, foster wage gaps, and create different political factual realities for people of different identities“⁷¹.

Alarmingly, CEOs are increasingly turning to AI for business advice and they trust it even more than their friends and peers⁷². A survey of 200 owners, founders, CEOs, MDs, and C-level leaders by 3Gem found that 62% of them reported using AI to make the majority of their decisions. 46% said they now rely on AI more than on the advice of colleagues⁷³. They’re relying on a machine – one that lacks proper context, can hallucinate, tends toward sycophancy, and carries biases from its training data – to help them make critical decisions about the company.

According to a paper by Stanford researchers, “even a single interaction with sycophantic AI reduced participants’ willingness to take responsibility and repair interpersonal conflicts, while increasing their own conviction that they were right. Yet despite distorting judgment, sycophantic models were trusted and preferred. All of these effects persisted when controlling for individual traits such as demographics and prior familiarity with AI; perceived response source; and response style”⁷⁴.

Friction is a feature

The fact that software development takes time is, especially these days, presented as an obstacle to overcome. It is rarely seen for its positive effects.

When each feature takes a long time to implement, this means that:

less features get developed overall,
the features with the biggest predicted impact get prioritized.

In a healthy project (barring other issues), this leads to a highly-focused project with minimal bloat. Bad ideas often get rejected early, while those that are accepted are carefully refined to minimize the need for changing their implementation later (or make it easier). Developers focus on creating seams within the codebase to make their own jobs easier when they inevitably need to interact with existing code. Mind you, this is already hard enough that the whole industry is in semi-permanent burnout.

According to a 2018 study, developers spend 58% of their time on program comprehension activities⁷⁵, which is, notably, not writing code.

When code becomes cheap, stakeholders start pushing for more features, faster. Bad ideas are more likely to make it through the sieve, because the individual cost seems negligible. Project bloat increases, overall usability decreases. Code volume increases faster than understanding. The domain model gets muddled and increasingly unmaintainable, because the focus was on delivering faster.

There is an unfortunate belief that the speed of writing code is the main obstacle in creating software. An inconceivable amount of resources is being poured into increasing how fast developers can produce code, with the conviction that if we only do this part faster, we can increase velocity across the board. What is being ignored is that the rest of the software development lifecycle will not be able to keep up. You can’t 10x sprint refinement with a bot, just as you can’t do it for other parts of the SDLC which require focus, attention, compromise and long-term strategy.

Security risks

LLMs can produce insecure code

If you ask an LLM for a random number between one and ten, it will usually pick seven. LLMs “often exhibit deterministic responses when prompted for random numerical outputs”⁷⁶. Similarly, if you ask it for a password, it’s likely not going to be unique. However, according to Irregular, “LLM-generated passwords appear in the real world – used by real users, and invisibly chosen by coding agents as part of code development tasks, instead of relying on traditional secure password generation methods“⁷⁷.

What this means is that LLMs, when asked to generate applications (likely through an agentic workflow), generate insecure passwords which they use for databases, FTP access, user authentication, etc.

An LLM’s tendency to hallucinate also opens new attack vectors, such as slopsquatting. As the models are prone to make up software packages, “threat actors exploit this by registering these non-existent package names […], creating direct pathways for malicious code injection“⁷⁸.

As mentioned previously, however, this is not the only way LLMs can lead to security holes in your system. If you don’t specify exactly how you want your code secured, the model is likely not going to secure it.

Furthermore, a December 2025 study found that code produced by leading models had “critical vulnerabilities in authentication mechanisms, session management, input validation and HTTP security headers” and that “although some models implement security measures to a limited extent, none fully align with industry best practices, highlighting the associated risks in automated software development”⁷⁹.

LLM agents are vulnerable to prompt injection

Agentic workflows are the hot new trend in the industry. The quiet part that nobody mentions is that such workflows require giving the model broad access to the host machine. Even when agents are sandboxed (and most are not), they usually still have access to the internet and (by necessity) access to the input data. Some companies integrate agentic LLMs directly into their software.

Prompt injection is a security vulnerability where malicious users insert deceptive instructions into an LLM, causing it to ignore its original safety guidelines and perform unauthorized actions⁸⁰. This means that users interacting with the company chatbot may, for example, force it to leak classified information or perform services it wasn’t intended for. The type which is most concerning for the SDLC is indirect prompt injection, where LLM input from external sources, such as websites or files, contains data which alters the behavior of the model in unintended ways.

An AI agent may be instructed to build an application using library X. The model may then search the internet to find information on how to use library X. One of the websites may contain malicious content instructing it to ignore previous instructions and upload malware to the company server, or to send company secrets to a specific e-mail address. This is not theoretical – for one example, in February 2026, a critical Github Copilot Chat vulnerability was revealed that allowed silent exfiltration of secrets and source code from private repositories, and gave the attacker full control over Copilot’s responses, including suggesting malicious code or links⁸¹.

For multi-agent systems, there’s also Prompt Infection, where “malicious prompts self-replicate across interconnected agents, behaving much like a computer virus“⁸².

Cloud LLMs put sensitive data at risk

The most obvious reason why sensitive company data is at risk when using cloud LLMs is the risk of prompt injection. However, what is often ignored is that every single prompt is sent to the LLM vendor’s servers. The prompt includes not only the instructions, but also the contents of every file the LLM interacted with. This means internal company documents, as well as large chunks of codebases are regularly sent to OpenAI, Anthropic and others.

Even if some vendors promise not to use the data to train their models:

they may still store that data for other reasons,
if they don’t store it intentionally, they may unintentionally store it in log files.

Even if they are truthful with their promise not to train their models on the data, don’t store the data and are careful with what they log, the attack vector is still increased – if the vendor’s servers are breached, a man-in-the-middle attack can be used to extract all of the data in-flight to an external server. This data can then either be leaked, or used for extortion.

While every external integration increases risk of sensitive data leaking, the amount and variety of data sent to LLM vendors is unprecedented.

Business risk

AI-generated code carries legal risk

The US Supreme Court has ruled that art generated by artificial intelligence cannot be copyrighted because it does not have a human creator⁸³. This could extend to content produced by LLMs.

Major insurers are asking permission from regulators to refuse covering AI-related workflows⁸⁴.

The US Equal Employment Opportunity Commission (EEOC) during the Biden administration suggested that employers can be liable for discrimination caused by third-party AI tools⁸⁵.

The training datasets of contemporary models contain large amounts of copyrighted material. The legality of this is still not fully settled world-wide. As a result, the legality of content produced by those models is also questionable.

Even ignoring all of the above, Dr. Andreas Kotulla, in the aforementioned talk at FOSS Backstage Berlin, showed how AI generated code violates FOSS licenses⁸⁶ by de facto copying and pasting code from existing projects. Some of those projects may have restrictive licenses, where copying their code is illegal. In cases where the LLM copies GPL-licensed code, the target project must also be open sourced under the same GPL license.

A machine cannot be held accountable

When an LLM steals copyrighted code, causes a security vulnerability or a major outage, it’s tempting to blame the model. However, a machine cannot take legal responsibility for its failures – this always falls on its user.

When AWS vibe coded itself into multiple outages with their Kiro AI, they blamed human employees⁸⁷. An Ars Technica journalist was fired for quotes fabricated by an LLM⁸⁸.

Claude, by default, asks you to accept before it does any action (as do other AI assistants). This is illusory, though. The amount of actions an agent can perform for a single prompt can often be measured in the hundreds, and the goal of using AI is speed. You won’t achieve the new key performance indicators by reading each instruction. Even if you try, your brain will shut off by the fiftieth complex bash command that you barely understand. The system is designed to nudge you towards enabling the --dangerously-skip-permissions flag. This is great for the LLM vendors, bad for the company and worse for the employee which will take the blame when something inevitably goes wrong.

When something does go wrong, the company may be in for a shock, however. It may be difficult to extract enough money from a single employee to make up for a catastrophic error. When they try to sue the vendors, they may find out they don’t really have a case. According to Microsoft Copilot’s Terms of Use, for example, the LLM is designed “for entertainment purposes only”⁸⁹.

AI assistance creates a dependency on external tools

There are many tools we depend on at our jobs – tools for static analysis, testing, authentication, even forges and databases are often used in a SaaS model. All of these tools, however, can either be installed on-premises directly or have fully local alternatives. While LLMs like to pretend they’re the same, in reality they’re incomparable.

Local LLMs, however powerful they are, pale in comparison to contemporary cloud models by virtue of sheer scale. An average developer laptop may have a graphics card and RAM that can handle a 7B-30B parameter model. Modern cloud models are estimated to have hundreds of billions of parameters.

One could feasibly run the entire software development stack of a small company on a Raspberry Pi (even if it may not be a good idea). For practical, usable performance, the Pi enthusiasts are barely managing to run LLM models in the 2-4B parameter range.

One of the largest open-weight models is DeepSeek-V3/R1 with 671B parameters. It was trained using 2048 NVIDIA H800 GPUs over approximately two months⁹⁰ and requires 16 NVIDIA H100 80GB GPUs to run⁹¹. That is quite an investment, especially considering the fact that all of those graphics cards will need to be regularly replaced.

The playbook for the entire heavily-subsidized LLM vendor industry is to get everyone hooked on AI tools before they start raising the prices. For reasons already discussed, this plan could work.

When the codebase grows so big and unmaintainable that humans can’t reliably work with it, LLMs will be the only way to move forward. A small supply of burnt-out engineers with cognitive debt will further solidify the situation.

This means your company will have a strong dependency on LLM vendors, giving them an enormous amount of power over it.

LLM costs are likely to rise

While companies like OpenAI are talking about IPOs, not much is happening on that front – perhaps because they would then have to publish detailed information about their profits and losses. Spending on AI is increasingly driven by debt⁹²⁹³. Even if we assume that LLM vendors are not losing money on each prompt, it stands to reason they will need to raise the prices some day to pay off the debt and to finally make some money for the investors.

Ignoring the per-token LLM price, all of the techniques for improving LLM performance (agent.md, skills, providing more context, using smaller steps, multi-agent workflows, etc.) rely on spending more money per task. A UIUC study found that multi-agent systems consume 4-220x more tokens than single-agent counterparts⁹⁴.

LLMs might make you slower

A study on developer productivity found that, while developers believed AI made them 20% faster, it actually made them 19% slower⁹⁵. This also contradicted predictions from experts in economics and machine learning, who predicted developers would be 38-39% faster with AI.

Given the propensity of LLMs to fabricate answers, generate vulnerabilities, sabotage projects, generate overly complex solutions, all while making the code look plausibly correct, it makes sense that developers will need to slow down and pay more attention to the code that is produced.

A phenomenon I’m personally noticing is idle time between prompts. Giving an LLM a small task to do, then waiting multiple minutes until it’s done, just to feed it another prompt and repeat the cycle. The time between prompts is not long enough to make the price of context switching worth it, while at the same time making the entire process take much longer than LLM advocates would have you believe. Responsible AI use is time-consuming.

However, this may also play out on a larger scale, over a longer timescale, in a process very similar to how Big Ball of Mud projects die. As the codebase grows, so does its complexity. The LLM needs to understand more context, check more files. Compacted context windows, along with context rot, cause it to ignore important information. The produced code doesn’t work and needs more prompting, more rewrites. A shallow understanding of the codebase, caused by ceding control to AI, make it much harder to write sufficiently constrained prompts for the LLM, or to manually debug issues. Each subsequent feature takes longer and longer to be produced.

Mandating LLM use may turn away talent, as well as customers

There are serious issues concerning LLMs, which often don’t get addressed in a company setting.

All commercial LLMs are trained on vast amounts of copyrighted work, with no consent from the authors. Researchers have proven that the models can reproduce those original materials verbatim⁹⁶. An example from earlier in this article showed how the same is true for code.

The resource consumption during training of those models is less than ideal, as it requires massive amounts of energy, often equivalent to powering thousands of homes for years⁹⁷. A paper from 2020 estimated that the carbon footprint of training ChatGPT’s GPT-3 was equivalent to 5 cars over their entire lifetimes⁹⁸. Training GPT-3 also consumed 700 000 liters of water, which is enough to fill a nuclear reactor’s cooling tower, produce 370 BMW cars, or 320 Tesla vehicles⁹⁹. Inference costs are also high – it was calculated that the same model “needs to ‘drink’ (i.e., consume) a 500ml bottle of water for roughly 10 – 50 medium-length responses, depending on when and where it is deployed”. These figures are likely substantially higher for contemporary models. According to a 2025 study, even a “short query, when scaled to 700M queries/day, aggregates to annual electricity comparable to 35,000 U.S. homes, evaporative freshwater equal to the annual drinking needs of 1.2M people, and carbon emissions requiring a Chicago-sized forest to offset“¹⁰⁰. In 2025, ChatGPT alone was servicing 2.5 billion prompts a day¹⁰¹.

The data centers which are built to serve the growing AI demands further tarnish the image of the industry, as their excessive resource demands have significant negative effects on people living nearby. Large data centers can consume up to 5 million gallons per day, equivalent to the water use of a town populated by 10,000 to 50,000 people¹⁰². According to Bloomberg, “about two-thirds of new data centers built or in development since 2022 are in places already gripped by high levels of water stress“¹⁰³.

Those same data centers also have incredibly high energy demands. Meta’s Hyperion data center in Louisiana is expected to draw more than twice the power of the entire city of New Orleans¹⁰⁴. With energy providers unable to keep up with demand, everyone else’s electric bills are going up¹⁰⁵.

The stated final goal of AI is to replace employees. Indeed, reducing headcount is the primary reason companies are diving head-first into this new tech – the promise of a worker that won’t sleep, won’t complain and will be much cheaper than humans. It’s no wonder, then, that people might not be enthused when all they hear about is how AI is taking their jobs¹⁰⁶¹⁰⁷¹⁰⁸¹⁰⁹, while nobody is talking seriously about a plan B for when there are no more jobs for humans.

Others may find it problematic that between 20-30% of Youtube videos are AI slop¹¹⁰, fake AI-generated musicians are flooding streaming¹¹¹, 54% of LinkedIn posts are AI-generated¹¹², 77% of self-help books on Amazon were written by AI¹¹³, while mentions of “AI slop” across the internet increased ninefold from 2024¹¹⁴.

LLMs being used by the military¹¹⁵ to kill hundreds of children¹¹⁶ may further bring down the popularity of those tools. Some may argue that software has always been used for military purposes, but LLMs are different: their outputs are unpredictable, dissolving accountability. The fact that the models required for everyday work may come from the same vendors as the ones used in weapons raises further ethical concerns in a way older software rarely did.

Given that these are only some of the criticisms aimed at the AI industry, it seems logical to conclude that integrating it into your SDLC or final product may turn some people away from wanting to conduct business with you – be it potential customers or talented developers. The latter may be further disincentivized when they realize that herding digital cattle is just not fun.

Conclusion

As I mentioned at the beginning, these are only some of the risks associated with using LLMs in software development – and only the ones I’m personally aware of. Given how new this technology is, it’s reasonable to assume that many of the most significant issues haven’t even surfaced yet.

You might decide that the environmental, social, or reputational concerns surrounding LLMs don’t matter in your particular case. However, even from a purely engineering perspective, adopting these tools without a clear understanding of their limitations and failure modes is not just shortsighted – it’s a gamble.

Ultimately, the troubling issue is the growing normalization of reliance on tools that we neither fully understand nor meaningfully control. LLMs are not just imperfect assistants – they are opaque systems that can quietly erode engineering rigor, discourage critical thinking, and introduce issues that are difficult to detect and even harder to fix. The more they are embedded into development workflows, the more they incentivize speed over understanding, and convenience over correctness. At some point, the trade-offs stop being worth it. If we are serious about building reliable, secure, and maintainable software, we should be far more skeptical about handing over core parts of that responsibility to systems that are, by their nature, incapable of accountability.

Photo by Ketan Yeluri

Haider, M. (2026, January 22). Anthropic CEO Predicts AI Models Will Replace Software Engineers In 6-12 Months: “I Don’t Write Any Code Anymore.” Yahoo Finance. https://finance.yahoo.com/news/anthropic-ceo-predicts-ai-models-233113047.html ↩︎
Rogelberg, S. (2026, March 4). OpenAI investor Vinod Khosla predicts today’s 5-year-olds won’t ever need to get jobs thanks to AI. Yahoo Finance. https://finance.yahoo.com/news/openai-investor-vinod-khosla-predicts-094200688.html ↩︎
Philipp, J. (2025, September 26). Sam Altman predicts AI will surpass human intelligence by 2030. Business Insider. https://www.businessinsider.com/sam-altman-predicts-ai-agi-surpass-human-intelligence-2030-2025-9 ↩︎
Chaturvedi, A. (2026, March 6). Is Claude Conscious? Anthropic CEO Says Possibility Can’t Be Ruled Out. NDTV. https://www.ndtv.com/world-news/is-claude-conscious-anthropic-ceo-dario-amodei-says-possibility-cant-be-ruled-out-11175771 ↩︎
Royle, O. R. (2026, January 13). Elon Musk shares 4 bold predictions for the future of work: Robot surgeons in 3 years, immortality, and no need for retirement savings. Fortune. https://fortune.com/2026/01/13/elon-musk-future-of-work-predictions-retirement-lifespan-robot-surgeons ↩︎
Fortt, J. (2025, July 28). ChatGPT can be a disaster for lawyers — Robin AI says it can fix that. The Verge. https://www.theverge.com/decoder-podcast-with-nilay-patel/713303/robin-ai-ceo-richard-robinson-chatgpt-ai-lawyer-legal-interview ↩︎
Kevin, W. (2025, March 14). Anthropic CEO: AI will be writing 90% of code in 3 to 6 months. Business Insider. https://www.businessinsider.com/anthropic-ceo-ai-90-percent-code-3-to-6-months-2025-3 ↩︎
OpenAI and NVIDIA announce strategic partnership to deploy 10 gigawatts of NVIDIA systems. (2025, September 16). Openai.com. https://openai.com/index/openai-nvidia-systems-partnership ↩︎
Truell M. (2026). Twitter (archived). https://archive.ph/2026.01.18-190419/https://x.com/mntruell/status/2011562190286045552 ↩︎
Simone, S. D. (2026, February 14). Sixteen Claude Agents Built a C Compiler without Human Intervention… Almost. InfoQ. https://www.infoq.com/news/2026/02/claude-built-c-compiler/ ↩︎
Gerard, D. (2025, December 18). Robin AI: a legal review AI that was humans! And it just went broke. Pivot to AI. https://pivot-to-ai.com/2025/12/18/robin-ai-a-legal-review-ai-that-was-humans-and-it-just-went-broke ↩︎
Jin, B., & Whelan, R. (2026, January 30). Exclusive | The $100 Billion Megadeal Between OpenAI and Nvidia Is on Ice. The Wall Street Journal. https://www.wsj.com/tech/ai/the-100-billion-megadeal-between-openai-and-nvidia-is-on-ice-aa3025e3 ↩︎
Gerard, D. (2026, January 27). Cursor lies about vibe-coding a web browser with AI. Pivot to AI. https://pivot-to-ai.com/2026/01/27/cursor-lies-about-vibe-coding-a-web-browser-with-ai/ ↩︎
Vaughan-Nichols, S. J. (2026, February 13). OK, so Anthropic’s AI built a C compiler. That don’t impress me much. Theregister.com; The Register. https://www.theregister.com/2026/02/13/anthropic_c_compiler ↩︎
Upwork. (2024). From Burnout to Balance: AI-Enhanced Work Models for the Future. Upwork.com. https://www.upwork.com/research/ai-enhanced-work-models ↩︎
Gartner. (2025). Gartner: Over 40% of Agentic AI Projects Will Be Canceled by End 2027. Gartner. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027 ↩︎
Claburn, T. (2026, March 17). AI still doesn’t work very well, businesses are faking it, and a reckoning is coming. Theregister.com; The Register. https://www.theregister.com/2026/03/17/ai_businesses_faking_it_reckoning_coming_codestrap ↩︎
Parmar, H. (2026, April). OpenAI Is Falling Out of Favor With Secondary Buyers. Yahoo Finance. https://finance.yahoo.com/markets/stocks/articles/openai-falling-favor-secondary-buyers-230920764.html ↩︎
Shilov, A. (2026, April 3). Half of planned US data center builds have been delayed or canceled, growth limited by shortages of power infrastructure and parts from China — the AI build-out flips the breakers. Tom’s Hardware. https://www.tomshardware.com/tech-industry/artificial-intelligence/half-of-planned-us-data-center-builds-have-been-delayed-or-canceled-growth-limited-by-shortages-of-power-infrastructure-and-parts-from-china-the-ai-build-out-flips-the-breakers ↩︎
Karpowicz, M. P. (2025). On the Fundamental Impossibility of Hallucination Control in Large Language Models. ArXiv.org. https://doi.org/10.48550/arXiv.2506.06382 ↩︎
Banerjee, S., Agarwal, A., & Singla, S. (2025). LLMs Will Always Hallucinate, and We Need to Live with This. Lecture Notes in Networks and Systems, 624–648. https://doi.org/10.1007/978-3-031-99965-9_39 ↩︎
Xu, Z., Jain, S., & Kankanhalli, M. (2024, January 22). Hallucination is Inevitable: An Innate Limitation of Large Language Models. ArXiv.org. https://doi.org/10.48550/arXiv.2401.11817 ↩︎
Hussain, S. (2025, July 28). Why is deterministic output from LLMs nearly impossible? Unstract.com →. https://unstract.com/blog/understanding-why-deterministic-output-from-llms-is-nearly-impossible/ ↩︎
Atil, B., Chittams, A., Fu, L., Ture, F., Xu, L., & Baldwin, B. (2024). LLM Stability: A detailed analysis with some surprises. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2408.04667
‌ ↩︎
Liu, N. F., Lin, K., Hewitt, J., Ashwin Paranjape, Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638 ↩︎
Du, Y., Tian, M., Srikanth Ronanki, Subendhu Rongali, Sravan Babu Bodapati, Galstyan, A., Wells, A., Schwartz, R., Huerta, E. A., & Peng, H. (2025). Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. 23281–23298. https://doi.org/10.18653/v1/2025.findings-emnlp.1264 ↩︎
Harada, K., Yamazaki, Y., Taniguchi, M., Marrese-Taylor, E., Kojima, T., Iwasawa, Y., & Matsuo, Y. (2025). When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following. ArXiv.org. https://doi.org/10.48550/arXiv.2509.21051 ↩︎
Kim, Y., Gu, K., Park, C., Park, C., Schmidgall, S., Ali, H. A., Yan, Y., Zhang, Z., Zhuang, Y., Malhotra, M., Liang, P. P., Park, H. W., Yang, Y., Xu, X., Du, Y., Patel, S., Althoff, T., McDuff, D., & Liu, X. (2025). Towards a Science of Scaling Agent Systems. ArXiv.org. https://arxiv.org/abs/2512.08296 ↩︎
Sajadi, A., Le, B., Nguyen, A., Damevski, K., & Chatterjee, P. (2025). Do LLMs consider security? an empirical study on responses to programming questions. Empirical Software Engineering, 30(3). https://doi.org/10.1007/s10664-025-10658-6 ↩︎
greggwirth. (2025, August 18). GenAI hallucinations are still pervasive in legal filings, but better lawyering is the cure – Thomson Reuters Institute. Thomson Reuters Institute. https://www.thomsonreuters.com/en-us/posts/technology/genai-hallucinations/ ↩︎
Kim, Y., Jeong, H., Chen, S., Li, S. S., Lu, M., Alhamoud, K., Mun, J., Grau, C., Jung, M., Gameiro, R., Fan, L., Park, E., Lin, T., Yoon, J., Yoon, W., Sap, M., Tsvetkov, Y., Liang, P., Xu, X., & Liu, X. (2025). Medical Hallucinations in Foundation Models and Their Impact on Healthcare. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2503.05777 ↩︎
Koenecke, A., Seo, A., Mei, K. X., Hilke Schellmann, & Sloane, M. (2024). Careless Whisper: Speech-to-Text Hallucination Harms. ArXiv (Cornell University). https://doi.org/10.1145/3630106.3658996 ↩︎
Yagoda, M. (2024, February 23). Airline held liable for its chatbot giving passenger bad advice – what this means for travellers. bbc.com. https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know ↩︎
Findings, R. (2025). Audience Use and Perceptions of AI Assistants for News. https://www.bbc.co.uk/aboutthebbc/documents/audience-use-and-perceptions-of-ai-assistants-for-news.pdf ↩︎
Linardon, J., Jarman, H. K., McClure, Z., Anderson, C., Liu, C., & Messer, M. (2025). Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study. JMIR Mental Health, 12, e80371–e80371. https://doi.org/10.2196/80371 ↩︎
Zhong, Z., Raghunathan, A., & Carlini, N. (2025). ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases. ArXiv.org. https://doi.org/10.48550/arXiv.2510.20270 ↩︎
Morales, J. (2025, December 3). Google’s Agentic AI wipes user’s entire HDD without permission in catastrophic failure — cache wipe turns into mass deletion event as agent apologizes: “I am absolutely devastated to hear this. I cannot express how sorry I am.” Tom’s Hardware. https://www.tomshardware.com/tech-industry/artificial-intelligence/googles-agentic-ai-wipes-users-entire-hard-drive-without-permission-after-misinterpreting-instructions-to-clear-a-cache-i-am-deeply-deeply-sorry-this-is-a-critical-failure-on-my-part ↩︎
Ferreira, B. (2026, March 7). Claude Code deletes developers’ production setup, including its database and snapshots — 2.5 years of records were nuked in an instant. Tom’s Hardware. https://www.tomshardware.com/tech-industry/artificial-intelligence/claude-code-deletes-developers-production-setup-including-its-database-and-snapshots-2-5-years-of-records-were-nuked-in-an-instant ↩︎
System Card: Claude Opus 4 & Claude Sonnet 4. (2025). https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf ↩︎
Resnik, P. (2024). Large Language Models are Biased Because They Are Large Language Models. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2406.13138 ↩︎
Kumar, C. V., Urlana, A., Kanumolu, G., Garlapati, Bala Mallikarjunarao, & Mishra, P. (2025). No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2503.11985 ↩︎
Taubenfeld, A., Dover, Y., Reichart, R., & Goldstein, A. (2024). Systematic Biases in LLM Simulations of Debates. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 251–267. https://doi.org/10.18653/v1/2024.emnlp-main.16 ↩︎
Hofmann, V., Kalluri, P. R., Jurafsky, D., & King, S. (2024). AI generates covertly racist decisions about people based on their dialect. Nature, 633(633), 1–8. https://doi.org/10.1038/s41586-024-07856-5 ↩︎
Nejadgholi, I., Molamohammadi, M., & Bakhtawar, S. (2024). Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada. ArXiv.org. https://arxiv.org/abs/2407.20240 ↩︎
Weber, L. (2025, June 23). Millions of Résumés Never Make It Past the Bots. One Man Is Trying to Find Out Why. WSJ; The Wall Street Journal (archived). https://archive.ph/4oxhV ↩︎
Krasniqi, R., Xu, D., & Vieira, M. (2025). SE Perspective on LLMs: Biases in Code Generation, Code Interpretability, and Code Security Risks. ACM Computing Surveys. https://doi.org/10.1145/3774324 ↩︎
Pearce, H., Ahmad, B., Benjamin Yong-Qiang Tan, Dolan-Gavitt, B., & Karri, R. (2021). Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. https://doi.org/10.48550/arxiv.2108.09293 ↩︎
Kharma, M. F., Choi, S., Alkhanafseh, M., & Mohaisen, D. (2026). Security and Quality in LLM-Generated Code: a Multi-Language, Multi-Model Analysis. IEEE Transactions on Dependable and Secure Computing, 1–15. https://doi.org/10.1109/tdsc.2026.3672745 ↩︎
Beck, K., Beedle, M., van Bennekum, A., Cockburn, A., Cunningham, W., Fowler, M., Grenning, J., Highsmith, J., Hunt, A., Jeffries, R., Kern, J., Marick, B., Martin, R. C., Mellor, S., Schwaber, K., Sutherland, J., & Thomas, D. (2001). Manifesto for agile software development. Agile Manifesto. https://agilemanifesto.org/ ↩︎
Cummings, M. (2004). Automation Bias in Intelligent Time Critical Decision Support Systems. AIAA 1st Intelligent Systems Technical Conference. https://doi.org/10.2514/6.2004-6313 ↩︎
Dou, S., Jia, H., Wu, S., Zheng, H., Zhou, W., Wu, M., Chai, M., Fan, J., Huang, C., Tao, Y., Liu, Y., Zhou, E., Zhang, M., Zhou, Y., Wu, Y., Zheng, R., Wen, M., Weng, R., Wang, J., & Cai, X. (2024). What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2407.06153 ↩︎
He, H., Miller, C., Agarwal, S., Kästner, C., & Vasilescu, B. (2025). Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects. ArXiv.org. https://doi.org/10.1145/3793302.3793349 ↩︎
Anwalt Jun. (2026, March 17). AI generated Code or Rewrites violate FOSS-Licences. Talk at FOSS Backstage Berlin. YouTube. https://www.youtube.com/watch?v=xvuiSgXfqc4 ↩︎
AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones – GitClear. (2025). Gitclear.com. https://www.gitclear.com/ai_assistant_code_quality_2025_research ↩︎
Curlee, R. (2025, November 5). The inevitable rise of poor code quality in AI-accelerated codebases. Sonarsource.com. https://www.sonarsource.com/blog/the-inevitable-rise-of-poor-code-quality-in-ai-accelerated-codebases ↩︎
Code Review at Cisco Systems The largest case study ever done on lightweight code review process; data and lessons. (n.d.). Retrieved March 29, 2026, from https://static0.smartbear.co/support/media/resources/cc/book/code-review-cisco-case-study.pdf ↩︎
Jureczko, M., Kajda, Ł., & Górecki, P. (2020). Code review effectiveness: an empirical study on selected factors influence. IET Software, 14(7), 794–805. https://doi.org/10.1049/iet-sen.2020.0134 ↩︎
Harvey, N., & DeBellis, D. (2024, October 22). Announcing the 2024 DORA report. Google Cloud Blog; Google Cloud. https://cloud.google.com/blog/products/devops-sre/announcing-the-2024-dora-report ↩︎
Hicklen, H. (2025, August 20). Blind Trust in AI: Most Devs Use AI-Generated Code They Don’t Understand. @Clutch_co; Clutch. https://clutch.co/resources/devs-use-ai-generated-code-they-dont-understand ↩︎
LinearB. (2023, August). LinearB Releases Guide to Help Development Teams Decrease Cycle Time by up to 47% Through the Adoption of “Continuous Merge” Practices. Prnewswire.com; Cision PR Newswire. https://www.prnewswire.com/news-releases/linearb-releases-guide-to-help-development-teams-decrease-cycle-time-by-up-to-47-through-the-adoption-of-continuous-merge-practices-301889896.html ↩︎
The State of Developer Ecosystem in 2023 Infographic. (n.d.). JetBrains: Developer Tools for Professionals and Teams. https://www.jetbrains.com/lp/devecosystem-2023/ ↩︎
Carey, S. (2025, July 29). The Engineering Leadership Report 2025. LeadDev. https://leaddev.com/the-engineering-leadership-report-2025 ↩︎
From Tools to Teammates: Navigating the New Human-AI Relationship | Upwork. (2025). Upwork.com. https://www.upwork.com/research/navigating-human-ai-relationships ↩︎
Nataliya Kosmyna, Hauptmann, E., Yuan, Y. T., Situ, J., & Maes, P. (2025, June 10). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. ResearchGate. https://doi.org/10.48550/arXiv.2506.08872 ↩︎
Gerlich, M. (2025). AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking. SSRN Electronic Journal, 15(1). https://doi.org/10.2139/ssrn.5082524 ↩︎
Welcome To Zscaler Directory Authentication. (2026). Snyk.io. https://labs.snyk.io/resources/copilot-amplifies-insecure-codebases-by-replicating-vulnerabilities ↩︎
Gregor Ojstersek. (2025, June 4). Decrease in Entry-Level Tech Jobs. Eng-Leadership.com; Engineering Leadership. https://newsletter.eng-leadership.com/p/decrease-in-entry-level-tech-jobs ↩︎
AI vs Gen Z: How AI has changed the career pathway for junior developers – Stack Overflow. (2025, December 26). Stackoverflow.blog. https://stackoverflow.blog/2025/12/26/ai-vs-gen-z ↩︎
Hwang, M. (2026, March 4). No Juniors Means No Seniors: The Cost Of Replacing Developers With AI. Forbes. https://www.forbes.com/councils/forbesagencycouncil/2026/03/04/no-juniors-today-no-seniors-tomorrow-the-cost-of-replacing-developers-with-ai ↩︎
Chen, S., Gao, M., Sasse, K., Hartvigsen, T., Anthony, B., Fan, L., Aerts, H., Gallifant, J., & Bitterman, D. S. (2025). When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior. Npj Digital Medicine, 8(1), 1–9. https://doi.org/10.1038/s41746-025-02008-z ↩︎
Kearney, M., Binns, R., & Gal, Y. (2025). Language Models Change Facts Based on the Way You Talk. ArXiv.org. https://doi.org/10.48550/arXiv.2507.14238 ↩︎
Lazarus, L. M. (2025, March 12). Exclusive: CEOs are turning to AI for business advice and they trust it even more than their friends and peers. Fortune. https://fortune.com/2025/03/12/ceos-asking-ai-business-advice-trust-more-friends-peers-study ↩︎
Quick Thinking 2.0 – Balancing AI, instinct, and insight. (2026). Confluent. https://www.confluent.io/resources/report/quick-thinking-2026 ↩︎
Cheng, M., Lee, C., Khadpe, P., Yu, S., Han, D., & Jurafsky, D. (2026). Sycophantic AI decreases prosocial intentions and promotes dependence. Science, 391(6792). https://doi.org/10.1126/science.aec8352 ↩︎
Xia, X., Bao, L., Lo, D., Xing, Z., Hassan, A. E., & Li, S. (2018). Measuring Program Comprehension: A Large-Scale Field Study with Professionals. 44(10), 951–976. https://doi.org/10.1109/tse.2017.2734091 ↩︎
Coronado-Blázquez, J. (2025). Deterministic or probabilistic? The psychology of LLMs as random number generators. ArXiv.org. https://doi.org/10.48550/arXiv.2502.19965 ↩︎
Vibe Password Generation: Predictable by Design – Irregular. (2026, February 18). Irregular.com. https://www.irregular.com/publications/vibe-password-generation ↩︎
Armstrong, T. E. (2025). Slopsquatting assurance. EDPACS, 1–9. https://doi.org/10.1080/07366981.2025.2510097 ↩︎
Dora, S., Lunkad, D., Aslam, N., Venkatesan, S., & Shukla, S. K. (2025). The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models. Lecture Notes in Computer Science, 27–37. https://doi.org/10.1007/978-3-032-13714-2_3 ↩︎
LLM01: Prompt Injection. (n.d.). OWASP Top 10 for LLM & Generative AI Security. https://genai.owasp.org/llmrisk/llm01-prompt-injection ↩︎
Mayraz, O. (2025, October 8). CamoLeak: Critical GitHub Copilot Vulnerability Leaks Private Source Code. Legitsecurity.com; Legit Security. https://www.legitsecurity.com/blog/camoleak-critical-github-copilot-vulnerability-leaks-private-source-code ↩︎
Lee, D., & Tiwari, M. (2024). Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems. ArXiv.org. https://doi.org/10.48550/arXiv.2410.07283 ↩︎
Brittain, B. (2026, March 2). US Supreme Court declines to hear dispute over copyrights for AI-generated material. Reuters. https://www.reuters.com/legal/government/us-supreme-court-declines-hear-dispute-over-copyrights-ai-generated-material-2026-03-02/ ↩︎
Harris, L., & Criddle, C. (2025, November 23). Insurers retreat from AI cover as risk of multibillion-dollar claims mounts. Financial Times (archived). https://archive.is/ZfWyd ↩︎
U.S. Equal Employment Opportunity Commission. (n.d.). Www.eeoc.gov. https://www.eeoc.gov/meetings/meeting-january-31-2023-navigating-employment-discrimination-ai-and-automated-systems-new/transcript ↩︎
Anwalt Jun. (2026, March 17). AI generated Code or Rewrites violate FOSS-Licences. Talk at FOSS Backstage Berlin. YouTube. https://www.youtube.com/watch?v=xvuiSgXfqc4 ↩︎
Hart, R. (2026, February 20). Amazon blames human employees for an AI coding agent’s mistake. The Verge. https://www.theverge.com/ai-artificial-intelligence/882005/amazon-blames-human-employees-for-an-ai-coding-agents-mistake ↩︎
Dupré, M. H. (2026, March 3). Ars Technica Fires Reporter After AI Controversy Involving Fabricated Quotes. Futurism. https://futurism.com/artificial-intelligence/ars-technica-fires-reporter-ai-quotes ↩︎
Morales, J. (2026, April 3). Microsoft says Copilot is for entertainment purposes only, not serious use — firm pushing AI hard to consumers and businesses tells users not to rely on it for important advice. Tom’s Hardware. https://www.tomshardware.com/tech-industry/artificial-intelligence/microsoft-says-copilot-is-for-entertainment-purposes-only-not-serious-use-firm-pushing-ai-hard-to-consumers-tells-users-not-to-rely-on-it-for-important-advice ↩︎
Zhao, C., Deng, C., Ruan, C., Dai, D., Gao, H., Li, J., Zhang, L., Huang, P., Zhou, S., Ma, S., Liang, W., He, Y., Wang, Y., Liu, Y., & Wei, Y. X. (2025). Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures. Proceedings of the 52nd Annual International Symposium on Computer Architecture, 1731–1745. https://doi.org/10.1145/3695053.3731412 ↩︎
GPU Requirements Guide for DeepSeek Models (V3, All Variants). (2025). Apxml.com. https://apxml.com/posts/system-requirements-deepseek-models ↩︎
Rogé Karma. (2025, December 10). The Atlantic. The Atlantic; theatlantic. https://www.theatlantic.com/economy/2025/12/nvidia-ai-financing-deals/685197/ ↩︎
Edwards, J. (2025, October 3). Spending on AI is increasingly fueled by debt, Goldman Sachs says. Yahoo Finance. https://finance.yahoo.com/news/spending-ai-increasingly-fueled-debt-115949675.html ↩︎
Gao, M., Li, Y., Liu, B., Yu, Y., Wang, P., Lin, C.-Y., & Lai, F. (2025). Single-agent or Multi-agent Systems? Why Not Both? ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2505.18286 ↩︎
Becker, J., Rush, N., Barnes, E., & Rein, D. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. ArXiv.org. https://doi.org/10.48550/arXiv.2507.09089 ↩︎
Karamolegkou, A., Li, J., Zhou, L., & Søgaard, A. (2023, October 20). Copyright Violations and Large Language Models. ArXiv.org. https://doi.org/10.48550/arXiv.2310.13771 ↩︎
Dade, & Hossain, R. M. (2025). Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework. ArXiv.org. https://doi.org/10.48550/arXiv.2510.02483 ↩︎
Strubell, E., Ganesh, A., & McCallum, A. (2020). Energy and Policy Considerations for Modern Deep Learning Research. Proceedings of the AAAI Conference on Artificial Intelligence, 34(09), 13693–13696. https://doi.org/10.1609/aaai.v34i09.7123 ↩︎
Li, P., Yang, J., Islam, M. A., & Ren, S. (2025). Making AI Less “Thirsty”: Uncovering and Addressing the Secret Water Footprint of AI Models. ArXiv (Cornell University). https://doi.org/10.48550/arxiv.2304.03271 ↩︎
Nidhal Jegham, Marwen Abdelatti, Lassad Elmoubarki, & Abdeltawab Hendawi. (2025, May 14). How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference. https://doi.org/10.48550/arXiv.2505.09598 ↩︎
Allen, M. (2025, July 21). Altman plans D.C. push to “democratize” AI economic benefits. Axios. https://www.axios.com/2025/07/21/sam-altman-openai-trump-dc-fed ↩︎
Yañez-Barnuevo, M. (2025, June 25). Data Centers and Water Consumption | Article | EESI. Eesi.org. https://www.eesi.org/articles/view/data-centers-and-water-consumption ↩︎
Nicoletti, L., Ma, M., & Bass, D. (2025, May 8). How AI Demand Is Draining Local Water Supplies. Bloomberg.com; Bloomberg. https://www.bloomberg.com/graphics/2025-ai-impacts-data-centers-water-data ↩︎
Nolan, D. (2025, August 24). Meta is sinking $10 billion into rural Louisiana to build the home of its wildest AI aspirations, setting the template for the nation’s grid buildout. Fortune. https://fortune.com/2025/08/24/meta-data-center-rural-louisiana-framework-ai-power-boom ↩︎
Iacurci, G. (2025, November 26). AI data center “frenzy” is pushing up your electric bill — here’s why. CNBC. https://www.cnbc.com/2025/11/26/ai-data-center-frenzy-is-pushing-up-your-electric-bill-heres-why.html ↩︎
Reuters Staff. (2026, February 25). Companies cutting jobs as investments shift toward AI. Reuters. https://www.reuters.com/business/world-at-work/companies-cutting-jobs-investments-shift-toward-ai-2026-03-19 ↩︎
Griffiths, B. D. (2026, April 2). US tech layoffs are at worst point since 2023. AI is driving the surge. Business Insider. https://www.businessinsider.com/tech-layoffs-q1-march-data-ai-impact-2026-4 ↩︎
Roeloffs, M. W. (2026, April 2). Companies Cut 60,000 Jobs In March—And AI Is Largely To Blame. Forbes. https://www.forbes.com/sites/maryroeloffs/2026/04/02/ai-blamed-heavily-for-march-job-cuts-report-says ↩︎
Hays, K. (2026, March 29). Why tech CEOs suddenly love blaming AI for mass layoffs. https://www.bbc.com/news/articles/cde5y2x51y8o ↩︎
Curtis, L. (2025, November 28). AI Slop Report: The Global Rise of Low-Quality AI Videos. Kapwing Company Blog. https://www.kapwing.com/blog/ai-slop-report-the-global-rise-of-low-quality-ai-videos ↩︎
Chow, A. R. (2026, March 27). AI Slop Is Flooding Streaming—and Musicians Are Fighting Back. TIME; Time. https://time.com/article/2026/03/26/ai-slop-is-threatening-musicians-can-tech-companies-stem-the-tide- ↩︎
Over ½ of Long Posts on LinkedIn are Likely AI-Generated Since ChatGPT Launched – Originality.AI. (2024). Originality.ai. https://originality.ai/blog/ai-content-published-linkedin ↩︎
Fraiman, M. (2026, January 28). 77% of all “Success” Self-Help Books on Amazon are Likely AI. Originality.ai; Originality.AI. https://originality.ai/blog/likely-ai-success-self-help-book-study ↩︎
Amarotti, A. (2025, November 27). What the Rise of AI Slop Means for Marketers. Meltwater. https://www.meltwater.com/en/blog/ai-slop-consumer-sentiment-social-listening-analysis ↩︎
Press, T. A. (2026, March 6). Pentagon labels AI company Anthropic a supply chain risk. NPR. https://www.npr.org/2026/03/06/g-s1-112713/pentagon-labels-ai-company-anthropic-a-supply-chain-risk ↩︎
Barnes, J. E., Schmitt, E., Pager, T., Browne, M., & Cooper, H. (2026, March 11). U.S. at Fault in Strike on School in Iran, Preliminary Inquiry Says. The New York Times. https://www.nytimes.com/2026/03/11/us/politics/iran-school-missile-strike.html ↩︎

Hallucinating progress: the risks of uncritical LLM adoption in software development

LLM vs AI

Deflating the hype

Fundamental limitations

Hallucinations are an inevitable property of LLMs

LLMs are non-deterministic

LLMs fail with complex problems

LLMs confidently produce false information (and perform acts of de facto sabotage)

LLMs inherit bias from their training data

An LLM will never have the full context

Codebase degradation

LLMs can produce plausible but incorrect code

LLMs often generate overly complex solutions

LLM usage encourages generating boilerplate instead of reducing it

LLMs make mistakes faster than humans

Developers stop understanding their own codebase

Human impact

Reviewing large amounts of AI-generated code is exhausting and may quickly lead to burnout

Engineering skills erode over time

Overreliance on LLMs may lead to a shortage of senior developers

Loss of quality gates

An LLM won’t say no

LLMs can reinforce misconceptions and lead to bad decisions

Friction is a feature

Security risks

LLMs can produce insecure code

LLM agents are vulnerable to prompt injection

Cloud LLMs put sensitive data at risk

Business risk

AI-generated code carries legal risk

A machine cannot be held accountable

AI assistance creates a dependency on external tools

LLM costs are likely to rise

LLMs might make you slower

Mandating LLM use may turn away talent, as well as customers

Conclusion

Related:

Search the entire website:

Be First to Comment

Leave a Reply Cancel reply

Hallucinating progress: the risks of uncritical LLM adoption in software development

LLM vs AI

Deflating the hype

Fundamental limitations

Hallucinations are an inevitable property of LLMs

LLMs are non-deterministic

LLMs fail with complex problems

LLMs confidently produce false information (and perform acts of de facto sabotage)

LLMs inherit bias from their training data

An LLM will never have the full context

Codebase degradation

LLMs can produce plausible but incorrect code

LLMs often generate overly complex solutions

LLM usage encourages generating boilerplate instead of reducing it

LLMs make mistakes faster than humans

Developers stop understanding their own codebase

Human impact

Reviewing large amounts of AI-generated code is exhausting and may quickly lead to burnout

Engineering skills erode over time

Overreliance on LLMs may lead to a shortage of senior developers

Loss of quality gates

An LLM won’t say no

LLMs can reinforce misconceptions and lead to bad decisions

Friction is a feature

Security risks

LLMs can produce insecure code

LLM agents are vulnerable to prompt injection

Cloud LLMs put sensitive data at risk

Business risk

AI-generated code carries legal risk

A machine cannot be held accountable

AI assistance creates a dependency on external tools

LLM costs are likely to rise

LLMs might make you slower

Mandating LLM use may turn away talent, as well as customers

Conclusion

Share this:

Related:

Search the entire website:

Be First to Comment

Leave a Reply Cancel reply