Introduction: The Dream of Autonomous Programming
When the first version of Devin was released over a year ago, rumors quickly spread that software development was over, or rather, that software developers were. The idea that programming would be automated away by LLMs and programming agents became widespread. Around the same time, Jensen Huang, CEO of Nvidia, suggested that children no longer need to learn programming. Leading companies have also signaled an interest in reducing their reliance on developers; Salesforce, for example, announced that it would not be hiring new developers in 2025.
As someone passionate about automation and eager to expand my own capabilities as a developer, I have spent the last six months investigating and working on AI-generated code. During this time, some colleagues and I have been building a solution that reflects our vision for the future of programming. This has given us extensive hands-on experience with code generation, and we have watched major players, such as Microsoft with GitHub Copilot, converge on similar ideas.
This post walks through the technical and conceptual challenges we have faced while building this AI code generation tool. If you use Cursor, Bolt.new, Microsoft Copilot, Aider, SuperMaven, or any similar tool, it will give you a deeper understanding of why these tools behave the way they do, so you can use them more effectively.
Motivation and Vision for Code Generation and Programming Agents
Like many, I want more time, or to achieve more in less time, or even better, to accomplish a lot in an almost imperceptible amount of time. I believed that generating code in a 100% automatic way was impossible until I discovered WebSim.ai about eight months ago. For the first time, I saw a platform capable of generating websites using only natural language as an interface. I thought, if a website can create other websites automatically, then I could have a developer that programs for me to help with my personal projects. This was the initial spark that led me to create Pluscoder.
My vision is based on the following principles:
- We want to remove the human from the equation of code generation.
- There is implicit knowledge/learning/context that humans have, and machines do not (yet). For example, in a fast-paced startup, developers may prioritize quick results over perfectly optimized code.
- LLMs in the future will be of much higher quality, extremely inexpensive, fast, and have longer context windows.
- There will always be necessary restrictions and guidelines for correctly proposing code solutions.
This can be summed up as: "in the future, we will be able to create and maintain specialized software 100% autonomously and at a low cost." How true is this statement today, in February 2025?
Challenges: The Dream Versus Reality in Code Generation
The idea of generating code autonomously is fascinating, but the reality is that this field is just beginning, with a long way to go. Over the past few months, we have encountered a series of technical and conceptual challenges that highlight the current limitations of programming agents. These challenges reveal the inherent difficulties of using AI for code generation, at least as of today.
Context
We must be able to provide adequate context to an LLM or Agent so it can generate solutions based on our expectations. This context may include the repository (or parts of it), documentation, images or diagrams, external repositories, libraries, guidelines/standards, issues, commit history, etc.
- Selecting the Right Context:
Ideally, we only want to provide the relevant context to solve a problem. But how do we determine what is relevant? How do we ensure that everything needed is included?
Typically, the user selects the context, for example by specifying a set of repository files. Selection can also be automated with search algorithms, or a mix of both. We tested different RAG algorithms as well as semantic search combined with BM25 keyword matching; the best results came from IterRAG+BM25 (iterative RAG combined with BM25), but all of this is constantly evolving. A minimal retrieval sketch appears at the end of this point.
Using our own AI tool, as well as market solutions like GitHub Copilot and Cursor, we encountered situations where the AI could not load all the context needed to solve the problem. In such cases, the user must iterate with the AI to refine the context or settle for an imprecise response due to missing context. But what if the user lacks the knowledge to correct the AI?
On the other hand, when too much context is provided, we encounter the well-known needle-in-a-haystack problem, where models may perform worse when given large amounts of context while trying to retrieve specific information.
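For illustration, here is a minimal sketch of the lexical half of such a retrieval step: ranking repository files with BM25 and returning the top candidates. It assumes the `rank_bm25` package and plain whitespace tokenization; the embedding-based and iterative parts of a hybrid setup are only indicated in a comment, not implemented.

```python
# Minimal sketch: rank repository files with BM25 to pick context for the LLM.
from pathlib import Path
from rank_bm25 import BM25Okapi

def select_context(repo_root: str, query: str, top_k: int = 5) -> list[str]:
    paths = [p for p in Path(repo_root).rglob("*.py") if p.is_file()]
    corpus = [p.read_text(errors="ignore") for p in paths]
    tokenized = [doc.lower().split() for doc in corpus]

    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())

    # A hybrid setup would blend these lexical scores with embedding
    # similarity, then iterate (retrieve -> ask the LLM what is still
    # missing -> retrieve again) until the context looks sufficient.
    ranked = sorted(zip(paths, scores), key=lambda x: x[1], reverse=True)
    return [str(p) for p, _ in ranked[:top_k]]

# Example: select_context(".", "where is the retry logic for HTTP requests?")
```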
- Behavior in Long Interactions:
One of the most common challenges with AI agents is the loss of expected behavior over long interactions. Initially, clear instructions can be set, such as asking the agent to generate and validate a step-by-step plan before writing code. This behavior is usually followed in the first messages but tends to deteriorate as the conversation progresses.
Not only do user instructions fade; system instructions can also lose their weight. This is why reminders are often injected into user messages, either automatically, without the user knowing, or manually by the user when they notice the AI is not behaving as expected (a minimal sketch of this reminder pattern appears after this point).
This behavior is intrinsic to LLMs and will be present in most solutions. Addressing it requires finding ways to reinforce the initial instructions or implementing mechanisms to track and prioritize the "state" of the conversation, such as dynamic instructions or defined workflows that give more control over the interaction.
This challenge underscores the importance of designing effective interactions and tools that minimize context loss in extended interactions, ensuring that the AI remains aligned with the user's expectations.
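As a rough illustration of the reminder mechanism mentioned above, the sketch below appends the original rules to every Nth user message before it is sent to the model. The message format and constants are assumptions for illustration, not taken from any specific tool.

```python
# Minimal sketch: re-inject a reminder of the original instructions every few
# turns so they do not fade in long conversations.
SYSTEM_RULES = "Always propose and validate a step-by-step plan before writing code."

def with_reminder(history: list[dict], user_msg: str, every_n_turns: int = 4) -> list[dict]:
    turns = sum(1 for m in history if m["role"] == "user")
    content = user_msg
    if turns > 0 and turns % every_n_turns == 0:
        # Reminder appended transparently; the user never types it themselves.
        content += f"\n\n[Reminder] {SYSTEM_RULES}"
    return history + [{"role": "user", "content": content}]
```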
- Implicit Knowledge and Context:
A human developer has access to knowledge that goes beyond the code in the repository, such as change history and past decisions. For example, a developer may remember that a file was modified in a previous sprint to optimize performance, whereas an AI, lacking this context, might propose changes that undo that optimization.
Additionally, developers understand design and architectural aspects that are not always documented. This includes the project's vision, client objectives, and user expectations. An AI may be unaware of which libraries are standard within the team or how certain decisions align with the broader product strategy.
The absence of this implicit knowledge in AI presents major challenges. Incorporating elements like architectural documentation and decision history can mitigate this problem, but much of this information still relies on human experience and context, making full automation difficult.
- Lack of Temporal Memory:
A human developer retains long- and medium-term details about project changes, including what was modified, when, and why, allowing them to understand the general direction of design or development. For example, a human may know that certain files were recently refactored to improve efficiency as part of a larger plan for new features.
In contrast, AI lacks such temporal memory. To compensate, additional context could be provided, such as all previous versions of modified files, the most relevant commit diffs, or even a summary of key design events. However, including too much information could overload the AI, while providing too little may lead it to generate solutions that contradict past decisions.
Failing to provide this information can have significant consequences, such as introducing inconsistencies in the code or accidentally undoing recent improvements. Striking a balance in the amount and quality of context is therefore crucial, prioritizing elements like significant change history and key architectural guidelines so the AI can maintain a coherent vision of the project (a minimal sketch of pulling a bounded slice of commit history follows this point).
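As one possible way to hand the model a slice of that temporal memory, the sketch below pulls a bounded summary of recent commits for a file using plain `git log`. The commit count and character cap are arbitrary assumptions chosen to keep the context small.

```python
# Minimal sketch: attach a bounded amount of recent git history to the prompt
# so the model sees why files changed.
import subprocess

def recent_changes(path: str, n_commits: int = 5, max_chars: int = 4000) -> str:
    log = subprocess.run(
        ["git", "log", f"-{n_commits}", "--stat",
         "--format=%h %ad %s", "--date=short", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return log[:max_chars]  # hard cap: too much history overloads the model

# The returned summary can be prepended to the file contents in the prompt,
# e.g. "Recent history of auth.py:\n" + recent_changes("auth.py")
```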
- Managing Change History:
Providing the entire history of changes to the LLM preserves the direction of past adjustments, but this quickly saturates the context in projects with large files or frequent modifications. Additionally, keeping multiple obsolete versions in the conversation history occasionally causes AI models to generate changes based on outdated code rather than the most recent version.
As an alternative, the AI can be given only the latest version of modified files. While efficient in terms of token usage and avoiding ambiguity caused by outdated code, this approach loses temporal context and change tracking, potentially leading to solutions disconnected from the project's evolution.
An intermediate approach would be to include only the most relevant diffs instead of full files. However, this method can still present problems, as the AI might suggest changes based on outdated versions. This issue becomes even more complex when working with agents that delegate tasks to each other, amplifying errors if the history is not managed properly.
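A rough sketch of this pruning idea follows: before each model call, older snapshots of a file are dropped from the conversation so the model only ever edits the latest version. The message structure (a `file` key on snapshot messages) is an assumption for illustration, not how any particular tool stores its history.

```python
# Minimal sketch: keep only the most recent snapshot of each file in the
# conversation, so the model never generates changes against outdated code.
def deduplicate_file_context(messages: list[dict]) -> list[dict]:
    latest: dict[str, int] = {}
    for i, msg in enumerate(messages):
        if "file" in msg:
            latest[msg["file"]] = i  # remember only the last occurrence

    pruned = []
    for i, msg in enumerate(messages):
        if "file" in msg and latest[msg["file"]] != i:
            continue  # outdated snapshot: drop it
        pruned.append(msg)
    return pruned
```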
Instructions and Constraints
Using the right context is only the first step in obtaining responses aligned with our expectations. The instructions we give AI agents and the expected output based on the provided context determine the true value of the results.
- Instruction Saturation:
There are several recommended structures for system prompts, along with various strategies suggested by the AI providers themselves. For example, Anthropic has its own Prompt Engineering Guide, which recommends including a role, main instructions, output format, preconditions, example responses, and tone or style, among other elements. Each of these elements is a separate instruction, and typically, we expect the AI to follow all of them simultaneously to produce a satisfactory response.
Generating code with an LLM involves multiple layers of instructions: those defined at the system level and those provided by the user during interaction. System instructions, such as "follow a specific coding standard" or "define a plan before writing changes," serve as a foundation to guide the model's behavior. However, these instructions can lose their impact as additional guidelines are added throughout the interaction or within the system prompt itself.
From the user’s perspective, every new request or adjustment adds another layer of complexity. For example, asking the AI to follow security guidelines, use a specific documentation style, and integrate with existing systems may result in the AI partially following some instructions while ignoring others. This saturation occurs because LLMs prioritize the most recent instructions, reducing the effectiveness of earlier system or user-defined instructions.
This challenge is amplified in complex tasks, where instruction precision and context clarity are essential. An excess of directives can confuse the model, leading to incomplete or inconsistent solutions. Designing structured workflows with hierarchical and prioritized instructions is key to reducing saturation and ensuring that generated responses remain coherent and aligned with objectives.
To make the given instructions or constraints precise and to align the expected responses, the few-shot examples technique is often used. However, depending on how it is applied, this technique can reduce the flexibility of models or agents, as discussed in the next point.
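As a concrete illustration, below is a minimal system prompt assembled from the kinds of blocks such guides recommend: a role, numbered instructions, an output format, and a single few-shot example. The wording and XML-style tags are illustrative assumptions, not the prompt of any real product.

```python
# Minimal sketch of a structured system prompt with one few-shot example.
SYSTEM_PROMPT = """\
<role>You are a coding assistant for a Python monorepo.</role>

<instructions>
1. Propose a short plan before writing any code.
2. Follow PEP 8 and the team's existing patterns.
3. Never modify files outside the ones listed in the context.
</instructions>

<output_format>Return the plan first, then one code block per modified file.</output_format>

<example>
user: Add a retry to fetch_user()
assistant: Plan: wrap the HTTP call in a 3-attempt retry loop...
</example>
"""
```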
- Flexibility vs. Specialization:
Providing overly specific instructions to an AI can limit its ability to adapt to different tasks or contexts. For instance, if we design an agent exclusively specialized in writing code, it may struggle with complementary tasks such as creating an implementation plan, designing an architecture diagram, or conducting a high-level brainstorming that does not directly involve generating code. This level of specialization can reduce its usefulness in broader scenarios.
The key is to find the right balance between flexibility and specialization according to the user's needs. Some profiles, such as senior developers, may expect the AI to be adaptable and assist with conceptual tasks, while junior users might prefer it to focus solely on generating and explaining code. A rigid design might only cater to certain user profiles, leading to dissatisfaction, frustration, or limited adoption of the tool.
Additionally, use cases influence this balance. While in highly technical environments, clear specialization may be preferred, other scenarios might require the AI to flexibly handle different levels of abstraction. Designing agents with intentional adaptability and allowing for specific configurations based on the user profile is essential to maximizing their effectiveness and acceptance.
- Instruction Complexity:
When the instructions given to an agent are too complex or consist of multiple steps, it is common for only the first steps to be completed. This issue becomes more pronounced when each step requires an extensive response from the LLM, resulting in a partial response that leaves tasks unaddressed.
To mitigate this problem, some strategies structure instructions as a clear task list executed sequentially: each task is delegated and completed against a predefined criterion before moving on to the next. Combined with a human-in-the-loop model, this approach can be highly effective (a minimal sketch appears after this point). However, it raises a new question: who is responsible for breaking down complex instructions into manageable steps? This work may fall on users, who must also validate the AI's suggested tasks, creating a dependency that moves us away from the ideal of fully autonomous programming.
In some contexts, such as technical debt management, the step-by-step process can be known in advance thanks to prior design. This allows agents to execute repetitive tasks with greater precision, especially when applied to multiple repositories. While this approach is useful, it highlights the importance of clear planning and structuring instructions to maximize the effectiveness of AI-based tools.
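Below is a minimal sketch of that sequential, human-in-the-loop pattern: one focused instruction per model call, with an approval checkpoint between steps. `run_agent` is a hypothetical placeholder for whatever LLM backend is in use.

```python
# Minimal sketch: execute a task list one step at a time with human approval.
def run_agent(task: str) -> str:
    # Hypothetical: call your LLM / agent backend here.
    raise NotImplementedError

def execute_plan(tasks: list[str]) -> list[str]:
    results = []
    for i, task in enumerate(tasks, start=1):
        answer = run_agent(task)          # one focused instruction per call
        print(f"[{i}/{len(tasks)}] {task}\n{answer}\n")
        if input("Accept and continue? [y/N] ").lower() != "y":
            break                          # human-in-the-loop checkpoint
        results.append(answer)
    return results

# Example plan a user (or a planning agent) might supply:
# execute_plan(["List files involved in auth", "Draft the refactor plan", "Apply the changes"])
```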
- Ambiguity in Instructions:
One of the biggest challenges when working with AI agents is the inherent ambiguity in instructions. This issue is closely related to the cognitive bias known as the "curse of knowledge", where we assume our interlocutor has the same context as we do. Without this context, the AI may interpret instructions in unexpected ways, generating results that, while functional, do not align with our expectations.
A useful analogy is giving the same instruction to two junior developers and asking them to solve a problem independently. They would likely arrive at different solutions, both technically correct but with distinct approaches. Now, if we give the same request to a senior developer, we would expect a more optimized solution aligned with specific standards. To avoid ambiguity with AI, we must provide a level of detail equivalent to the implicit knowledge a senior developer would have, including preferred libraries, design patterns, and team standards.
- User Profiles:
The effectiveness of AI in generating solutions is directly influenced by the profile of the user working with it. Users with technical experience, such as developers familiar with the repository or with a clear vision of the project's direction, can provide more precise instructions. These users can point to specific files, relevant methods, or even implementation strategies, minimizing the need for extensive iterations and improving response accuracy.
Conversely, users with less context or technical knowledge tend to give high-level instructions, requiring the AI to take a more proactive role in interpreting and executing tasks. In these cases, it is crucial for the AI to follow internal guides and predefined standards, even if the user does not explicitly mention them. Designing agents that adapt to different experience levels and complement users’ limitations is key to ensuring an effective and satisfactory experience.
Conclusion
This post is just a fraction of a broader analysis of the challenges of AI programming. In future posts, I will cover challenges related to the results of AI-generated code, validation & testing, workflows & orchestration, and standardization for development teams.
From what we have discussed so far, it is evident that we are far from fully autonomous programming, at least for end-to-end solutions. The current limitations in context handling and the need for high-quality instructions suggest that active collaboration between humans and AI remains crucial to making the most of these tools and moving toward that autonomous ideal.