I first started using AI coding assistants in early 2021, with an invite code from a friend who worked on the original GitHub Copilot team. Back then, the workflow was just single-line tab completion, but you could also guide code generation with comments and it’d try its best to implement what you wanted.
Fast forward to 2025. There’s now a wide range of coding assistants that are packed with features. The models have gotten substantially more powerful, and the way I develop with them has changed too. They’re an essential part of coding for me in various contexts.
However, many developers remain skeptical of the utility of AI coding assistants. This is usually because they tried a vague task with a free AI model in the past and noticed incorrect code, hallucinated API calls, or another issue. Others have incorporated these tools, but their use hasn’t resulted in better software. In some cases, the tools have lowered overall productivity because teams had to hunt down bugs where the root cause involved a developer who blindly trusted AI-generated code.
I wrote this post for those who are skeptical or who came away unimpressed. I can’t promise that if you follow what I say you’ll fall in love with AI coding assistants, but I do believe that if you adopt some of the following tips, you’ll come away substantially more impressed than you might be today.
But first: AI assistants are tools, not magic. If there’s one thing to take away from my post, it’s that AI is a tool that requires you to develop skills to wield effectively. If you do not invest in these skills, you will be ineffective using AI for coding.
Use Claude—and pay for it
The first step is to get a good tool. Right now, the best AI model for coding is Claude. Claude has this je ne sais quoi where it doesn’t veer off course much as you use it, the code it writes seems to fit the existing style of your codebase, and it doesn’t make up API calls that don’t exist.
Do not take this recommendation lightly. Claude is the only model I’ve used that I don’t feel the need to second-guess every time. It can still make mistakes, but more often than not, those mistakes happen because I didn’t provide enough context.
If you formed your opinion about AI coding assistants from using free ChatGPT a year ago, you need to radically update your priors. The gap between free AI tools and premium models like Claude has widened substantially. Maybe at some point, another model will dethrone Claude for coding. But for now, it’s the one to use.
Since Claude is a model, not just a chatbot, it can power any number of coding assistants. I personally don’t use those tools much and usually just copy/paste from the Claude web interface, but little of what I’ll describe in this post prohibits the use of other tools.
The kind of code you write matters
AI coding assistants vary wildly in their effectiveness based on the kind of code you’re writing. Generally speaking, I think about this in three ways:
Task commonality
The more common the kind of code you write, the more likely an AI model will do a good job. For example, if you’re writing a Next.js app with Tailwind CSS, most modern models will not only be up to date on the exact patterns that work best, but will also handle older versions. But if you’re writing code for a custom backend with significant domain-specific constraints, you’ll probably find that the AI model struggles to generate code that fits your needs.
This doesn’t mean you can’t be effective with AI and domain-specific constraints. It just means you’ll need to be more explicit about what you want.
Likelihood that similar code is widely available online
If the code you’re writing for your domain has a lot of similar code available online, chances are an AI model will write good code for you without much work.
Web development, relatively boring backend development, mobile development, library development, and “CNCF-adjacent” cloud development (i.e., written in Go and deployed to Kubernetes) are some domains that I’ve found AI models do well with.
Rapid feedback cycles
The easier it is to run and verify code, the more effective AI assistance becomes. This is why tasks with quick feedback loops (like frontend development or unit test writing) tend to work particularly well, but tasks with slower feedback loops (like non-Kubernetes infrastructure code) can be more challenging.
Rethink the decision to use libraries
One unexpected lesson I learned is that because AI makes it very cheap to generate code, I think differently about whether or not to use a library.
I view libraries as falling into one of two categories:
- Libraries that solve genuinely hard problems for me
- Libraries that save me from writing a bunch of code that I don’t want to write
For the first category, I will obviously use a library. There are many libraries that are rigorously tested, have a large community, and are well-maintained. And in fact, AI models like Claude will often recommend using good libraries in the first place.
Sometimes, if a model recommends a library, I’ll interrogate the suggestion to see whether it’s really needed, or whether it would require a lot of code to replicate. In general, it’s a good idea to “question” an AI model’s suggestion. The first suggestion you get may not always be the best one. But importantly, asking an AI model to explain its answer in a critical light often leads to a better suggestion.
All that said, AI has made me completely re-evaluate the second category—where a library is brought in because it saves some code and adds convenience. Because code is so cheap to generate now, I will usually generate the code and tests for it, package it up, and have one less hellish dependency to manage over time. Dependency management is much harder than re-generating code.
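To make that concrete, here’s the kind of convenience code I mean: a small slugify helper plus a test, generated once and owned in the repo instead of pulled in as a dependency. This is only an illustrative sketch; the file names, the helper itself, and the use of Vitest are my assumptions, not anything prescribed.

```typescript
// slugify.ts — hypothetical example of convenience code you might generate
// and own rather than adding a dependency for it.
export function slugify(input: string): string {
  return input
    .normalize("NFKD")               // split accented characters into base + diacritic
    .replace(/[\u0300-\u036f]/g, "") // strip the combining diacritics
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s-]/g, "")    // drop anything that isn't alphanumeric, space, or hyphen
    .replace(/[\s-]+/g, "-")         // collapse whitespace and repeated hyphens
    .replace(/^-+|-+$/g, "");        // trim leading/trailing hyphens
}

// slugify.test.ts
import { describe, expect, it } from "vitest";
import { slugify } from "./slugify";

describe("slugify", () => {
  it("handles punctuation and accents", () => {
    expect(slugify("Héllo, Wörld!")).toBe("hello-world");
  });
  it("collapses repeated separators", () => {
    expect(slugify("  a -- b  ")).toBe("a-b");
  });
});
```

The point isn’t that this particular helper is hard to write; it’s that owning twenty lines like this is often cheaper over time than tracking a dependency’s releases and security advisories.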
A different development loop
With AI assistance, my development loop is more dynamic and iterative than it used to be, and being effective requires several techniques that I’ve picked up over time.
Build durable context
RAG (retrieval-augmented generation) is something most people building with AI have learned is an essential pattern: the quality of an AI integration is a direct function of the relevance, volume, and quality of the contextual data you pass to the model. This is especially true when using AI for real-world coding rather than tiny side projects.
Use projects/rules or paste context into a chat session
I usually start by creating an instruction/rules file or project in the AI tool I’m using. In Claude or ChatGPT, this is literally called “Projects,” and these are a way to provide a bunch of files that make the model produce significantly better code. Cursor and Copilot support custom instructions in various ways too.
Note that depending on your tool, it’s not just written text files that are supported, but also diagrams. I’ve found that diagrams are particularly useful for showing things like architecture or general flows of a working app to contextualize how the code makes something work. You can also provide custom documents that reference other documents, such as a file that explains the names of several screenshots and how they relate to the codebase. Things like Claude Projects will pick this up.
If you’re not using a tool that supports this sort of thing, you can get by with pasting in a single file—preferably following the llms.txt standard—and your single chat session will produce shockingly better code.
Just writing some custom rules or instructions or pasting in an llms.txt file will make a huge difference in the quality of the code you get back. However, you can take this a lot further by cross-referencing important context with the structure (and substructure) of your codebase. Although tools like Cursor and Copilot do support codebase indexing, I’ve found that this only helps with finding likely-relevant code for a task and increasing the likelihood that the generated code can compile—it doesn’t embed actual knowledge of the codebase. To make sure it doesn’t “go off the rails,” you need to provide that context yourself.
Sharing existing code structure and important context
I typically start by dumping the output of tre (a modern take on the tree command) to a text file and including that along with some basic notes about what the major areas of the code do.
For larger projects, I spend some time writing up a proper llms.txt markdown file that describes the high-level goals of the system, key areas of the codebase with important knowledge, and then the full file structure. This is a bit of work, but it’s a one-time cost that pays off in the long run.
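As a rough sketch, such a file might look like the following. The project, paths, and descriptions here are entirely hypothetical, and the exact sections will vary with your codebase; the point is to pair high-level intent with the concrete file layout.

```markdown
# acme-billing

> Internal billing service: ingests usage events, rates them against
> customer plans, and exports invoices to the accounting system.

## High-level goals
- Correctness over throughput: every usage event must be rated exactly once.
- All external calls go through lib/clients/; no raw HTTP in handlers.

## Key areas
- api/: HTTP handlers, kept as a thin layer over services/
- services/rating/: the core rating engine, heavily unit-tested
- lib/clients/: wrappers for the payment and accounting provider APIs

## File structure
(paste the output of tre here)
```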
Get the LLM to update its own context
As you iterate, you’ll find that the LLM needs more pushing to “get things right” with respect to the context you’ve provided. Usually, it will make an assumption about a module or file that isn’t correct. The solution is very simple: program your development environment.
Think of the LLM as a wobbly and weird kind of computer that’s bad (or at least inefficient) at tasks like adding numbers together, but extremely good at being directionally accurate about things. You can guide its force vector towards the particular outcome you want.
Given this, when you want it to have updated context, you can simply ask it to update its own context once you know it got something right. For example:
- If you generated a new Next.js component that does an important thing and it works correctly, ask the LLM to update its context with the new file and a brief description of the purpose it serves (a sketch of such an entry follows this list).
- If you added a new kind of protocol support for a library, ask the LLM to update internal documentation about protocol support.
- If you shuffle data off to a new AWS service, ask the LLM to update its description of the data flow and AWS services used, and for which purpose.
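Concretely, the result of any of these requests is just a small, reviewable addition to your context file. A hypothetical entry for the first example might look like this (all names are made up):

```markdown
## Components
- components/UsageChart.tsx: renders the per-customer usage graph on the
  billing dashboard; expects pre-aggregated daily buckets from the rating
  service, not raw events.
```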
You will still need to review things, but as you iterate and continually have better and more up-to-date context, you’ll find that you don’t need to correct it that often anymore.
Ask for small code changes, not big ones
This is no different from the general software engineering advice you find online about small diffs, small deployments, and so on.
AI models are highly complex systems that often get things wrong when you ask them to do too much at once. The solution is to simply have them do less. Some examples:
- Don’t generate an entire website at once. Generate a single component, then another, then another. Run your app and run tests each time.
- Don’t generate an entire API at once. Generate a single endpoint, work on the conventions that fit best, adjust your data model incrementally, etc.
- Don’t generate an internal library of utility functions at once.
- Don’t generate an entire new feature at once. Start with a task description, then iterate on a spec (see the sketch after this list), then generate some code for the spec, then generate tests, run the tests, generate more code for the spec, etc.
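To give a feel for the spec step, here’s the sort of lightweight, single-endpoint spec I’d iterate on before asking for any code. It continues the hypothetical billing service from earlier, so every name here is made up.

```markdown
## Spec: GET /v1/customers/{id}/usage (draft)
- Returns daily usage buckets for the customer's current billing period.
- Query params: start and end as ISO dates; default to the period boundaries.
- Errors: 404 if the customer does not exist, 422 if start > end.
- Out of scope for now: pagination, CSV export, per-feature breakdowns.
```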
These guidelines are not hard and fast, and in my experience they hold true for agents too. While an agent can do a better job of defining an internal library of utility functions given a desired goal and constraints (or just a spec), I’ve found that agents simply don’t work well with existing codebases and nontrivial tasks. Right now, they seem especially tuned for generating brand-new codebases from scratch, and simple ones at that.
For agents, my intuition so far is that you need to establish firm guardrails around their scope. Think of an agent as helping implement a small ticket end-to-end, with you dictating how it should clarify requirements, test code, and perform the iterative loop. If I don’t do this, the agent will wander off into the wilderness and eventually generate nonsensical, often uncompilable code. You still have to think. Sorry!
Where it’s going from here
I’m genuinely looking forward to revisiting this blog post in a year, now that LLMs have gotten faster and cheaper and agents are popping up. Agents are still really bad at doing what they’re advertised for, but they can be highly useful with some work.
The future I’m looking for is one where programming is lifted up to another level of abstraction. Instead of writing a ton of code, we focus a lot more on:
- What, exactly, we want to build.
- How we define an experiment and measurement criteria for what we build.
- Exploring many different approaches to solving a problem all at once.
- Automatically embedding best practices into the code we produce: robust tests, good observability, internal documentation, and so on.
- Finding out what we don’t know about a problem and how to solve it, and using these tools to facilitate learning about systems and the way they get used.
I think we’re still a long way off from this reality. Even if we assume the current technical hurdles are solved—which is absolutely not a given—there are other systemic hurdles involving human beings and the real world that need addressing.
To me, all signs point towards software engineering changing radically as a profession to be much more oriented around the what and why of software, and much less around the how. This will cause disruption at a massive scale in the long run. But in the short run, it’s a lot of fun to play with these tools and see what they can do.