With all the hype of Claude Code, Codex, new models, and new tooling, I think it’s important to take a step back and figure out what the current development experience with AI is like.
It can be easy to get lost in the hype cycle especially since Claude Code and other agentic tools can do so much for us already. So much so that the current sentiment seems to be:
- one-shot everything
- why do we even care about code reviews?
- why do we even care about the output code?
- everything is going great
But the truth is more complicated than that.
My Caveats
I had a rough week with Claude Code. One that started incredibly well and devolved into a really rough time. So I want to include some caveats.
One big one is that Claude Code does work surprisingly well. I’ve used it successfully on multiple projects, both personal and business.
Here is a handful of things I’ve seen Claude Code do for me:
- rewrite and redesign my blog with basically no issues
- building a dashboard for me to keep track of my work (with Github, Git, and Ollama integration)
- the agent-compiler project
- build CLI utilities for various functions. Ex:
- todo list TUI
- TV guide
- a Steam-like client for local games
But it’s gone beyond that at work. I can’t speak to the specific features and product updates, but it’s been able to do more than I could have imagined.
With that said, the problems Claude creates are real problems, some that take more time to fix than if I had started raw-coding, hand-coding, from the beginning.
The Scope of The Issues
I think it’s important to understand tangibly what issues with Claude Code-generated code look like. I’ll be specifically addressing problems that are hidden and problems that fester over time.
I’ll go ahead and forego covering clean code (with AI tools this matters a lot less) and covering large-scale issues where the AI does something completely different than you want it to. Those problems are easier to spot or deal with in my opinion.
Testing
One of the biggest issues I’ve found is with testing with Claude Code. Here are a couple of issues I’ve witnessed myself (despite having a CLAUDE.md file trying to prevent this behavior)
CC will mark failing tests to be skipped
Claude Code will actually skip tests that are failing. There are multiple reasons for this behavior. One of them is that sometimes, Claude tries to “focus” on one problem at a time and will mark tests temporarily to be skipped and come back to them later. Unfortunately, between context rot, compacting/clearing context, starting new sessions, etc., Claude sometimes forgets to come back.
Other times, Claude will claim it didn’t cause the test failures and marks them to be skipped for someone else to deal with.
I’ve also seen Claude just say that the tests were out of scope despite being told otherwise.
Example:
// adds skip to avoid failure
it.skip('Critical feature test', () => {}); CC will create weak or vague assertions to ensure tests pass.
Another issue I’ve run into is that Claude Code will deliberately change a solid assertion into a weak one in order to avoid a test failure. Instead of asserting on a value, it may assert that a key exists. Instead of asserting a library works, it may create a mock to “make things work”, even the “things” it’s supposed to be testing.
Example:
# previous strong asssertion
assert result_value == {
"key1": "value1",
"key2": "value2"
}
# new assertion
assert key1 in result_value
assert key2 in result_value CC will mock failing components
Like I mentioned above, Claude Code will sometimes deliberately mock functions or functionality it’s supposed to be testing in order to avoid a failure.
So imagine you’re testing a data processing pipeline with multiple steps and you want to ensure it works end-to-end via a test. Now imagine that a step in the data transformation is failing due to an edgecase. Instead of surfacing the issue to you or addressing it, CC may just mock that “step” in your test to avoid a failure.
One-Time Exceptions
One other type of error is the “one-time exception” error. Similar to how CC may obscure issues in tests, it may obscure issues in the main codebase as well to avoid detecting a larger-scale issue. CC can be fairly blatant about causing these types of issues.
It can be as bad as this:
if (value === 'specific test value') {
return 'expected test value';
} Other times, it can be more tricky to pick these out. CC can debug an issue a number of ways and find the exact weird combination of elements in order to satisfy a requirement in the worst way possible. Imagine the following situation:
const response = await getSomeData();
// through branching logic, CC may narrow down the specific usecase
if (response.type === DataTypes.DEFAULT) {
// CC may even create an "exceptions list" to emulate clean code
if (response.item_list[0].includes(DATA_EXCEPTIONS.EXCEPTION_TRIGGER_PHRASE.ITEM)) {
// and it may create a constants object to solve this problem.
return DATA_EXCEPTIONS.RESPONSE.ITEM;
}
} Descoping or scope change
Next, let’s talk about scoping. There are multiple reasons for why this happens but the gist of it is that Claude Code will happily cut scope, change plans, and completely derail a project. And it will do so often. It’s not a misunderstanding, it will just do its own thing. There are multiple reasons for that:
Plan mode
CC tries to summarize your requirements and put them in a plan, that step will by default modify the initial requirements. Now, most of the time, things are fine when doing so. That’s the point of plan mode. However, CC is happy to sneak things in and drop stuff if it feels like it’s the right course of action.
So when planning, ensure that CC comes up with a solid plan. The worst part is, CC plans can get so complicated, it’s a large overhead trying to read and verify CC isn’t lying to you.
Example:
- you plan to create a google cloud Todo app and provide a PRD
- claude thinks that sync with a document on google drive is too ambitious and marks it that way
- claude mulls over the plan and drops Google Drive
- Claude presents you with a long and complicated plan to approve and you don’t notice Google Drive missing
- claude, even after reviewing finished work, never sees the feature missing
Execution mode
Even with a solid plan in place, CC will drop features during execution. It’ll implement 7 out of 10 requirements and then announce that everything is complete. The worst part is that it doesn’t tell you it dropped anything – it just moves on as if the missing pieces were never part of the plan. And if you call it out, it’ll sometimes reinterpret the original requirement into something simpler that it did accomplish.
This can especially happen during long-running sessions where you may experience context compaction which may throw out requirements, or a missed requirement when you create a checkpoint in your work.
I’ve found that no amount of being careful, writing solid plans, etc. can make this completely avoidable.
Example:
- Claude is building a new blog
- it created a solid plan without any mistakes like above
- Claude works through the plan
- CC encounters a bug with a table of contents feature
- CC encounters a feature request to integrate with Twitter API that it doesn’t know how to solve yet
- CC finishes the rest of the tasks and when providing you with a summary, neglects to tell you it skipped two steps
Typically, in these situations, CC just thinks something isn’t important or achievable. It’s a spiral:
- first, CC deprioritize a tough task to keep up the momentum
- second, CC marks the deprioritization in a plan/GSD/wherever in some way
- third, the next CC session interprets the deprioritization label however it wants to, often deprioritizing further
- fourth, that CC session saves its progress/plan marking the task even lower priority
- cycle continues
Context Rot
Lastly, there’s the context rot. There are different ways that your context can “rot” – or degrade over time. The degradation itself can result in the issues above. Everything from missed requirements, to ignored directions.
Full context
The first way that we can experience context rot is by filling up context. No, not compaction, but actually having a full context. There are articles online that specify that the maximum amount of context fill should be less than 60%. That’s an astounding number. Why such a low number?
It turns out that early context that was loaded loses meaning and importance the more data you pile on. Imagine a first-in no-out stack. The further down the stack we are, the less relevant that text is toward our current operation.
This means that rules, architecture docs, etc. that get loaded when you start a session basically lose meaning by the time Claude Code reads the 50th file when its context is 80% full. It’s one of the reasons you may see performance degradation as your context meter fills up.
Example:
- CC loads your tests/README.md file with all the context it needs to write tests
- CC starts writing tests, encounters issues, debugs, etc.
- CC loads up multiple other test files in the process filling the context
- CC no longer follows the conventions specified in tests/README.md but follows the pattern of the recently loaded files
Context compaction
Context compaction in general is lossy. And it’s hard for it not to be. We’re not just compressing data in some way, we’re paring down on what data Claude Code thinks is important to rebuild a similar context at a lesser token cost. To go from 90% context use to 15%. That means we’re losing something. That something can be decisions made early on, redirection to Claude Code’s approach, or just a note you made that Claude no longer thinks is important.
The worst-case scenario is that CC re-introduces a bug it had already fixed but forgot about it during the session.
Example:
- CC builds context around building TUIs with the Ink library
- CC starts to build a TUI application based on a PRD
- during the process of building it, CC reaches compaction levels
- compaction occurs and CC loses context around which files it marked to come back to
The Secondary Effects
The secondary effects of these drawbacks can be really uncomfortable. The fact is, doing a single task using an AI is slower. However, with AI, we can parallelize our work so while an individual project takes longer, we can get more work done within a specific block of time.
The problem is that due to the aforementioned issues, we end up spending more time on certain code projects/features than we normally would.
More bugs, harder to debug
Due to the dropped issues, hidden problems, etc., we end up with having harder-to-debug bugs. The problem is that if CC couldn’t catch it during review or verification steps, it may not be able to debug those problems once they occur. And even if CC can debug it, there’s no telling how many bugs there will be or how complicated they will be.
As devs work on code, we try to generally solve issues. We don’t hardcode things like if (todo.text == "some testing text") return value_to_pass_tests. We solve problems. CC can’t entirely do that.
Wrong architectural decisions
When CC skips implementing part of its plan, it can make architectural decisions (or micro-architectural decisions) that make it difficult to implement the missing parts. Or impossible to implement it without large-scale changes which, unfortunately, will result in more problems, more dropped features, etc. We’re just pushing another possibly-broken cycle on top of another in hopes of fixing problems.
Time wasted
Ultimately, CC will waste developer time when these issues pop up. The more complicated a project is, the more frequently these issues will occur. And ultimately, we run into a question of ROI. Will we be saving time? Or will we have artificially increased velocity until we hit real problems?
I think it’ll be interesting to look back at this boom in productivity and see if we end up with increased productivity in the long-term or if we’ll see a downward trend as bugs, issues, and problems rear their head as projects mature.
Is this an argument for more robust AI reviewers?
The obvious answer seems to be “just have another AI review the code.” And sure, that helps with surface-level stuff. But the problems I’ve described aren’t syntax errors or style issues – they’re intent problems. An AI reviewer doesn’t know that CC quietly dropped a requirement or snuck in a hardcoded exception. It doesn’t have the context of what you actually wanted or what happened during your conversation with Claude.
However, I think this is one last place where we can make a difference and prevent our AI tooling from completely messing things up. Why an AI reviewer?
- an AI reviewer can ingest your original context
- an AI reviewer will have no context on the conversations leading up to the completion of work but it can check if requirements are satisfied
- an AI reviewer can check for common pitfalls – like skipped tests, special code created to artificially pass tests, etc.
I feel like we’re ALMOST there. We went from autocomplete, to generating code. From generating code, to building entire features and products. Now we need to close the final gap of reviewing generated code to prevent buggy, unstable, and incomplete code from making it through reviews.