If you know how to write good code you can force AI to write good code with various techniques. It's 100% doable. You just need to figure out the problems AI has and find solutions that make things easier for it. For example: extremely small contexts.
Modularize into modules with clear boundaries and only allow the AI to work within those boundaries. Keep modules free of IO so they are easily testable. Hide modules behind interfaces, etc. You can write 100 tests that execute within a second. You can write benchmarks, etc. AI needs boundaries and small contexts to work well. If you fail to give it that, it will perform poorly. You are in charge.
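To make the boundary idea concrete, here is a minimal sketch of a pure module hidden behind an interface. Every name in it (`PriceSource`, `LineItem`, `order_total`, `FakePrices`) is hypothetical and only there for illustration; the point is that the logic never touches IO directly, so a test can swap in an in-memory fake and run in well under a second.

```python
from dataclasses import dataclass
from typing import Protocol


class PriceSource(Protocol):
    """Interface boundary: the only way the module sees the outside world."""

    def unit_price(self, sku: str) -> float: ...


@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int


def order_total(items: list[LineItem], prices: PriceSource) -> float:
    """Pure logic: no files, network, or clock; all lookups go through `prices`."""
    return sum(prices.unit_price(item.sku) * item.quantity for item in items)


class FakePrices:
    """In-memory stand-in used by tests instead of a database or HTTP client."""

    def __init__(self, table: dict[str, float]) -> None:
        self._table = table

    def unit_price(self, sku: str) -> float:
        return self._table[sku]


def test_order_total() -> None:
    prices = FakePrices({"apple": 0.5, "pear": 2.0})
    items = [LineItem("apple", 4), LineItem("pear", 1)]
    assert order_total(items, prices) == 4.0
    assert order_total([], prices) == 0


test_order_total()
```

The agent can be pointed at `order_total` and its tests alone; the real `PriceSource` implementation lives in a different file that it does not need in context.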
That doesn't quite work, and precisely for the reason I mentioned: You can definitely tell the AI to follow some strategy, but at some point the strategy will need to change, and the AI won't tell you that (even if you tell it to). Unless you read the code every time you won't know if the AI is following the strategy and producing good results or following it and producing bad results because the strategy has to change. This can happen even in small changes: the AI will follow the strategy even if the change proves it's wrong, and if you don't pay close attention, these mistakes pile up.
So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
Sure, if you carefully review the agent's output, including tests, you can get good results. If you don't carefully review the output, you obviously have no idea if it's good enough for you. The only way to find out is that 30 changes down the line the agent won't be able to change one thing without breaking another, but by then the codebase will be too far gone to fix.
This is essentially true. There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided. The key is that yes, some of the design constraints will morph over time, necessarily, since coding is as often about discovering the problem as solving it. But design principles don’t drift. If you have a design principle that can not be adhered to, it is not a proper principle, it’s an opinion about the problem.
The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code. Notice that there are three separate things written about the code, plus the code itself. Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to sense bad smells before they get set in stone.
It’s a token fire and you need a minimum 250k context model… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and more tested than any code I have ever written before.
> There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided.
Not at this time. Even if you could somehow get their success rate to 90%, it's still far too low because the mistakes can be (and occasionally are) catastrophic. It's only when you review everything that you find mistakes that will bite you down the line. If you don't review everything, you just don't know, but the rate of bad mistakes introduced by the agents is too high to trust, no matter how much prompting and orchestration you do. Maybe future models will address that, but we're not there yet.
> The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code.
That's helpful but it doesn't solve the problem, which is that the agents are happy to introduce horrendous workarounds, and they don't tell you that the code they've written is a horrendous workaround. The docs are fine and reflect the code and the code reflects the strategy, but you just don't know that the strategy is wrong.
I haven’t had this problem. Maybe it’s because of the language I’m using (C++) or maybe it’s because of the strict enforcement of modularity and public vs private interfaces, etc that I use? Also, the code is tested against the hardware with every change. Idk if that’s why my experience has been different from yours or not.
My workflow also requires a discussion of the architecture and methodology of each addition or change, but honestly because we define the interfaces first, and each concern is given its own .c and .h file, it’s very hard to sneak something in without me noticing and calling it out. (Which does happen occasionally)
I suspect that file level granularity may be one of the keys. It never is actually working on more than a couple hundred lines of code at a time, plus interfaces of related files. I end up with a hundred files where I might have had 30 coding by hand, but it is actually easier to reason about the code for me as well, and the number of files is not an issue because of the automation. Total LOC is about the same as I would produce by hand for the same work, which means it’s actually writing less, due to the interface overhead, so I’m pretty stoked about that. The only real nightmare for humans is the long includes.
OTOH if I don’t do all of this it will definitely go off the rails and produce garbage.
I’ve been writing C (and C++) for almost 40 years, and although that doesn’t mean I’m any good, it does mean I have developed a keen sense of smell and highly sensitive olfactory PTSD.
With the right structured environment, a SOTA model with a suspicious seasoned dev holding its hand can be easier to manage and much more productive than a small team. Or, maybe I’ve just sucked so bad my whole life that I can’t tell the difference, but at any rate it works well enough to ship without nightmares, and with fewer bugs and less patching than I had before.
Edit:
I should mention that if bugs get tricky, like hardware idiosyncrasies and things like that, the model just goes nuts. But if I handle it very, very carefully so that it does not try to understand the problem, and I just have it poke the firmware with a stick from a distance enough times and from enough angles, it will usually nail it, as long as I have successfully prevented it from trying to figure out the problem (which is not as easy as it seems like it would be). If it starts to guess, it’s usually best just to roll back the context and start over with the poking (I have a harness so it does direct hardware probes).
There seems to be an analog for this for non-hardware-related issues, but it’s harder to suss out when you should be telling it that you specifically do not want it to attempt to understand or solve the problem until you’ve rigged and tested all of the debug messaging.
I don't think our experience is different. Letting the agent work on pieces no bigger than a couple hundred lines at a time and checking if there's something fishy or not and that the code is legible and logical is close human supervision. This is very much not what the people who wish AI could build products for them do or can do at the rate they're moving.
Lol I guess you’ve got a point, but honestly it’s not more supervision than I would give a junior dev, at least until they had developed at least a few months' track record of good judgement.
I guess the problem is the blind assumption of competence?
I just think of AI as being a lot like my late friend Henry. Henry had several PhDs, was an accomplished polymath in a bunch of other subjects, and spoke more than 20 languages with reasonable fluency. He was for sure one of the smartest people I ever met.
He was also prone to drinking, and when he was on a tear, you could barely tell except he would confidently say some of the most outrageous shit, or start speaking some other language without noticing. So you always took Henry with a grain of salt, and if it was important you’d double check. Even so, he was still an amazing resource to bounce things off of.
30 years of experience writing bad code, with no effort to improve, doesn't make you any good. You need the right attitude and humility to become good.
Some of the worst programmers I have ever worked with had 30+ years of experience. They basically spent all of their time fixing bug after bug in a never-ending cycle because the software they produced was so fragile that it would crash if you just looked at it wrong or the temperature in the room wasn't perfect.
While others with the same number of years of experience had massive systems in production for years with not a single bug reported by the happy users.
I know I got into such development hell myself. Fixing a bug here results in breaking something there. Experience surely helps in avoiding it, but even senior devs can make a mess. Otherwise there wouldn't be so many projects canceled.
So sure, agents can multiply a mess in an amazingly short time, but that is up to the humans guiding them.
That is correct. Using an AI to generate code and then not verify it yourself is IMHO unprofessional and should get you at a minimum a verbal warning. YOU are responsible for the code NOT the AI.
I let agents break things 30 changes down the line. If something breaks, I add a check to my project validator and start over, with the validator providing instructions on what was wrong and how to fix it. It's all automatic, and now I have a guard against the exact same error in the future.
Some of these checks have caught thousands of the same error, even with the latest Opus 4.7 writing the original code.
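For anyone wondering what such a validator might look like, here is a rough sketch, not my actual tool. The two checks and all names (`Check`, `CHECKS`, `validate`) are made up for illustration; the idea is only that every check carries a fix instruction that gets handed back to the agent.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    pattern: re.Pattern
    fix_hint: str  # instruction handed back to the agent on failure


# Hypothetical checks; in practice each one is added after a real failure.
CHECKS = [
    Check(
        name="bare-except",
        pattern=re.compile(r"^[ \t]*except[ \t]*:", re.MULTILINE),
        fix_hint="Catch a specific exception type instead of a bare 'except:'.",
    ),
    Check(
        name="skipped-test",
        pattern=re.compile(r"@pytest\.mark\.skip"),
        fix_hint="Do not skip tests; fix the test or the code under test.",
    ),
]


def validate(source: str) -> list[str]:
    """Return one actionable message per violation found in `source`."""
    messages = []
    for check in CHECKS:
        for match in check.pattern.finditer(source):
            # Line number = newlines before the match start, plus one.
            line_no = source.count("\n", 0, match.start()) + 1
            messages.append(f"line {line_no}: [{check.name}] {check.fix_hint}")
    return messages


sample = "try:\n    run()\nexcept:\n    pass\n"
for message in validate(sample):
    print(message)
```

A non-empty result fails the run and the messages become the next prompt, so the same mistake is caught mechanically from then on.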
To be honest, I am past the point of wanting to convince people that AI is useful. If you want to refuse new tools other people find helpful, your loss.
(Also I stick to the original definition of "vibe coding = not looking at generated code", "LLM assisted coding = verify generated code", I do both, depending on the task)
Down the line the agent is no longer able to fix one failure without causing another and the codebase is unsalvageable, but you may not have reached that point yet.
Agents can help a lot when you carefully review everything they output and find all the time bombs they like hiding in your code and your tests. If not, then they're fine for codebases that don't need to last more than a year or two.
The concept of a small module is an architecture invariant. You’re making that decision, not the LLM. And you’ve made that decision because the machine is not good at certain things. You’re doing that because you can’t trust the LLM to make that decision on its own.
I’m doing it because as a DDD adherent, I’ve been building software that way for 15 years without GenAI and now with GenAI I can do it faster.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
Love the DDD callout. I have explicit steps to review and rate deltas to the ubiquitous language, and one of my architectural reviewers will often engage with me about where the bounded contexts should be and, probably, the translation layers.
I find the more good practices I add to my envision/scope/spec/build/test/deploy loops the happier I am with the outcomes.
I will say that I am finding the actual code to be somewhat ephemeral for me - the more precise the specifications are and generally the tighter and more elegant the design is, the less the code matters as a long term artifact.
I'm not at the "code is assembler" point yet - but I could see that with more, richer specs I could end up there. Of course the specs are then substantial, but declarative specs can be robust and unambiguous (with sufficient red teaming review) and - like domain specific languages - reduce the accidental complexity of the syntax when compared to an implementation in a given language.
There are exceptions to all of this, but it's fascinating to see how it's evolving!
That's not the point, the point is they can generate pretty good code, and do that most of the time, so ask them to generate the code, review it as you would review a more junior teammate or an opensource collaboration from an unknown source, and take advantage of their speed to test everything.
You can't make a great vibe-coded thing that you couldn't make yourself, but you can get pretty much the same code you would have made in a fraction of the time.
I have not discovered anything new. I have applied established architecture and engineering practices that I've been following for 15 years to using Claude Code.
Domain-Driven Design started with Eric Evans's "blue book" in 2003 and continued with Vaughn Vernon's "red book" in 2013. Event Storming workshops and event sourcing data storage have also emerged as important tools.
I am just a practitioner. Eric, Vaughn, Paul Rayner, Alberto Brandolini, and a few other architects lead.
My DevArch guardrails is a toolkit to follow these practices.
I encourage skeptics to look at the Sharpee repo at https://github.com/chicagodave/sharpee and specifically docs/context. Every session summary is there going back months.
Code that will not be able to evolve for more than one or two years is terrible code. Agents write terrible code while doing a truly impressive job hiding it (including in the tests they write) unless, of course, you keep them under very close supervision.
I also find that with every additional "constraint" you add to your context window, the agent gets dumber, and it goes double if your constraint is unusual. To illustrate:
"Do X" - for baseline, assume this generally does X fine.
"Do X, don't use javascript" - even if X already didn't use javascript, this will often perform worse. It will perform even worse if X is difficult or unusual to do without javascript, even if there is some perfectly serviceable way to do it.
Also, despite "don't use javascript", sometimes it just still uses a little bit of javascript anyway, and usually in a spot where it would be extremely annoying/inconvenient to then remove that js yourself (when you would've otherwise reconsidered your approach at a higher level, to either use js, or to just want something different that is easier to do without js).
I feel like there's a limit on constraints that doesn't necessarily follow the context limits. I've assumed this is "attention heads" which I understand are an independent limitation, but I'm not smart enough to understand all the layers involved in these models so I could be wrong there.
I do observe the same thing. There are a limited number of constraints you can add and once you exceed that, you'll play whack-a-mole if you insist on all of them.
This is why I tend toward a more wu-wei attitude to constraints.
For example:
- Do I really need this constraint?
- How does the agent tend to behave in this scenario if it is unconstrained? Is this behavior/result an acceptable pattern for this solution?
- Is the constraint implicitly followed often enough that I can spend tokens recovering from a failed deterministic test that enforces the constraint, rather than preemptively stating it in the prompt?
If I get into the situation where I need more constraints than can fit in context/attention without the need to regularly play whack-a-mole, then I break the module down into sub-modules with fewer, more specific constraints.
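As a sketch of the third question above, here is what a deterministic test for the "don't use javascript" constraint from upthread could look like, so the rule lives in a test rather than in the prompt. The names (`ScriptDetector`, `javascript_violations`) are hypothetical, and the detection is a heuristic that assumes the output is plain HTML; it only looks for `<script>` elements, `on*` attributes, and `javascript:` URLs.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Collects places where JavaScript could run in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.violations.append("<script> element")
        for name, value in attrs:
            if name.startswith("on"):
                self.violations.append(f"inline handler {name} on <{tag}>")
            elif value and value.strip().lower().startswith("javascript:"):
                self.violations.append(f"javascript: URL in {name} on <{tag}>")


def javascript_violations(html: str) -> list[str]:
    """Return a description of every JavaScript entry point found in `html`."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.violations


def test_page_has_no_javascript() -> None:
    page = "<html><body><a href='/next'>next</a></body></html>"
    assert javascript_violations(page) == []


test_page_has_no_javascript()
```

If the agent slips, the failure message names the exact element, and that costs tokens only on the runs where the constraint was actually broken.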
You are never going to get away from reading the code every time. At least I haven't seen how you possibly could. That being said, it is considerably less work to read and check the code than it is to have to build it all, even if you know what you're doing and have done it before.
This is reductive. What you're describing already happened in codebases without AI. LLMs just speed things up because they are a great calculator and not a replacement for human input.
This is actually what I do. I'm extremely picky about the code and force the LLM to rewrite it 1000 times until it is basically exactly what I want. You might be wondering what the point is when it would be faster for me to just write the code myself.
I have ADHD and for whatever reason telling the LLM what to do instead of doing it myself bypasses the task avoidance patterns and/or focus problems I tend to suffer from. I do not find it fun, but I am thankful for it.
I have used LLMs a couple of times to get started on something. I don’t have ADHD, so this is not a regular occurrence for me. But when I have tried this, I have always found the LLM solution so horrible that it instantly inspired me to do it myself. So, in that sense it worked, I got unstuck, but no LLM garbage makes it into the project.
That’s how I use it for writing. I am looking for alternate wording/phrases that I usually don’t use (language habits), alternate takes of any quality just to get myself thinking along different lines, etc.
Rarely do I use what the tool actually spits out. I just use it as a sounding board, like I’m chatting with a (very noob) writer. It doesn’t make me much faster but it helps me break through when I just can’t get words down.
But what if the only real way to break through avoidance patterns is to stop avoiding? What if the tradeoffs of LLMs are instant gratification and further atrophying of your executive functions?
I have gained a paranoid suspicion that our capacity to decrease immediate distress with technology has become so great that we are creating a world where people with certain temperaments can have their personalities become more and more extreme through the assistance of technologies which, for example, decrease the amount of interpersonal interaction required or prevent the need for deep focus.
This framing of it being a tool that you find indispensable as an individual is important. I’m not interested in debating static vs dynamic types, or vim vs emacs, etc. If it works for you, then that’s great!
But the difference with LLMs currently - I guess? - is that non-engineers are pushing the idea that it’s universally indispensable at scale. I think it leads to a lot of emotion bleeding into the debate.
This is what I'm seeing - for people who were slow and didn't possess a lot of depth or breadth, their blast radius and impact have skyrocketed. They can now work in unfamiliar domains quickly, without any knowledge of the nitty gritty details of those domains!
For me personally, it's a tradeoff of generating the first pass code 10x more quickly, but then deeply knowing and validating the code is then 10-20x more work than it would have been if I'd written it myself (and if time is of the essence, then there's the option of shallow validation/understanding in exchange for speed - which is a compromise in rigor and path towards tech debt). In the end, none of this seems like a net win (unless you don't care about quality), and it is much less enjoyable.
TL;DR: While LLMs are faster to spit out first pass code, by the time I've validated and fixed the LLM's first-pass work, I could've had my "by-hand" implementation done correctly, and had much deeper understanding out of the box. Net loss.
Even if it does speed me up, coding with LLMs sucks all the joy out of writing software. Constantly babysitting the agent(s) and reviewing their output carefully is probably more exhausting than just writing things myself.
It is significantly easier to micro-manage an AI than a suite of junior developers. The AI doesn't replace a principal engineer, it's replacing junior and weaker senior developers who need stories broken down extremely concisely to be able to get anything done. In the time it takes to break down a story so that a junior or weak senior developer can pick it up and execute it well, the AI would already be done, with testing built around it.
Juniors learn. Some juniors are potential good seniors. Over time they will internalise good architecture and be able to make good judgments on their own.
Micromanaging LLMs is like having Dory from Finding Nemo as your colleague. You find ways to communicate, but there is no learning going on.
LLMs can learn, just not the same way that juniors do. When an LLM does something wrong you can always update its rules or skills to not make that mistake again. Or you can utilize a subagent whose sole purpose is to review code to prevent that mistake. Lots of ways you can improve LLMs over time.
Of course if you don't provide that feedback loop, no learning happens. I guess the same could be said of a junior, though.
Building larger systems of accountability isn't usually what people mean by learning. And besides, if telling an LLM not to do something were actually reliable, then LLMs would be a lot more useful than they are. And even if that were reliable, then you're just reinventing expert systems, which didn't work.
I'm not sure the point of contention is whether or not an arbitrary language model is capable of understanding new concepts and not making the same mistake again while it is being used.
When people compare LLMs to juniors it's "can I have it do something pretty brain numbing, and when it makes mistakes can I invest time into preventing that from happening again, either systemically or via training?"
IME this is true for LLMs, at least in how my team has been utilizing them. This doesn't make juniors worthless, as they can be useful for things that LLMs aren't good at.
No, but you can fire them. Can you fire an AI you're paying for? Yes, but your options are another AI that is just as bad, or worse.
And it's really someone's fault for hiring a bad junior. Someone did interview them, right? Maybe the person that hired them is the problem. And maybe the person that decided to go all-in on AI is also the same problem.
I think if you tried working with some junior folks, you'd be quite surprised. You know, with at least some of them choosing to use their brains and all.
Honestly, I think so. I do a mix of infrastructure and programming so don't tend to have any frameworks memorized. Using AI is much quicker than constantly referencing the docs.
I can also switch between codebases with different frameworks and languages and make changes without spending all day reading docs.
It's also pretty good at tracing code, and it's fairly straightforward to verify the results manually. It can build a flow diagram in 10-30 minutes (depending on what tool calls need to be allowed and how many prompts it needs) versus me taking a couple hours to do the same.
I don't micromanage it. I let my project's custom linter micromanage it.
Every project should have a custom linter for their tech stack. It would check for not just syntax errors, but architectural choices as well as taste guidelines.
Whenever the LLM writes bad code, I add it to my linter to check against in the future.
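A rough idea of what an architectural lint rule can look like, as opposed to a syntax check. This is a minimal sketch under assumptions: the rule ("domain code must not import IO modules"), the module list, and the function name `forbidden_imports` are all invented for the example, not taken from a real linter.

```python
import ast

# Hypothetical rule: the pure "domain" layer must not import IO modules.
FORBIDDEN_IN_DOMAIN = {"os", "socket", "subprocess", "requests", "sqlite3"}


def forbidden_imports(source: str, filename: str = "<domain module>") -> list[str]:
    """Return a message for every import that crosses the architectural boundary."""
    problems = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in FORBIDDEN_IN_DOMAIN:
                problems.append(
                    f"{filename}:{node.lineno}: domain code must not import "
                    f"'{root}'; pass the dependency in through an interface"
                )
    return problems


example = "import math\nimport subprocess\n"
for problem in forbidden_imports(example, "pricing.py"):
    print(problem)
```

Because it works on the syntax tree rather than on text, the rule is not fooled by comments or strings that merely mention a forbidden module.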
Yeah I agree. It's improved quite a bit just in the past few months. The code should always be reviewed, and you need to spend some time tuning your skills and agent configs. If you're still getting bad code out of your LLM tooling, you might not be using or configuring it correctly.
Sure. That's how I work with AI, and the way I believe that AI is meant to be used -- as a companion tool.
But it's a lot of work. It saves me time for certain tasks, but not others. I haven't measured my productivity gains, but they're at most 2x.
But that's not "vibe coding" (which was the point of the article) or the (false) promise of "10x productivity" and "code that writes itself" that companies are being told is going to reduce their engineering headcount tenfold.
"Force" is often an unrealistic expectation, though. Taking Claude Code as an example: you can add as many rules / guidelines as you want in instruction files, but they will not be followed 100% of the time, and more is not better [1].
You can of course use PreToolUse hooks to block particularly damaging actions of the "rm -rf" variety, but this is also not 100% guaranteed unless you're able to block _all_ ways of performing that damaging action (and you would be surprised: agents will happily write custom python / bash / etc. scripts to do actions you tried to block them from doing!)
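To illustrate why a blocklist alone is leaky, here is a small sketch of the kind of matching logic such a guard might contain. It is not Claude Code's hook protocol (the input and output format of real hooks is not shown here), and `is_blocked` is a hypothetical helper; it only demonstrates that matching on the command text catches the obvious form and misses equivalent ones.

```python
import shlex


def is_blocked(command: str) -> bool:
    """Naive guard: flag `rm` invocations that carry both -r and -f."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "rm":
        return False
    # Merge short options such as "-rf" or "-r -f" into one string of letters.
    short_flags = "".join(
        token[1:] for token in tokens[1:]
        if token.startswith("-") and not token.startswith("--")
    )
    return "r" in short_flags and "f" in short_flags


# The direct form is caught...
print(is_blocked("rm -rf build"))        # True
print(is_blocked("rm -r -f build"))      # True
# ...but equivalent actions through other programs are not.
print(is_blocked("find build -delete"))  # False
print(is_blocked("python -c 'import shutil; shutil.rmtree(\"build\")'"))  # False
```

Closing each gap by hand turns into the same whack-a-mole, which is why sandboxing or restricted file permissions tend to be a sturdier layer than pattern matching on commands.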
Tools help instruct the agent to redo work e.g. to pass linter / formatter checks or relevant tests. But I've also seen them ignore those, often enough to be noticeable: e.g. "17 of 18 tests pass, the other 1 wasn't introduced by this feature" - regardless of whether that's actually true or not, regardless of whether I put "ALWAYS make sure ALL affected tests pass" in an instruction file somewhere.
This isn't to refute your main point: yes, you can improve your chances that AI will write good code. But there is no magic bullet that will force it, 100% of the time, to write good code; this is where vibe coders without requisite coding + engineering skills hit a wall. A multi-layered approach of guidelines + progressive disclosure + tools + hooks indeed reduces the probability of bad code enough to be useful for many engineering tasks.
I completely agree. I have tried N different ways to use AI and the one that really works for me is getting the AI to build, step by step, one modular feature at a time (a method, a basic class, etc.). I then review and fix if necessary. It works really well.
To me it feels like controlling a power tool. These things have a sort of momentum to them, because they do stuff so fast. It's easy to let the tool get out of hand.