Yep. The only people I've heard saying that generated code is fine are those who don't read it.
The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.
Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and that puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.
Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself to fit them when they need to change. If you don't read what the agent does very carefully - more carefully than human-written code, because the agent doesn't complain about contorted code - you will end up with the same "code that devours itself", only you won't know it until it's too late.
If you know how to write good code you can force AI to write good code with various techniques. It's 100% doable. You just need to figure out the problems AI has and find solutions to make it easier for it. For example: extremely small contexts.
Modularize into modules with clear boundaries and only allow the AI to work within those boundaries. Make modules pure from IO so they are easily testable. Hide modules behind interfaces, etc. You can write 100 tests that execute within a second. You can write benchmarks, etc. AI needs boundaries and small contexts to work well. If you fail to give it that, it will perform poorly. You are in charge.
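A minimal sketch of what "pure from IO and hidden behind an interface" could look like, in Python. Every name here (`LineItem`, `Pricing`, `FlatDiscountPricing`, `checkout_total`) is hypothetical and chosen only for illustration; nothing in the comment above prescribes this domain or this layout.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LineItem:
    unit_price_cents: int
    quantity: int


class Pricing(Protocol):
    """The boundary: callers (and the agent) see only this interface."""

    def total_cents(self, items: list[LineItem]) -> int: ...


class FlatDiscountPricing:
    """Pure implementation: no I/O, no clock, no global state."""

    def __init__(self, discount_percent: int) -> None:
        if not 0 <= discount_percent <= 100:
            raise ValueError("discount_percent must be between 0 and 100")
        self._discount_percent = discount_percent

    def total_cents(self, items: list[LineItem]) -> int:
        subtotal = sum(i.unit_price_cents * i.quantity for i in items)
        # Integer arithmetic keeps the result deterministic and easy to test.
        return subtotal * (100 - self._discount_percent) // 100


def checkout_total(pricing: Pricing, items: list[LineItem]) -> int:
    """Code outside the module depends on the interface, not the class."""
    return pricing.total_cents(items)
```

Because nothing in the module touches disk, network, or time, a test is a single function call, which is what makes hundreds of them per second plausible. It does not, by itself, tell you whether the boundary was drawn in the right place.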
That doesn't quite work, and precisely for the reason I mentioned: You can definitely tell the AI to follow some strategy, but at some point the strategy will need to change, and the AI won't tell you that (even if you tell it to). Unless you read the code every time you won't know if the AI is following the strategy and producing good results or following it and producing bad results because the strategy has to change. This can happen even in small changes: the AI will follow the strategy even if the change proves it's wrong, and if you don't pay close attention, these mistakes pile up.
So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
Sure, if you carefully review the agent's output, including tests, you can get good results. If you don't carefully review the output, you obviously have no idea if it's good enough for you. The only way to find out is that 30 changes down the line the agent won't be able to change one thing without breaking another, but by then the codebase will be too far gone to fix.
This is essentially true. There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided. The key is that yes, some of the design constraints will morph over time, necessarily, since coding is as often about discovering the problem as solving it. But design principles don’t drift. If you have a design principle that can not be adhered to, it is not a proper principle, it’s an opinion about the problem.
The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code. Notice that there are three separate things written about the code, plus the code itself. Keeping all of that correct, coherent, and consistent (with a separate, invariant document that describes your design principles) keeps the model from going off the rails and gives ample opportunity to sense bad smells before they get set in stone.
It’s a token fire and you need a minimum 250k context model… but I still get as much work done in an hour as I used to do in a day, and the code I coauthor is better documented, more maintainable, and more tested than any code I have ever written before.
> There are other ways to achieve this goal though, that don’t require exhaustive human review, better models are able to do that part as well if properly guided.
Not at this time. Even if you could somehow get their success rate to 90%, it's still far too low because the mistakes can be (and occasionally are) catastrophic. It's only when you review everything that you find mistakes that will bite you down the line. If you don't review everything, you just don't know, but the rate of bad mistakes introduced by the agents is too high to trust, no matter how much prompting and orchestration you do. Maybe future models will address that, but we're not there yet.
> The main thing that helps me in my workflow is to develop documentation around the code. If the code drifts from the docs, the model will notice and you can decide which was correct, the plan, the maintainer manual, or the code, or the comments in the code.
That's helpful but it doesn't solve the problem, which is that the agents are happy to introduce horrendous workarounds, and they don't tell you that the code they've written is a horrendous workaround. The docs are fine and reflect the code and the code reflects the strategy, but you just don't know that the strategy is wrong.
I haven’t had this problem. Maybe it’s because of the language I’m using (C++) or maybe it’s because of the strict enforcement of modularity and public vs private interfaces, etc that I use? Also, the code is tested against the hardware with every change. Idk if that’s why my experience has been different from yours or not.
My workflow also requires a discussion of the architecture and methodology of each addition or change, but honestly because we define the interfaces first, and each concern is given its own .c and .h file, it’s very hard to sneak something in without me noticing and calling it out. (Which does happen occasionally)
I suspect that file level granularity may be one of the keys. It never is actually working on more than a couple hundred lines of code at a time, plus interfaces of related files. I end up with a hundred files where I might have had 30 coding by hand, but it is actually easier to reason about the code for me as well, and the number of files is not an issue because of the automation. Total LOC is about the same as I would produce by hand for the same work, which means it’s actually writing less, due to the interface overhead, so I’m pretty stoked about that. The only real nightmare for humans is the long includes.
OTOH if I don’t do all of this it will definitely go off the rails and produce garbage.
I’ve been writing C (and C++) for almost 40 years, and although that doesn’t mean I’m any good, it does mean I have developed a keen sense of smell and highly sensitive olfactory PTSD.
With the right structured environment, a SOTA model with a suspicious seasoned dev holding its hand can be easier to manage and much more productive than a small team. Or, maybe I’ve just sucked so bad my whole life that I can’t tell the difference, but at any rate it works well enough to ship without nightmares, and with fewer bugs and less patching than I had before.
Edit:
I should mention that if bugs get tricky, like hardware idiosyncrasies and things like that, the model just goes nuts. If I handle it very, very carefully so that it does not try to understand the problem, and I just have it poke the firmware with a stick from a distance enough times and from enough angles, it will usually nail it, as long as I have successfully prevented it from trying to figure out the problem (which is not as easy as it seems like it would be). If it starts to guess, it’s usually best just to roll back the context and start over with the poking. (I have a harness so it does direct hardware probes.)
There seems to be an analog of this for non-hardware issues, but it’s harder to suss out when you should be telling it that you specifically do not want it to attempt to understand or solve the problem until you’ve rigged and tested all of the debug messaging.
I don't think our experience is different. Letting the agent work on pieces no bigger than a couple hundred lines at a time and checking if there's something fishy or not and that the code is legible and logical is close human supervision. This is very much not what the people who wish AI could build products for them do or can do at the rate they're moving.
Lol I guess you’ve got a point, but honestly it’s not more supervision than I would give a junior dev, at least until they had developed at least a few months' track record of good judgement.
I guess the problem is the blind assumption of competence?
I just think of AI as being a lot like my late friend Henry. Henry had several PhDs, was an accomplished polymath in a bunch of other subjects, and spoke more than 20 languages with reasonable fluency. He was for sure one of the smartest people I ever met.
He was also prone to drinking, and when he was on a tear, you could barely tell except he would confidently say some of the most outrageous shit, or start speaking some other language without noticing. So you always took Henry with a grain of salt, and if it was important you’d double check. Even so, he was still an amazing resource to bounce things off of.
30 years of experience writing bad code, with no effort to improve, doesn't make you any good. You need the right attitude and humility to become good.
Some of the worst programmers I have ever worked with had 30+ years of experience. They basically spent all of their time fixing bug after bug in a never-ending cycle because the software they produced was so fragile that it would crash if you just looked at it wrong or the temperature in the room wasn't perfect.
While others with the same number of years of experience had massive systems in production for years with not a single bug reported by the happy users.
I know I got into such development hell myself. Fixing a bug here results in breaking something there. Experience surely helps in avoiding it, but even senior devs can make a mess. Otherwise there wouldn't be so many projects canceled.
So sure, agents can multiply a mess in an amazingly short time, but that is up to the humans guiding them.
That is correct. Using an AI to generate code and then not verify it yourself is IMHO unprofessional and should get you at a minimum a verbal warning. YOU are responsible for the code NOT the AI.
I let agents break things 30 changes down the line. If something breaks, I add a check to my project validator and start over, with the validator providing instructions on what was wrong and how to fix it. It's all automatic, and now I have a guard against the exact same error in the future.
Some of these checks have caught thousands of the same error, even with the latest Opus 4.7 writing the original code.
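A rough sketch of the "add a check to the validator" loop described above, in Python. The commenter's actual validator is not shown anywhere in the thread, so the registry, the `Finding` record, and the `no_bare_except` rule are all assumptions made up for illustration. The one idea taken from the comment is that each finding carries an instruction that can be fed back to the agent.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    check: str
    line: int
    fix: str  # instruction handed back to the agent


Check = Callable[[str], list[Finding]]
CHECKS: list[Check] = []


def register(check: Check) -> Check:
    """Decorator: every new failure mode becomes one more registered check."""
    CHECKS.append(check)
    return check


@register
def no_bare_except(source: str) -> list[Finding]:
    # Hypothetical rule, e.g. added after an agent silently swallowed an error.
    pattern = re.compile(r"^\s*except\s*:")
    return [
        Finding("no_bare_except", n,
                "Catch a specific exception type and handle or re-raise it.")
        for n, text in enumerate(source.splitlines(), start=1)
        if pattern.match(text)
    ]


def validate(source: str) -> list[Finding]:
    """Run every registered check and collect the findings."""
    findings: list[Finding] = []
    for check in CHECKS:
        findings.extend(check(source))
    return findings
```

A check like this guards against a mistake you have already seen once. It cannot flag a workaround that passes every registered rule, which is the case the replies below are worried about.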
To be honest, I am past the point of wanting to convince people that AI is useful. If you want to refuse new tools other people find helpful, your loss.
(Also I stick to the original definition of "vibe coding = not looking at generated code", "LLM assisted coding = verify generated code", I do both, depending on the task)
Down the line the agent is no longer able to fix one failure without causing another and the codebase is unsalvageable, but you may not have reached that point yet.
Agents can help a lot when you carefully review everything they output and find all the time bombs they like hiding in your code and your tests. If not, then they're fine for codebases that don't need to last more than a year or two.
The concept of a small module is an architecture invariant. You’re making that decision, not the LLM. And you’ve made that decision because the machine is not good at certain things. You’re doing that because you can’t trust the LLM to make that decision on its own.
I’m doing it because as a DDD adherent, I’ve been building software that way for 15 years without GenAI and now with GenAI I can do it faster.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
Love the DDD callout. I have explicit steps to review and rate deltas to the ubiquitous language and one of my architectural reviewers will often engage with me about where the bounded contexts should be and will probably the translation layers.
I find the more good practices I add to my envision/scope/spec/build/test/deploy loops the happier I am with the outcomes.
I will say that I am finding the actual code to be somewhat ephemeral for me - the more precise the specifications are and generally the tighter and more elegant the design is, the less the code matters as a long term artifact.
I'm not at the "code is assembler" point yet - but I could see that with more, richer specs I could end up there. Of course the specs are then substantial, but declarative specs can be robust and unambiguous (with sufficient red teaming review) and - like domain specific languages - reduce the accidental complexity of the syntax when compared to an implementation in a given language.
There are exceptions to all of this, but it's fascinating to see how it's evolving!
That's not the point, the point is they can generate pretty good code, and do that most of the time, so ask them to generate the code, review it as you would review a more junior teammate or an opensource collaboration from an unknown source, and take advantage of their speed to test everything.
You can't make a great vibe-coded thing that you couldn't make yourself, but you can get pretty much the same code you would have made in a fraction of the time.
I have not discovered anything new. I have applied established architecture and engineering practices that I've been following for 15 years to using Claude Code.
Domain-Driven Design started with Eric Evans's "blue book" in 2003 and continued with Vaughn Vernon's "red book" in 2013. Event Storming workshops and event sourcing data storage have also emerged as important tools.
I am just a practitioner. Eric, Vaughn, Paul Rayner, Alberto Brandolini, and a few other architects lead.
My DevArch guardrails is a toolkit to follow these practices.
I encourage skeptics to look at the Sharpee repo at https://github.com/chicagodave/sharpee and specifically docs/context. Every session summary is there going back months.
Code that will not be able to evolve for more than one or two years is terrible code. Agents write terrible code while doing a truly impressive job hiding it (including in the tests they write) unless, of course, you keep them under very close supervision.
I also find that every additional "constraint" you add in your context window, the dumber the agent gets, and it goes double if your constraint is unusual. To illustrate:
"Do x" - for baseline, assume this generally does X fine.
"Do X, don't use javascript". - even if X already didn't use javascript, this will often perform worse. It will perform even _more_ worse if X is difficult or unusual to do without javascript even if there is some perfectly serviceable way to do it.
Also, despite "don't use javascript", sometimes it just still uses a little bit of javascript anyway, and usually in a spot where it would be extremely annoying/inconvenient to then remove that js yourself (when you would've otherwise reconsidered your approach at a higher level, to either use js, or to just want something different that is easier to do without js).
I feel like there's a limit on constraints that doesn't necessarily follow the context limits. I've assumed this is "attention heads" which I understand are an independent limitation, but I'm not smart enough to understand all the layers involved in these models so I could be wrong there.
I do observe the same thing. There are a limited number of constraints you can add and once you exceed that, you'll play whack-a-mole if you insist on all of them.
This is why I tend toward a more wu-wei attitude to constraints.
For example:
- Do I really need this constraint?
- How does the agent tend to behave in this scenario if it is unconstrained? Is this behavior/result an acceptable pattern for this solution?
- Is the constraint implicitly followed often enough that I can enforce it with a deterministic test and spend tokens recovering from the occasional failure, rather than preemptively stating it in the prompt?
If I get into the situation where I need more constraints than can fit in context/attention without the need to regularly play whack-a-mole, then I break the module down into sub-modules with fewer, more specific constraints.
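As a sketch of the "deterministic test instead of a prompt constraint" idea, here is one way the earlier "don't use javascript" example could be checked mechanically, in Python with the standard library's `html.parser`. The function name `javascript_violations` and the three rules are assumptions for illustration. The check also assumes the agent's output is plain HTML, which the thread does not say.

```python
from html.parser import HTMLParser


class _ScriptFinder(HTMLParser):
    """Collects the places where markup pulls in JavaScript."""

    def __init__(self) -> None:
        super().__init__()
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "script":
            self.violations.append("<script> element")
        for name, value in attrs:
            if name.startswith("on"):
                # Treats any on* attribute as an inline event handler.
                self.violations.append(f"inline handler {name} on <{tag}>")
            elif name in ("href", "src") and (value or "").strip().lower().startswith("javascript:"):
                self.violations.append(f"javascript: URL in {name} on <{tag}>")


def javascript_violations(html: str) -> list[str]:
    """Return a description of every JavaScript use found in the markup."""
    finder = _ScriptFinder()
    finder.feed(html)
    finder.close()
    return finder.violations
```

Run as a test after each change, the constraint costs no context until it is actually violated, at which point the failure message becomes the prompt.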
You are never going to get away from reading the code every time. At least I haven't seen how you possibly could. That being said, it is considerably less work to read and check the code than it is to have to build it all, even if you know what you're doing and have done it before.
This is reductive. What you're describing already happened in codebases without AI. LLMs just speed things up because they are a great calculator and not a replacement for human input.
This is actually what I do. I'm extremely picky about the code and force the LLM to rewrite it 1000 times until it is basically exactly what I want. You might be wondering what the point is when it would be faster for me to just write the code myself?
I have ADHD and for whatever reason telling the LLM what to do instead of doing it myself bypasses the task avoidance patterns and/or focus problems I tend to suffer from. I do not find it fun, but I am thankful for it.
I have used LLMs a couple of times to get started on something. I don’t have ADHD, so this is not a regular occurrence for me. But when I have tried this, I have always found the LLM solution so horrible that it instantly inspired me to do it myself. So, in that sense it worked, I got unstuck, but no LLM garbage makes it into the project.
That’s how I use it for writing. I am looking for alternate wording/phrases that I usually don’t use (language habits), alternate takes of any quality just to get myself thinking along different lines, etc.
Rarely do I use what the tool actually spits out. I just use it as a sounding board, like I’m chatting with a (very noob) writer. It doesn’t make me much faster but it helps me break through when I just can’t get words down.
But what if the only real way to break through avoidance patterns is to stop avoiding? What if the tradeoffs of LLMs are instant gratification and further atrophying of your executive functions?
I have gained a paranoid suspicion that our capacity to decrease immediate distress with technology has become so great that we are creating a world where people with certain temperaments can have their personalities become more and more extreme through the assistance of technologies which, for example, decrease the amount of interpersonal interaction required or prevent the need for deep focus.
This framing of it being a tool that you find indispensable as an individual is important. I’m not interested in debating static vs dynamic types, or vim vs emacs, etc. If it works for you, then that’s great!
But the difference with LLMs currently - I guess? - is that non-engineers are pushing the idea that it’s universally indispensable at scale. I think it leads to a lot of emotion bleeding into the debate.
This is what I'm seeing - for people who were slow and didn't possess a lot of depth or breadth, their blast radius and impact have skyrocketed. They can now work in unfamiliar domains quickly, without any knowledge of the nitty gritty details of those domains!
For me personally, it's a tradeoff of generating the first pass code 10x more quickly, but then deeply knowing and validating the code is then 10-20x more work than it would have been if I'd written it myself (and if time is of the essence, then there's the option of shallow validation/understanding in exchange for speed - which is a compromise in rigor and path towards tech debt). In the end, none of this seems like a net win (unless you don't care about quality), and it is much less enjoyable.
TL;DR: While LLMs are faster to spit out first pass code, by the time I've validated and fixed the LLM's first-pass work, I could've had my "by-hand" implementation done correctly, and had much deeper understanding out of the box. Net loss.
Even if it does speed me up, coding with LLMs sucks all the joy out of writing software. Constantly babysitting the agent(s) and reviewing their output carefully is probably more exhausting than just writing things myself.
It is significantly easier to micro-manage an AI than a suite of junior developers. The AI doesn't replace a principal engineer, it's replacing junior and weaker senior developers who need stories broken down extremely concisely to be able to get anything done. The time it takes to break down a story such that a junior through weak senior developers can pick it up and execute it well would have the AI already done with testing built around it.
Juniors learn. Some juniors are potential good seniors. Over time they will internalise good architecture and be able to make good judgments on their own.
Micromanaging LLMs is like having Dory from Finding Nemo as your colleague. You find ways to communicate, but there is no learning going on.
LLMs can learn, just not the same way that juniors do. When an LLM does something wrong you can always update its rules or skills to not make that mistake again. Or you can utilize a subagent whose sole purpose is to review code to prevent that mistake. Lots of ways you can improve LLMs over time.
Of course if you don't provide that feedback loop, no learning happens. I guess the same could be said of a junior, though.
Building larger systems of accountability isn't usually what people mean by learning. And besides, if telling an LLM not to do something were actually reliable, then LLMs would be a lot more useful than they are. And even if that were reliable, then you're just reinventing expert systems, which didn't work.
I'm not sure the point of contention is whether or not an arbitrary language model is capable of understanding new concepts, and of not making the same mistake again, while it is being used.
When people compare LLMs to juniors it's "can I have it do something pretty brain numbing, and when it makes mistakes can I invest time into preventing that from happening again, either systemically or via training?"
IME this is true for LLMs, at least in how my team has been utilizing them. This doesn't make juniors worthless, as they can be useful for things that LLMs aren't good at.
No, but you can fire them. Can you fire an AI you're paying for? Yes, but your options are another AI that is just as bad, or worse.
And it's really someone's fault for hiring a bad junior. Someone did interview them, right? Maybe the person that hired them is the problem. And maybe the person that decided to go all-in on AI is also the same problem.
I think if you tried working with some junior folks, you'd be quite surprised. You know, with at least some of them choosing to use their brains and all.
Honestly, I think so. I do a mix of infrastructure and programming so don't tend to have any frameworks memorized. Using AI is much quicker than constantly referencing the docs.
I can also switch between codebases with different frameworks and languages and make changes without spending all day reading docs.
It's also pretty good at tracing code, and it's fairly straightforward to verify the results manually. It can build a flow diagram in 10-30 minutes (depending on what tool calls need to be allowed and how many prompts it needs) versus me taking a couple of hours to do the same.
I don't micromanage it. I let my project's custom linter micromanage it.
Every project should have a custom linter for their tech stack. It would check for not just syntax errors, but architectural choices as well as taste guidelines.
Whenever the LLM writes bad code, I add it to my linter to check against in the future.
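To make "architectural choices" in a linter concrete, here is a hedged sketch in Python using the standard `ast` module. It enforces an import boundary as a stand-in for the "Views do NOT access other views' state" rule quoted at the top of the thread. The `app.views.<name>` layout and the function `cross_view_imports` are hypothetical, and an import check is only a proxy: it will not catch state reached through some shared object.

```python
import ast


def cross_view_imports(module_name: str, source: str) -> list[str]:
    """Report absolute imports that reach from one view module into another.

    Assumes a hypothetical layout where each view lives in app.views.<name>.
    Relative imports are ignored in this sketch.
    """
    prefix = "app.views."
    if not module_name.startswith(prefix):
        return []  # the rule only applies to view modules

    def is_other_view(target: str) -> bool:
        return (
            target.startswith(prefix)
            and target != module_name
            and not target.startswith(module_name + ".")
        )

    problems: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Cover both `from app.views.x import y` and `from app.views import x`.
            targets = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        for target in targets:
            if is_other_view(target):
                problems.append(f"line {node.lineno}: {module_name} imports {target}")
                break  # one report per import statement is enough
    return problems
```

A rule like this is cheap to run on every change, but it encodes today's architecture. When the invariant itself needs to change, the linter will keep steering the agent toward the old one.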
Yeah I agree. It's improved quite a bit just in the past few months. The code should always be reviewed, and you need to spend some time tuning your skills and agent configs. If you're still getting bad code out of your LLM tooling, you might not be using or configuring it correctly.
Sure. That's how I work with AI, and the way I believe that AI is meant to be used -- as a companion tool.
But it's a lot of work. It saves me time for certain tasks, but not others. I haven't measured my productivity gains, but they're at most 2x.
But that's not "vibe coding" (which was the point of the article) or the (false) promise of "10x productivity" and "code that writes itself" that companies are being told is going to reduce their engineering headcount tenfold.
"Force" is often an unrealistic expectation, though. Taking Claude Code as an example: you can add as many rules / guidelines as you want in instruction files, but they will not be followed 100% of the time, and more is not better [1].
You can of course use PreToolUse hooks to block particularly damaging actions of the "rm -rf" variety, but this is also not 100% guaranteed unless you're able to block _all_ ways of performing that damaging action (and you would be surprised: agents will happily write custom python / bash / etc. scripts to do actions you tried to block them from doing!)
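To illustrate why a denylist is hard to make airtight, here is a small Python sketch of the kind of check such a hook might run on a shell command. The function `looks_like_recursive_rm` is hypothetical, and how a real hook receives the command or signals a block is deliberately left out, since that depends on the harness.

```python
import shlex


def looks_like_recursive_rm(command: str) -> bool:
    """Naive check: is this shell command an `rm` with a recursive flag?"""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input is treated as suspicious
    if not tokens or tokens[0] != "rm":
        return False
    for tok in tokens[1:]:
        if tok == "--recursive":
            return True
        # Short flag clusters such as -rf, -fr, -R.
        if tok.startswith("-") and not tok.startswith("--") and ("r" in tok or "R" in tok):
            return True
    return False


# The direct form is caught...
print(looks_like_recursive_rm("rm -rf build"))
# ...but the same effect through a one-off script is not.
print(looks_like_recursive_rm("python -c \"import shutil; shutil.rmtree('build')\""))
```

The second command deletes the same directory and passes the check, which is the failure mode described above: blocking one spelling of an action is not the same as blocking the action.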
Tools help instruct the agent to redo work e.g. to pass linter / formatter checks or relevant tests. But I've also seen them ignore those, often enough to be noticeable: e.g. "17 of 18 tests pass, the other 1 wasn't introduced by this feature" - regardless of whether that's actually true or not, regardless of whether I put "ALWAYS make sure ALL affected tests pass" in an instruction file somewhere.
This isn't to refute your main point: yes, you can improve your chances that AI will write good code. But there is no magic bullet that will force it, 100% of the time, to write good code; this is where vibe coders without requisite coding + engineering skills hit a wall. A multi-layered approach of guidelines + progressive disclosure + tools + hooks indeed reduces the probability of bad code enough to be useful for many engineering tasks.
I completely agree. I have tried N different ways to use AI and the one that really works for me is getting the AI to build one modular feature at a time (a method, a basic class, etc.), step by step. I then review and fix if necessary. It works really well.
To me it feels like controlling a power tool. These things have a sort of momentum to them, because they do stuff so fast. It's easy to let the tool get out of hand.
I agree with this. I've been writing a new internal framework at work and migrating consumers of the old framework to the new one.
I had strong principles at the outset of the project and migrated a few consumers by hand, which gave me confidence that it would work. The overall migration is large and expensive enough that it has been deferred for nearly a decade. Bringing down the cost of that migration made me turn to AI to accelerate it.
I found that it was OK at the more mechanical and straightforward cases, which are 80% of the use cases, to be fair. The remaining 20% need changes to the framework. Most of them need very small changes, such as an extra field in an API, but one or two require a partial conceptual redesign.
To oversimplify the problem, the backend for one system can generate certain data in 99% of cases. In a few critical cases, it logically cannot, and that data must be reported to it. Some important optimizations were made with the assumption that this would be impossible.
The AI tooling didn't (yet) detect this scenario and happily added migration logic assuming it would work properly.
Now, because of how this is being rolled out, this wasn't a production bug or anything (yet). However, asking the right questions to partner teams revealed it and unearthed that some others were going to need it as well.
Ultimately, it isn't a big problem to solve in a way that will mostly satisfy everyone, but it would have been a big problem without a human deeper in the weeds.
Over time, this may change. Validation tooling I built may make a future migration of this kind easier to vibe code even if AI functionality doesn't continue to improve. Smarter models with more context will eventually learn these problems in more and more cases.
The code it generates still oscillates between beautiful and broken (or both!) so for now my artistic sensibilities make me keep a close eye on it. I think of the depressed robot from the Hitchhiker's Guide to the Galaxy as the intelligence behind it. Maybe one day it'll be trustworthy.
> What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code...
To be fair, there are many people like this as well. One of my personal favorite examples was way back in the 80s when I inherited the code for a protocol converter that let ASCII terminals communicate with IBM mainframes via the 3270 protocol.
One of the pieces of code in there, for managing indicator lights, was simply wrong. It was ca. 150 lines of Z80 assembly language that was trying to faithfully follow the copious IBM documentation of how things worked, but it had subtle issues and didn't always work.
My approach was to accept the documentation as accurate (the IBM documentation was always verbose and almost never wrong), but to reason that the original 3270 had these functions implemented in TTL logic gates, and there was no way in heck that they were wasting enough gates on indicator lights to require the logical equivalent of 150 instructions.
So in my mind, it had to be a really simple circuit that had emergent properties that required the reams of documentation. With that mindset, I was able to craft correct code for this in 12 instructions.
Many systems are likewise fractal in nature. You want to figure out the generating equations, rather than all the rules that derive from those. And, in many cases, writing down the generating equations is at least as easy to do in code as it would be to do in English for someone or something else to implement.
> eventually, you'll want to add a feature that clashes with that invariant
I find this to be a big problem with spec driven development: no spec survives the real world, some invariant that was in the spec will inevitably turn out to be wrong, no matter how much time you spend researching and designing the spec.
When I as a human hit this during development, I can take a step back and think it through, and decide oh yes, the invariant is wrong and needs to be thought through again, and the impact of changing it needs to be assessed. Then I can design around it. Sometimes that means a substantial change in design, sometimes not, but in all cases the resulting software is better for it: an unknown has been uncovered, something new has been learned.
When this happens to AI, it keeps churning on it until it manages to hack a solution together, under the potentially wrong assumptions, design, or invariant. It doesn’t have the insight to step back and holistically reevaluate.
At least, that’s been my experience working with AI. I think we can improve its ability to handle these situations through good workflows and verification, but it doesn’t come naturally to AI, it isn’t something Claude Code or similar tools support out of the box, and it has its limits.
“The only people I've heard saying that generated code is fine are those who don't read it.” Are you sure these people aren’t busy working rather than chatting? (haha)
But in all seriousness it depends on what you’re doing with it. Writing a quick tool using an LLM is much easier than context switching to write it yourself. If you need the tool, that’s very valuable.
Also as a webdev, it writes basic CRUD pretty well. I am tired of having to build forms myself and the LLMs are usually really good at that.
Been building a new app with lots of policies and whatnot and instructing an LLM is just much faster than doing the same repetitive shit over and over myself.
If you were tired of writing forms yourself, had you looked at https://jsonforms.io/? Just specify the data you need, or extract it from the API spec, and go. Display the form uniformly every time across your site. No need to burn AI time.
I typically avoid most abstractions or third-party dependencies. Yea it could be neat, but I still need a lot of custom logic here and there.
Same reason I avoid stuff like GraphQL.
A little update: upon viewing the page on phone, for me the "comitter" field in the demo is going out of bounds... Really not speaking for their product.
I think you're missing the point of the commenter. A third party library is a new dependency. Since there's new vulnerabilities almost every week in the npm ecosystem, if you can do something without a third party, it's probably better.
With LLM driven code you can generate code once, and then if anything is shitty about it you can always manually update it yourself without the need of an LLM. It's a dependency of convenience, not an app-dependency.
From the description of the recommended tool it sounded to me like something that you use to deterministically generate code from a spec, which you could then modify if you like. That would be the same kind of dependency as the LLM workflow you describe, except that the abstraction is well-defined in a way that the LLM is not. Whether it's good or not is a different question.
That would be nice if it were the case but from what I can gather from this interesting dependency graph, there's a hard dependency on its renderer and schema.
I don't know or care about that specific tool, or really what you do at all, I was just reacting to how the principle you stated conflicts with the practice you described. How you reconcile those is up to you.
Mate, that's literally what you implied, innit? You probably "can" do it yourself, but you choose not to - I wonder why? Also the point of sarcasm is to communicate it in such way that it is obvious, without using the "/s" signifier. You know like, telling a joke at a party that you don't have to explain.
Isn't that the whole concept of "technical debt" though? This has been how software has been developed for quite a while, even pre-LLM. Sometimes your boss puts a thousand things on your plate and you take shortcuts on less important things to save time, and sometimes it works out well and sometimes it doesn't.
Yea because having 200 different abstractions and DSLs makes stuff easier for sure! Why not use all the stuff that was popular 6 years ago like Prisma, GraphQL and Redux, whoops suddenly you need a whole team of devs knowing all kinds of unnecessary abstractions.
Based on the examples you provided, I think the term you're looking for is "external dependencies" not "abstractions"
Edit: Incidentally, I tend to treat "code made by an LLM" and "external dependencies" pretty much the same. Pretty low trust, with a strong interface between it and any code that matters
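For what it's worth, here is roughly what I mean by a "strong interface", as a minimal Python sketch. Everything in it is hypothetical: `parse_quantity` stands in for some generated or third-party code, and the 1..1000 range is an arbitrary example rule. The idea is just that the code that matters only ever calls the checked wrapper.

```python
from typing import Callable


def parse_quantity(raw: str) -> int:
    """Hypothetical stand-in for generated or third-party code we have not audited."""
    return int(raw.strip())


def checked_quantity(raw: str, parser: Callable[[str], int] = parse_quantity) -> int:
    """The narrow interface the rest of the code is allowed to call."""
    value = parser(raw)
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    # Arbitrary example range; a real boundary would encode its own rules.
    if not 1 <= value <= 1000:
        raise ValueError(f"quantity out of range: {value}")
    return value


print(checked_quantity(" 42 "))
```

The wrapper doesn't make the low-trust code any better; it only limits how far a bad result can travel.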
Having a JSON file handle a form schema I provide abstracts away directly building the form myself with actual tech supported by most browsers, hence why I call it abstraction.
I usually only use stuff that either is raw Js, HTML, CSS or whatever builds on top of it. Never something that introduces some DSL and generates files for said environments.
> Prisma, GraphQL and Redux, whoops suddenly you need a whole team of devs knowing all kinds of unnecessary abstractions.
Ah, let me guess / you're one of those non-technical PMs who can finally shove it to the devs - by spitting out unreadable HTML storing all its data in a flat file? Oh boy, do I have news for you...
I am actually a full stack dev working with Vue and Laravel a lot atm. Also have quite some experience with Golang. I like lightweight frameworks and simple stuff, and yes, I avoid solutions by people trying to be smart over being simple.
...which means you depend on the LLMs? Of course strictly "to save time". It's not like you are slowly forgetting how to start a project in the first place or implement that db integration, right?
LOL, why would I ever forget how to start a project, connect to a DB, or make migrations and whatnot? Brother, generating a web form for creating and updating models is not that big of a deal. An LLM can do this while providing a11y attributes and proper styling in like 10 minutes. This includes creating a migration, which I look at and correct if needed, creating the model, creating the required policies, creating the controller endpoints, which I correct if needed, and creating a template file for the CRUD operations with search and pagination and whatnot, while making it look somewhat good.
I can do all of this myself, but why would I waste 1-2 hours (per model) on doing all that myself if I can just instruct some stupid LLM to do it for me? It's repetitive boilerplate.
This is a weird thing to point out. I've always had to look up how to start a project even before LLMs, even with years of experience. With React there's vite, React router, nextjs, tanstack. With nodejs there's Koa, hapi, express, and tons others.
Most fullstack engineers are likely not starting a lot of new projects at work, and may only be doing it a few times for side projects, LLM or no.
I‘ve used it in a previous engagement. Unfortunately it’s not customizable enough, and performance for deep forms is really bad. Also, I‘d definitely use agents to set it up.
This is the core unspoken bone of contention in most AI arguments, I think: most people either aren't writing code with strict quality requirements or don't realize where their use of AI is violating them.
That said most of the world's most useful code has strict quality requirements. Even before AI 90% of SLOC would be tossed away without much if any use, 9% was used infrequently while 1% runs half the world's software.
I think this misses the scale of the problem. Review never fixed tech debt, nor did it fix relevant/bloated test suites. It didn't solve complexity, or eliminate footguns. Very few people (I would argue almost no one) had developed theories for what all of these even were, or how to spot them in code.
Reviewers aren't perfect, far from it. And we just gave them ~20x more code to review. Incentives mean that taking 20x longer to review is unacceptable. So where do we go from here?
I'm really amused at how quickly things changed from "yes, the AI is writing it all, but I'm carefully reviewing every line" to "it would take too long and be too confusing to review any of it"
Agreed. Reviewing code takes so much longer and is far more exhausting than writing it, and you still don’t understand the logic as well or intuitively as you would if you write it.
Code reviews should be done by someone other than the author though, so the only thing that changes with ai generated code in that respect is the amount of it
Before: One person writes the code (and likely understands it thoroughly), another person reviews the code to spot obvious mistakes or shortcomings. Now: AI writes the code, a person reviews it to spot obvious mistakes or shortcomings.
In the before case, you have a person who has a deeper understanding of the code and in the AI case, you don’t, instead you have even more code to review.
When a competent programmer is writing the code, the human written code tends to be higher quality too. So it’s not just about review quantity but the quality of code being reviewed. Some people claim the AI writes great code, but that just hasn’t been my experience yet (at least with the models I’ve tried, including Opus). They still make ridiculously bad decisions regularly.
>When a competent programmer is writing the code, the human written code tends to be higher quality too
This is a great idea, but on average is deeply untrue. Far and away most programmers today write significantly worse code than LLMs. Also LLMs are fantastic at generating high level summaries and comments in code
> Far and away most programmers today write significantly worse code than LLMs
Your experience with LLMs does not match my own. Not to say that I haven’t experienced terrible human written code where I’ve wondered what the author could possibly have been thinking, but overall, I still find LLM written code to be on the poor side.
Like, the code itself is ok, but the wider picture reasoning and abstractions are bad. It also makes really dumb decisions far too often. Or doggedly shoehorns its first idea in no matter how badly it fits.
The invariant, stated informally, would be hard to prove is broken by a human reviewer in the loop. Spoken language isn’t precise enough for the task.
Even if you could state it in a precise formal language the LLM under the agent doesn’t have the capability to understand what the invariant is for and why it’s important. You’ll still get oddly generated code. You might get an LLM that can associate certain tokens with those in the formal language specification which can hold invariants and perhaps even write the proofs… but you’ll still get a whole bunch of other code generated from the informal parts of the prompt.
I agree that simply adding constraints and prompts to your skills and specs isn’t going to prevent these things. Worse, even if you could invent a better mousetrap, the creature would still escape.
The problem is… “elongation”: the addition of code for the sake of the prompt/task/etc. Often less is better. This takes a human with the ability to anticipate what other humans would want/expect. When you need a generator, they’re great, but it’s a firehose whose use should be restrained a little more.
> The invariant, stated informally, would be hard to prove is broken by a human reviewer in the loop. Spoken language isn’t precise enough for the task.
That depends on the invariant. Some are behavioural, like "variable x must be even if y is positive", but some are architectural, such as "a new view requires a new class".
But that's only one side of the problem because maintaining the invariant can be just as bad as breaking it. You ask the agent to add a feature and it may well maintain the invariant - only it shouldn't have, because the feature uncovers the fact that the invariant is architecturally wrong.
The problem is that evolving software requires exercising judgment about when you need to follow the existing strategy and when you need to rethink it. If there is any mechanical rule that could state what the right judgment is, I don't know what it is.
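To make the behavioural example concrete, here is a minimal Python sketch of "x must be even if y is positive". The function names and sample values are made up for illustration. Checking this kind of invariant is mechanical; deciding whether the invariant should still exist is the part no check like this covers.

```python
def holds(x: int, y: int) -> bool:
    """Behavioural invariant: x must be even if y is positive."""
    return y <= 0 or x % 2 == 0


def violations(states):
    """Return the (x, y) pairs that break the invariant."""
    return [(x, y) for x, y in states if not holds(x, y)]


# Hypothetical sample states, purely for illustration.
samples = [(4, 1), (3, -2), (7, 5), (0, 0)]
print(violations(samples))  # prints [(7, 5)]
```

An architectural rule like "a new view requires a new class" has no equally small check, which is part of why it is harder to police.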
Yes! I was trying to make this part of my point but you definitely made it much more clear and concise.
With a skilled operator, it could be possible to drive an agent to handle these kinds of changes. I would be concerned that spoken language wouldn't be precise enough to handle the refactoring and changes necessary to make to a code base when an invariant changes... regardless of whether it was a property, architectural, or procedural change. It already can take several prompts and burn quite a few tokens doing large-scale rewrites and code changes. Maybe the parameters and weights can be tuned for this kind of work but I remain skeptical that what we have at present is "efficient" at this kind of work.
And the solution is the same as when work was outsourced: the "patch" was to fix it by writing a spec. Thus I conclude my TED talk with the statement: LLMs are the new outsourcing and run into the same problems.
Not quite, because the architecture often needs to evolve when you learn more as the project evolves. People will complain when they feel the constraints drive them to unnatural workarounds, the agents don't.
You can try telling the agent to stop and ask when a constraint proves problematic, except it doesn't have as good a judgment as humans to know when that's the case. I often find myself saying, "why did you write that insane code instead of raising the alarm about a problem?" and the answer is always, "you're absolutely right; I continued when I should have stopped." Of course, you can only tell when that happens if you carefully review the code.
So I run a solo saas that supports my family, and so the stakes feel very high for me. I use AI heavily, and I’ve seen the exact problem you’re describing. I feel like I’m often really riding the edge in terms of trying to use AI to accelerate product development while not letting tech debt accumulate too fast, or let my mental model of the codebase slip too much.
Here’s what’s working for me right now:
1. The basics: use best model available, have skills and rules that specify project guidelines, etc.
2. Always use plan mode. It works much better to iterate on the concept of what we’re going to do, then do the implementation. The models will adhere to the plan at very high rates in my experience.
3. Don’t give chunks of work that are too large in scope. This is just art, and I’m constantly experimenting with how ambitious I can be.
4. I review all code to some extent, but I have a strong mental model of what areas of the app are more critical, where hidden bugs might accumulate, etc, and I review both tests and impl more strenuously in those areas. Whereas like a widget for my admin panel probably gets a 2 second glance.
5. Have the discipline to go through periodically and clean up tech debt, refactor things that you’d do differently now, etc. I find the AI a huge help here, because I can clean up cruft in an hour that would have once taken me days, and thus probably wouldn’t have gotten done.
6. I’m experimenting with shifting my architecture to make it easier to review AI code, make it less likely it’ll make mistakes, etc. Honestly mostly things I should have always been doing, but the level of formalism and abstraction on my solo projects is usually different than on a bigger team.
To each their own, but I’ve grown this from nothing to about $350k in ARR over the last ten months, and I’m very confident I never could have built this product without AI help in triple that time.
I mean, a lot of companies learned that lesson the hard way. It turns out that skilled workers cost money in India just like in the US, so if you pay bottom of the barrel rates you get crappy talent. So many companies wound up winding back their outsourcing initiatives when they produced subpar results. I predict that exactly the same thing will happen with the AI craze.
And not only do you need to read the output of the code, but you need to write code, at least in my experience. I've had a quirky architecture pattern that I've been using for about 2 months now, and every time I use it I've felt slightly unsettled. I finally had a realization last night that it's not a good abstraction and how to divide it better. But, I don't feel that pain nearly as acutely when I have an LLM generate my code, so it's taken me longer to register that there is an issue, and also how to address it.
Ancillary parts I don't mind generating, but for core features I still need to be actively writing most of the time.
>Yep. The only people I've heard saying that generated code is fine are those who don't read it.
If you already have a mature code base, then it's very easy to get AI to write excellent code. It has a ton of documentation on what you already do, how you do things, functions to use etc.
I read all the changes AI does. I work in small chunks.
>Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered
The agent can modify the structure you want to change 100x faster than you can. That's the beauty of it. We all know how hard it is to make architectural changes manually once you've started to lock into something.
These comments just show me you must not be using AI in the right way, or haven't used it enough to learn "how" to use it. I've been using Claude Code for months now at full speed. You are simply wrong that it doesn't generate good code.
I'm surprised this still needs to be said. I'm convinced that posts like these are from people that let the LLM run wild. Small chunk PRs are the key, whether it's a human or an LLM.
The generated code is more than fine, it’s good in many cases. And I read it :)
Indeed for the task of “jump into an unfamiliar codebase and make a requested change that aligns with existing styles and patterns, and uses existing functionality” I would say something like opus 4.7 exceeds the capabilities of most developers.
I agree with both statements, but that doesn't change the problem I stated. If an agent produces reasonable code 80-90% of the time, and 10-20% of the time it makes mistakes that could render the codebase irretrievably unevolvable once they accumulate, the only thing you can do is to carefully review the agent's output 100% of the time. That it gets things right 80% of the time as opposed to 40% of the time doesn't change this calculus one iota.
But agents generate code much faster, and to not slow them down, some people want to skip the only thing that can currently ensure you get good results, which is to carefully review the output. Once that happens, there is simply no way for them to know how good or bad what they're getting is.
Human developers don't produce code at such a rate, and their judgment is, on average, better. So one, the review doesn't make you feel like you're slowing things down much, and two, the problems are less hidden.
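To put rough numbers on the 80% vs. 40% point, here is a back-of-the-envelope Python sketch. It assumes (unrealistically) that each change independently carries a fixed chance of a damaging mistake; the function name and the change counts are made up for illustration.

```python
def p_at_least_one(p_bad: float, n_changes: int) -> float:
    """Chance that at least one of n independent changes contains a damaging mistake."""
    return 1.0 - (1.0 - p_bad) ** n_changes


# 0.10 and 0.20 are the error rates from the comment; 0.60 is the "right 40% of the time" case.
for p_bad in (0.10, 0.20, 0.60):
    row = [round(p_at_least_one(p_bad, n), 3) for n in (1, 5, 20)]
    print(f"p_bad={p_bad:.2f} -> n=1, 5, 20: {row}")
```

Under that assumption, after a couple dozen unreviewed changes the chance of at least one damaging mistake is high at either error rate, which is why a better hit rate doesn't remove the need for review.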
I can only presume you work with talented people somewhere that is not representative of most companies. You're definitely overestimating the average programmer's abilities.
Well, the AI's judgment (i.e. if you accept it) leads to a codebase that cannot handle evolution for more than 18-24 months or thereabouts. If you bother to look you can literally see it rotting at 5x speed (all while passing all tests, especially the ones it writes, right up until the point it collapses and cannot be saved). Since most software codebases last longer, whoever is in charge of the judgment - be they average or not - is obviously doing a far better job than today's LLMs.
I don't agree and in my experience the rot happens way faster in handcrafted codebases with constant requirement ratcheting. You resort to shortcuts and code duplication to avoid breaking existing things. This is just the reality when you work under stress in a growing company. AI is much better at keeping up without deteriorating it.
I tend to agree. Taking shortcuts is one thing, not daring to refactor along the way is another. I would only do this in low stress situations due to the risk of producing new bugs or issues, and just lacking the time to properly update tests etc. Opus 4.7 sometimes makes suboptimal design decisions, especially in terms of overcomplicating things, but I have not seen it produce an actual bug in smaller changes in a long while.
The other is using Agents as critical reviewers. I've let Opus 4.7 review PRs by very senior people. Most of the suggestions are meh, but usually there's at least 1 or 2 that improve the code base unequivocally.
And humans produce 100% reasonable code or what? The kind of mess me and everyone I've worked with produces by hand is the inverse of that. Constant shortcuts and lazy slop through and through. Never worked anywhere where the code wasn't an entangled disarray.
As soon as requirements change the abstractions fall apart and everything gets shoehorned.
The only people having success with LLMs right now are people who don't actually care about quality. Anyone who cares about producing good work recognized a long time ago that LLMs are not fit for purpose, and isn't relying on them.
Funny how you spend your days spreading this nonsense, like if someone would deny reality just because you keep repeating it. Everyone knows that what you're saying isn't true, so you're wasting your time.
It’s honestly kinda sad watching otherwise smart people have their cognition completely hijacked by their fear and insecurity. I had multiple comments on this post get downvoted because I shared my positive experience working with AI in my own company. Apparently that’s too threatening to some, who spend their time insulting the skills and intelligence of people who work with a technology they find threatening, rather than engaging their curiosity to understand what it’s good for and what it isn’t. Sad.
No, humans don't produce 100% reasonable code, but the nature of human mistakes, foibles and unreasonableness is very different from the kind slop farming yields.
> Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Yeah, I’ve been working for several months on a harness that wraps Claude Code, Codex, etc. to ensure that these types of invariants are captured and enforced (after the first few harness attempts failed), and while it’s possible, it slows down the workflow significantly and burns a lot more tokens. In addition to requiring more human involvement, of course.
I suspect this is the right direction, though, as the alternatives inevitably lead any software project to delve into a spaghetti mess maintenance nightmare.
It's not enough to enforce the invariants because they may need to change. You need to follow the invariants when they're right, and go back and reconsider them when they prove unhelpful. Knowing which is the case requires judgment that today's models are simply incapable of (not consistently, at least).
Yeah, that's what I mean by "more human involvement": the approach is to put the human in the loop in these moments, and to have the LLM know when it should do this.
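A minimal Python sketch of that idea, not the actual harness: it approximates a rule like "views do not access other views' state" by flagging imports between view modules, and on a hit it asks a human instead of trying to auto-fix. The module names, the `review_decision` helper, and the import-based approximation are all assumptions made for illustration.

```python
import ast


def cross_view_imports(module_name: str, source: str, view_modules: set[str]) -> list[str]:
    """Return the other view modules that `module_name` imports by absolute name."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Bare relative imports ("from . import x") have no module name
            # and are not covered by this sketch.
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n in view_modules and n != module_name)
    return found


def review_decision(hits: list[str]) -> str:
    # The check cannot tell whether the change or the invariant is wrong,
    # so a hit is routed to a person rather than "fixed" automatically.
    return "ask_human" if hits else "proceed"


# Hypothetical module sources, purely for illustration.
sources = {
    "views.cart": "import views.checkout\nfrom views.cart_model import Cart\n",
    "views.checkout": "import json\n",
}
for name, src in sources.items():
    hits = cross_view_imports(name, src, set(sources))
    print(name, hits, review_decision(hits))
```

This covers only the enforcement half. Whether the invariant itself should change is still the human's call at the "ask_human" step.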
Yea, happened to me as well: I left my agent to write code, it went down a rabbit hole of solving a TypeScript error, and it ended up removing the package's type files to remove the error from source. lol!
I read all the code I generate with Cursor and some of it smells a bit weird but is easily fixable and most of it is as good as what I would write or better.
I read a bunch of Claude-Code-generated code last week and I was pretty impressed. It followed the established service class paradigm almost as exactly as we'd originally intended. The code was mostly very clean and had copious comments. A big step up from 2025 code.
For the record, I definitely don't immediately read the majority of code Claude writes these days. I just check on it periodically. In terms of code quality it's as good as any human I know of.
What's the difference between asking an AI to write you a module you never read and installing a 3rd-party module without auditing all its source code?
If the 3rd party module is popular, its badness will affect other people too and either the module will get improved or well known workarounds/"best practices" will develop. With AI-generated code, more often than not you're the sole user.
I would use Stripe, curl, and ffmpeg without audits, because I trust them to provide good code and to respect their API. I wouldn’t trust AI to write a Fibonacci series implementation.
yes it all comes back to iteration, the original "vibe coding". for me, programming has always been about making it up as i go along. like an artist starts with one stroke, i started when i was 10 years old typing '10 print "hello, world" 20 goto 10' and i've never really stopped programming that way 47 years later. For me programming is the same as refactoring, they both happen in a continuous Zone throughout the day. The idea of spending this big period at the beginning Defining the Architecture then letting AI fill in the blanks makes no sense because I only know what the architecture is, what the product is, as part of a process of typing all day for days and weeks and months, that never ends.
> "Yep. The only people I've heard saying that generated code is fine are those who don't read it."
I review every line of code I generate with AI. I mainly use an MR-based approach:
1) Provide a tightly scoped technical spec to Codex as a task, and ask for 3x solutions. Usually at least one of them is on the right track, and it is better to ditch a solution that went in the wrong direction than to try to fix it.
2) Review the explanation and diff of the proposed changes line by line, file by file. If I find minor deviations from what I asked, or violations of the codebase architecture/conventions, I write comments in the diff and/or global comments, and ask again for 3x adjusted solutions.
3) Usually, by this point, the solution is ready for me to merge locally and either run local tests or do some manual fine-tuning.
4) Finally, I generate unit tests. I leave them to this stage because I can repeat the same process with the sole intent of generating case-specific unit tests. This way, I can generate/review tests against the final version of the implementation.
This has been working very well for me since our repos are reasonably organized and have a well-defined architecture. In the technical spec, I include the major architectural requirements and code conventions, and I also add a catch-all like "follow the codebase's existing conventions and style", which works reasonably well.
This simple process has enabled me to deliver most minor/medium tasks and bug fixes really quickly while maintaining control over the changes and without lowering the quality bar. For larger and more challenging tasks, I find myself "driving the wheel" (i.e. coding by hand) more often, and using AI code generation in a much more scoped and specific way. So that becomes a different process altogether.
Hilarious to see the insecure AI doomers downvote personal experience comments like this because they don’t fit with their “AI is useless garbage” takes. I used to respect engineers as a class, because I thought we were more rational. Turns out we’re just as likely to be driven by fear and insecurity as anyone else.
This is the rule I have settled on and I can feel why. Writing the first buggy working version with agents is always fun. Then making the software reliable with the agents, the way you want is very painful.
it's not a solved problem but it's not impossible to keep it at bay either. I created this tool for my own project and it does a pretty darn good job at keeping the AI accountable, I have a harness that runs this in a loop and helps refactor as we go like humans do anyways:
I've been coding for 50 years. When I write code, I think it is great work. About five years later, I realize it was crap. This is true of all the code I write.
So, about five years later is the right time for refactoring.
P.S. It takes about five years to forget what you thought you were doing with that code, and see the reality of what you wrote.
Write your code by hand, but AI still serves as something of a stack overflow and code completion tool. Also good for writing tedious things like regex or little one-off utility scripts as well as a first crack at unit tests. Using it to actually write big blocks of important code is a no-no in my opinion as it produces what I would characterize as slop, even if it technically works.
Code that delivers everything that it asks for and more is fine! This has always been the case, it has always been, "If it looks good, it is good." You are an entrepreneur too, you know this in your heart of hearts.
I'm sure you agree broadly with Gabe Newell, "people who don't know how to program who use AI to scaffold their programming abilities will become more effective developers of value than people who've been programming, y'know, for a decade." Look, he's talking about you and me. Programming for a while is quickly becoming worthless. It is of course the journey of programming that gives some people insight to real problems - business, creative, whatever - so it is extra important that the people with the best programming skills use the chatbots to write a lot of code that you and I will absolutely never read.
And anyway, you, as consumer, are constantly using code you have never read. Lots of code is shipped that we never read. There is nothing special about reading code. Even if you and I learned everything by reading code, it doesn't mean that generated code isn't going to create value. It's going to generate tons and tons of value.
Yet another POV is, if you are making code for customers who need to read the code, you are making a mistake, in the long term. It is a very, very interesting way to think about efforts around SBOM and various security companies - a far more informative lens to look at Wiz or Cloudflare, and what value they actually provide, because it's not code - and how relatively little enterprise value the "we read everything" teams at high frequency trading startups really deliver. You know this, you know exactly what I am talking about, it's your experience, so it is surprising to hear from you, talking in generalities against a trend that is obviously coming for all the best programmers.
Yeah, their statement just isn't true. With enough instruction, I've been able to get great output from models. I think that's the key: with detailed, pointed instructions, the output will match.
Indeed, I'm not using LLM output without thorough review.
After reading a bunch of other comments, it sounds like people are referring to letting agents go wild and code whatever off a limited prompt. I'm not using LLMs like that; I'm generally interacting only via conversations with pretty detailed initial prompts. My interactions with the chat after that are corrections/guiding prompts to keep it on point and edit the prompt output from time to time.
This is a fun new position to move the goalposts to, I suppose, but it doesn’t make much sense to me. If I can use AI to plan, implement, test, document, refine, release, and maintain a feature, with review, in 20% of the time it would take me without AI, how exactly does it “defeat the purpose” of using an LLM? My purpose for using the LLM is to solve problems faster, and it does that. What’s yours?
Exactly this. So far it's helped identify blind spots in my thinking, as well as educate me further on the techniques and frameworks I'm already using. It's been tremendously helpful at developing very well thought out _and tested_ software.
They are nowhere near solved. Agents make serious mistakes in judgment, and make them frequently enough to threaten the viability of the codebase unless you slow down and monitor them very, very closely. If you do that, it's all good. If you don't, your codebase is rotting at a superhuman speed underneath you and you have no idea until it collapses.
I agree they make mistakes in judgement, that's the whole point of plan mode. That judgement comes to the surface before lots of tokens are wasted without sight of the overall solution.
It's all very simple. "Use x library, data model should be xyz, do m, not n."
They're obviously not at the point of replacing an experienced programmer as far as knowing the start-to-finish way of accomplishing every detail, that's what the human is for.
Plan mode improves results, but it doesn't solve the underlying problems. Pretty often Claude Opus 4.7 on xhigh will formulate a reasonable enough plan, churn for a while, then come back with a summary that it didn't stick to the plan because it wasn't accurate.
Worse, the disclaimer is buried under a bunch of "did X, did Y on line Z of file a/b/c", as if it's just a minor inconvenience. To the extent the plan was inaccurate, you're left in an undefined state where you might as well undo what it just did.
You have to review the plan and fill in any gaps or correct anything that's wrong. Plan mode often isn't one shot, it might take a few iterations, but once the plan is nailed down, the results are usually very good.
You're right. I think that having it spawn lots of subagents, read everything, and formulate a big, detailed plan, only for that plan to be subtly wrong, while I have to carefully review both the result and the intermediate plans that produced it, is quite tiring. I suppose things slip through.
If you understand these subtle pieces you perceive the AI to get wrong, you should include that in your prompt. Also, unit test and functional test coverage go a long way to ensure correct behavior.
I could also include the correct implementation in the prompt for it to copy, if you get what I'm trying to say. Some amount of laziness or vagueness in the prompt is an intended use case; that's surely the point of having the subagents churn through so many tokens on research before writing the plan that I'm about to disregard. But sure, those are helpful tips.
I admire your perseverance, but in mid-2026, I think you’re wasting your breath. The engineers who are virulently anti-AI like this, without being able to engage honestly about the pros and cons, are being driven by their fear and insecurity.
How am I being virulently anti-AI? I've been a Claude Code max subscriber for many months and find it very helpful. It feels a little unfair to conclude that any criticism is just unfounded fear and insecurity.
The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.
Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and that puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.
Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself to fit those constraints at exactly the point where the constraints themselves should change. If you don't read what the agent does very carefully - more carefully than human-written code, because the agent doesn't complain about contorted code - you will end up with the same "code that devours itself", only you won't know it until it's too late.
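To make the "Views do NOT access other views' state" example concrete, here is a rough sketch of what checking that invariant mechanically could look like. It assumes a layout the post doesn't specify: each view is its own Python module, and "accessing another view's state" shows up as one view module importing another. The function name cross_view_imports and the module names are made up for illustration.

```python
import ast


def cross_view_imports(sources):
    """Return sorted (importer, imported) pairs where one view imports another.

    `sources` maps a hypothetical module name such as "views.cart" to its
    source text. Only absolute imports are inspected; relative imports are
    ignored in this sketch.
    """
    view_names = set(sources)
    violations = set()
    for name, text in sources.items():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                # Covers both "from views.profile import state"
                # and "from views import profile".
                targets = [node.module]
                targets += [f"{node.module}.{alias.name}" for alias in node.names]
            else:
                continue
            for target in targets:
                for view in view_names:
                    # Match the other view module itself or anything below it.
                    if view != name and (target == view or target.startswith(view + ".")):
                        violations.add((name, view))
    return sorted(violations)


# Hypothetical example: the cart view reaches into the profile view.
SOURCES = {
    "views.cart": "from views.profile import state\n",
    "views.profile": "import json\n",
}
print(cross_view_imports(SOURCES))
```

A check like this only tells you that the invariant was broken. It says nothing about which of the three choices is right once a feature collides with it, and an agent can pass the check while routing the same state through a contorted workaround, which is the failure described above.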