Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries.
It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow were just... weird. It was hard to follow and I was seriously bugged by it.
So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand.
So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.
So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.
If you've got some time, I highly recommend going through the exercise of trying to change the prompt in a way that would produce code similar to what you've achieved manually. Doing a similar exercise really helps to improve agent prompting skills, as it shows how changing parts of the prompt influences the result.
I haven’t had any luck prompting LLMs to “have taste.” They seem to over-fixate on instructions (e.g. golfing when asked for concise code) or require specifying so many details and qualifications that the results no longer generalize well to other problems.
Do you have any examples or resources that worked well for you?
Yeah prompting doesn't work for this problem because the entire point of an LLM is you give it the what and it outputs the how. The more how that you have to condition it with in the prompt, the less profitable the interaction will be. A few hints is OK, but doing all the work for the LLM tends to lead to negative productivity.
Writing prompts and writing code takes about the same amount of time, for the same amount of text, plus there's the extra time that the LLM takes to accomplish the task, and review time afterwards. So you might as well just write the code yourself if you have to specify every tiny implementation detail in the prompt.
Makes me think of this CommitStrip comic: https://i.xkqr.org/itscalledcode.jpg (mirrored from the original due to TLS issues with the original domain.)
A guy with a mug comes up to a person standing with their laptop on a small table. The mug guy says, "Some day we won't even need coders any more. We'll be able to just write the specification and the program will write itself."
Guy with laptop looks up. "Oh, wow, you're right! We'll be able to write a comprehensive and precise spec and bam, we won't need programmers any more!"
Guy with mug takes a sip. "Exactly!"
Guy with laptop says, "And do you know the industry term for a project specification that is comprehensive and precise enough to generate a program?"
You know, this makes me wonder... is anybody actually prompting LLMs with pseudocode rather than an English specification? Could doing so result in code that's more true to the original pseudocode?
You can give the macro-structure using stubs then ask the LLM to fill in the blanks.
The problem is that it doesn't work too well for the meso-structure.
Models tend to be quite good at the micro-structure because they've seen a lot of it already, and the macro-structure can easily be prompted, but the levels in between are what distinguishes a good vs bad model (or human!).
Goodhart's Law of Specification: When a spec reaches a state where it's comprehensive and precise enough to generate code, it has fallen out of alignment with the original intent.
Of course there are some systems where correctness is vital, and for those I'd like a precise spec and proof of correctness. But I think there's a huge bulk of code where formal specification impedes what should be a process of learning and adapting.
My dream antiprogram is a specification compiler that interprets any natural language and compiles it to a strict specification. But on any possible ambiguity it gives an error.
?
This terse error was found to be necessary so as not to overwhelm the user with pages and pages of decision trees enumerating the ambiguities.
Openspec does this. But instead of "?" it has a separate Open Questions section in the design document. In codex cli, if you first go in plan mode it will ask you open questions before it proceeds with the rest.
The UX is there, for small things it does work for me, but there is still something left for LLMs to truly capture major issues.
> Do you have any examples or resources that worked well for you?
Using this particular example, if you simply paste the exact code into the prompt, the model should be able to reproduce it. Now, you can start removing the bits and see how much you can remove from the prompt, e.g. simplify it to pseudocode, etc. Then you can push it further and try to switch from the pseudocode to the architecture, etc.
That way, you'll start from something that's working and work backwards rather than trying to get there in the absence of a clear path.
That’s an interesting approach, but what do you learn from it that is applicable to the next task? Do you find that this eventually boils down to heuristics that generalize to any task? It sounds like it would only work because you already put a lot of effort into understanding the constraints of the specific problem in detail.
What worked for me was Gemini 3 Pro (I guess 3.1 should work even better now) with the prompt "This code is unnecessarily complicated. Simplify it, but no code golf". This decreased code size by about 60 %. It still did a bit of code-golfing, but it was manageable.
It is important to start a new chat so the model is not stuck in its previous mindset, and it is beneficial to have tests to verify that the simplified code still works as it did before.
Telling the model to generate concise code did not work for me, because LLMs do not know beforehand what they are going to write, so they are rarely able to refactor existing code to break out common functionality into reusable functions. We might get there eventually. Thinking models are a bit better at it. But we are not quite there yet.
I have a stupid solution for this which is working wonders. It does not help to tell the LLM "don't do this pattern". I literally make it write a regex based test which looks for that pattern and fails the test.
For example, I am developing a game using GDScript, and LLMs (including Codex and Claude) keep making scripts with no class_name declarations and then loading them with preload(). I hate this, and it's explicitly mentioned in my godot-development skill. What agents can't stand is a failing test. Feels a bit like enforcing rules automatically.
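The approach can be sketched in a few lines. Here is a minimal Python version of such a pattern-enforcing test; the regexes and message strings are illustrative, not the commenter's actual suite:

```python
import re

# Illustrative lint rules: flag GDScript sources that load other scripts
# by path via preload() instead of referring to a registered class_name.
PRELOAD_SCRIPT = re.compile(r'preload\(\s*["\'][^"\']*\.gd["\']\s*\)')
HAS_CLASS_NAME = re.compile(r'^\s*class_name\s+\w+', re.MULTILINE)

def lint_script(source: str) -> list[str]:
    """Return the violations found in one GDScript source string."""
    problems = []
    if PRELOAD_SCRIPT.search(source):
        problems.append("loads a script by path instead of class_name")
    if not HAS_CLASS_NAME.search(source):
        problems.append("missing class_name declaration")
    return problems

bad = 'var Enemy = preload("res://enemy.gd")\n'
good = 'class_name Enemy\nextends CharacterBody2D\n'
```

Wired into a test runner, `lint_script` fails the build whenever the unwanted pattern reappears, which is exactly the feedback agents respond to.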
This is a stupid idea but it works wonders on giving taste to my LLM. I wonder if I should open source that test suite for other agentic developers.
I really should spend some time analyzing what I do to get the good output I get..
One thing that is fairly low effort that you could try is find code you really like and ask the model to list the adjectives and attributes that that code exhibits. Then try them in a prompt.
With LLMs generally you want to adjust the behavior at the macro level by setting things like beliefs and values, vs at the micro level by making "rules".
By understanding how the model maps the aspects that you like about the code to language, that should give you some shorthand phrases that give you a lot of behavioral leverage.
Edit:
Better yet.. give a fresh context window the "before" and "after" and have it provide you with contrasting values, adjectives, etc.
Concise isn't specific enough: I've primed mine on basic architecture I want: imperative shell/functional core, don't mix abstraction levels in one function, each function should be simple to read top-to-bottom with higher level code doing only orchestration with no control flow. Names should express business intent. Prefer functions over methods where possible. Use types to make illegal states unrepresentable. RAII. etc.
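A toy sketch of the first constraint above (imperative shell / functional core); every name here is invented for illustration:

```python
# Functional core: a pure business rule, trivially testable in isolation.
def apply_discount(total_cents: int, loyalty_years: int) -> int:
    """5% off per loyalty year, capped at 25% (an invented rule)."""
    pct = min(loyalty_years * 5, 25)
    return total_cents - total_cents * pct // 100

# Imperative shell: orchestration and I/O only, no business logic.
def checkout(order_id: str, read_order, write_receipt) -> None:
    order = read_order(order_id)                      # I/O in
    due = apply_discount(order["total"], order["years"])
    write_receipt(order_id, due)                      # I/O out
```

The shell reads top-to-bottom as pure sequencing, while all the decision-making lives in pure functions, which is the property the prompt is asking the model to preserve.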
You need to think about what "good taste " is to you (or find others who have already written about software architecture and take their ideas that you like). People disagree on what that even means (e.g. some people love Rails. To me a lot of it seems like the exact opposite of "good taste").
I spend much more time refactoring than creating features (though it is getting better with each model). My go-to approach is to use Claude Code Opus 4.6 for writing and Gemini 3.1 Pro for cleaning up. I feel that doing it in just one stage is rarely enough.
A lot of prompts about finding the right level of abstraction, DRY, etc.
I actually don’t think golfing is such a bad thing. Granted, it will first handle the low-hanging fruit like variable names etc., but if you push it hard enough it will be forced to think of a simpler approach. Then you can take a step back and tell it to fix the variable names, formatting, etc. With the caveat that a smaller AST doesn’t necessarily mean simpler code, but it’s a decent heuristic.
I appreciate that your message is a good-natured, friendly tip. I don't mean for the following to crap on that. I just need to shout into the void:
If I have some time, the last thing I want to do with it is sharpen prompting skills. I can't imagine a worse or more boring use of my time on a computer or a skill I want less.
Every time I visit Hacker News I become more certain that I want nothing to do with either the future the enthusiasts think awaits us or the present that they think is building towards it.
While I somewhat understand the impact on the craft, the agents have allowed me to work on the projects that I would never have had enough time to work on otherwise.
You don't need to learn anything; it needs to learn from you. When it fails, don't correct it out of bounds, correct it in the same UI. At the end say "look at what I did and create a proposed memory with what you learned" and if it looks good have it add it to memories.
This better reflects what I thought about the other day. Either you let the clankers do their thing and then bake in your implementation on top, or you think it through and make them do it. But at the end of the day you've still got to THINK of the optimal solution and state of the code, at which point, do clankers do anything aside from saving you a bunch of keypresses, and maybe catching a couple of bugs?
Also useful to encode into the steering of your platform. The incremental aspect of many little updates really help picking up speed by reducing review time.
A big-bang approach could be a start, but a lot of one-line guidance about specific things you don't want to see stacks up real fast.
My mildly amusing anecdote is that, whenever Claude Code produces something particularly egregious, I often find it sufficient to reply with just "wtf?" for it to present a much more sensible version of the code (which often needs further refinement, but that's another story...)
I reported a similar case of mine several days ago [0]. I was able to achieve better quality than Claude Code's 624 lines of spaghetti code in 334 lines of well-designed code. In a previous case, I rewrote ~400-line LLM generated code in 130 lines.
Had the same problem with a Python project. Just for the hell of it I tried to have it implement a simple version of a proxy I've made in the past. What was finally produced "technically" worked, but it was a mess. It suppressed exceptions all over the place, it did weird shit with imports it couldn't get to work, and the way it managed connection state was bizarre.
It has a third-year college student's approach to "make it work". It can't take a step back and reevaluate a situation, or determine a new path forward; it just hammers away endlessly with whatever it's trying until it can technically be called "correct".
When I benchmark LLMs on text adventures, they reason like four-year-olds but have the world's largest vocabulary and infinite patience. I'm not surprised this is how they approach programming too.
> It has a third-year college student's approach to "make it work". It can't take a step back and reevaluate a situation, or determine a new path forward; it just hammers away endlessly with whatever it's trying until it can technically be called "correct".
OH! Yeah I think this is the exact bad feeling I've gotten whenever I've tried testing these things before, except without clear and useful feedback like compiler error messages or something. I remember when I used to code/learn like that early on and...it's not fun now. I also don't think it's really solvable
Yeah, it's really funny to watch. They'll get stuck on a specific method call or a specific import, even if you tell them to read the docs. Doesn't matter if there's a better approach, or that method only exists for some obscure edge case, or the implementation runs counter to the design of the API: if they can hammer the round peg into the square hole, they'll do it.
They also just... ignore shit. I have explicit rules in the repo I'm using an agent for right now that say it is for planning and research only, and that unless asked specifically it should not generate any code. It still tries to generate code 2 or 3 times a session.
We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.
The days of the deep expert, who knew the codebase inside out and had it contained in their head, are coming to an end.
> We’re heading for a world of terrible code that can only be maintained by extremely good coding agents and is pretty much impossible for a human to really understand.
I once figured out the algorithm of the program written in one-instruction ISA. I think the instruction was three-address subtraction.
In my opinion, you overestimate the ability of coding agents to, well, code and underestimate the ability of humans to really understand code.
The chart in the article we are discussing appears to plateau if one excludes the sample from 2024-07. So we are not quite heading there; we are plateauing, if I may.
Probably more like the long tail of software - software that was created for a particular purpose in a particular domain by a single person in the company who also happened to know programming - maybe just as Excel macros.
I strongly assume the long tail is shifting and expanding now and will eventually mostly be software for one-off purposes authored by people who don't know how to code, and probably have a poor understanding of how it actually works.
Hm, yes, it makes sense. If AI "makes" software more and more composable, then yes, most software will be thin wrapper on some ancient machinery that no one understands :)
I guess in some sense this is already the case. Most developers are not "full stack" (and the job postings that describe a software MacGyver are ridiculed like clockwork), but with AI this is actually becoming more and more possible (and thus normal, or at least normalized). And of course software is eating the world, including itself, so the common problems are all SaaS-ified (and/or FOSS-ified), allowing AI-aided development to offload the instrumental dependencies.
Crappy software that works. Unfortunately for all clanker evangelists, you need humans to review all the clanker spaghetti, manage infra, do firefighting, and translate business requirements into working systems (a gross oversimplification of what that entails). Code review alone takes a huge toll on our bandwidth.
I had a similar experience yesterday. Was working on some async stream extensions. Wrote a couple proofs of concept to benchmark, and picked one based on the results. I almost never use LLMs to write code, but out of curiosity, asked whatever the newest claude is to make it with all the real prod requirements, and it spit out over 400 lines of code, lots of spaghetti, with strange flow and a lot of weird decisions. Wrote it myself with all the same functionality in right around 170 lines.
Also had a similar experience in the past weeks reviewing PRs written with LLMs by other engineers in languages they don't know well, one in rust and one in bash. Both required a lot of rounds of revision and a couple of pairing sessions to get to a point where we got rid of the extraneous bits and made it read normally. I'm glad the tool gave these engineers the confidence to work in areas they wouldn't normally have felt comfortable contributing to, but man do I hate the code that it writes.
Once my code exists and passes tests, I generally move on to having it iteratively hunt for bugs, security issues, and DRY code-reduction opportunities until it stops finding worthwhile ones.
This doesn't always work as well as I'd like, but it largely does enough. Conversely, doing it as I go has been a waste of time.
Happens all the time. I usually propose a detailed structure myself (e.g. do it in three phases, add 3 functions plus an orchestrator, make sure the structure is valid before writing the function bodies), or iterate on a detailed plan before implementing code.
Now some people argue that terrible code is fine nowadays, because humans won't read it anymore...
I wonder why they fail in this specific way. If you just let them do stuff, everything quickly turns to spaghetti. They seem to overlook obvious opportunities to simplify things or to see a pattern and follow through. The default seems to be to add more, rather than rework or adjust what’s already in place.
I suspect it has something to do with a) the average quality of code in open source repos and b) the way the reward signal is applied in RL post-training - does the model face consequences of a brittle implementation for a task?
I wonder if these RL runs can extend over multiple sequential evaluations, where poor design in an early task hampers performance later on, as measured by amount of tokens required to add new functionality without breaking existing functionality.
Yeah I've been wondering if the increasing coding RL is going to draw models towards very short term goals relative to just learning from open source code in the wild
To me this seems like a natural consequence of the next-token prediction model. In one particular prompt you can’t “backtrack” once you’ve emitted a token. You can only move forwards. You can iteratively refine (e.g the agent can one shot itself repeatedly), but the underlying mechanism is still present.
I can’t speak for all humans, but I tend to code “nonlinearly”, jumping back and forth and typically going from high level (signatures, type definitions) to low level (fill in function bodies). I also do a lot of deletion as I decide that actually one function isn’t needed or if I find a simpler way to phrase a particular section.
Edit: in fact thinking on this more, code is _much_ closer to a tree than sequence of tokens. Not sure what to do with that, except maybe to try a tree based generator which iteratively adds child nodes.
This would make sense to me as an explanation when it only outputs code. (And I think it explains why code often ends up subtly mangled when moved in a refactoring, where a human would copy paste, the agent instead has to ”retype” it and often ends up slightly changing formatting, comments, identifiers, etc.)
But for the most part, it’s spending more tokens on analysis and planning than pure code output, and that’s where these problems need to be caught.
I feel like planning is also inherently not sequential. Typically you plan in broad strokes, then recursively jump in and fill in the details. On the surface it doesn’t seem to be all that much different than codegen. Code is just more highly specified planning. Maybe I’m misunderstanding your point?
I think your average person knows what sequential means but might not remember what series means. Personally I always remember the meaning of series in “parallel vs series” because it must be the opposite of parallel. I’m not proud of the fact that I always forget and have to re-intuit the meaning every time, but the only time I ever see “series” is when people are talking about a TV show or electronics.
Oh. Indeed you're correct. I was thinking in computer terms instead of scientific terms. Personally I see this as reinforcing that computers as a context wouldn't really benefit from using "proper" SI.
Note that no one is going to confuse mB for millibytes because what would that even mean? But also in practice MB versus Mb aren't ambiguous because except for mass storage no one mixes bytes with powers of ten AFAIK.
And let's take a minute to appreciate the inconsistency of (SI) km vs Mm. KB to GB is more consistent.
> no one is going to confuse mB for millibytes because what would that even mean?
Data compression. For example, look at http://prize.hutter1.net/ , heading "Contestants and Winners for enwik8". On 23.May'09, Alex's program achieved 1.278 bits per character. On 4.Nov'17, Alex achieved 1.225 bits per character. That is an improvement of 0.053 b/char, or 53 millibits per character. Similarly, we can talk about how many millibits per pixel JPEG-XL is better than classic JPEG for the same perceptual visual quality. (I'm using bits as the example, but you can use bytes and reach the same conclusion.)
Just because you don't see a use for mB doesn't mean it's open for use as a synonym of MB. Lowercase m means milli-, as already demonstrated in countless frequently used units - millilitre, millimetre, milliwatt, milliampere, and so on.
In case you're wondering, mHz is not a theoretical concept either. If you're generating a tone at say 440 Hz, you can talk about the frequency stability in millihertz of deviation.
Touché! I had no idea that term was in use. That said, I remain unconvinced that there is any danger of confusion here. Benchmarking compression algorithms is awfully specific; it's normal for fields to have their own jargon and conventions.
> Just because you don't see a use for mB doesn't mean it's open for use as a synonym of MB.
At the end of the day it's all down to convention. We've never needed approval from a standards body to do something. Standards are useful to follow when they provide a tangible benefit; following them for their own sake to the detriment of something immediately practical is generally a waste of time and effort.
I don't believe I hallucinated unit notations such as mB and gB. Unfortunately I don't immediately recall where I encountered their use.
> In case you're wondering, mHz is not a theoretical concept either.
Just to be clear, I was not meaning to suggest that non-SI prefixes be used for quantifying anything other than bits. SI standardized prefixes are great for most things.
I implemented a rational number library for media timestamps (think CMTime, AVRational, etc.) that uses 64-bit numerators and denominators. It uses 128-bit integers for intermediate operations when adding, subtracting, multiplying, etc. It even uses 128-bit floats (represented as 2 doubles and using double-double arithmetic[1]) for some approximation operations, and even 192-bit integers in one spot (IIRC it's multiplying a 128-bit and a 64-bit int, and I just want the high bits, so it shifts back down to 128 bits immediately after the multiplication).
I keep meaning to see if work will let me open source it.
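A minimal Python sketch of why the wide intermediates matter: even when the operands and the reduced result fit in 64/64 bits, the schoolbook cross-products can need roughly twice the width. The timescale values below are made up for illustration.

```python
from fractions import Fraction

I64_MAX = 2**63 - 1

def add_check_intermediate(a: int, b: int, c: int, d: int):
    """Add a/b + c/d the schoolbook way and report whether every
    intermediate product still fits in a signed 64-bit integer."""
    num = a * d + c * b              # these cross-products are what
    den = b * d                      # push you past 64 bits
    fits64 = all(abs(x) <= I64_MAX for x in (a * d, c * b, num, den))
    return Fraction(num, den), fits64

# Small media timescales are fine...
_, small_ok = add_check_intermediate(1, 90_000, 1, 48_000)
# ...but large timestamps over large, coprime timescales are not:
_, big_ok = add_check_intermediate(10**12, 90_000 * 10**7, 1, 48_000 * 10**7)
```

Python's arbitrary-precision ints hide the problem; in C or Rust those products would silently wrap, which is exactly what the 128-bit intermediates guard against.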
> Would you pay for source-available products? GPL and paid license?
I'm not GP but I would at least consider it. I say that as someone who refuses to build on closed-source tooling or libraries. I'd even consider closed-source if there was an irrevocable guarantee that the source would be released in its entirety (with a favorable open source license) if the license/pricing terms ever changed or the company ceased to exist or stopped supporting that product.
> Along with a guarantee that you get to keep access to older versions (Jetbrains and Sublime Text models)?
I like that for personal tools but I wouldn't build my products or business on top of those. I've had too much trouble getting old binaries to work on new OS versions to consider these binaries to be usable in the long term.
I'll check that out. The goal is to get to something that runs all night (or almost all night) with around 1 kWh output using as little space as possible. I've just started poking around, but this'll help.
In the third world there's plenty of sunlight, but you don't need the power during the day necessarily. That price'll get to $400 for storage, $400 for panels, which is ballpark.
GP only has two panels that generate 960 W (I’m going to generously assume NMOT and not STC). That’s hardly anything, and certainly not what I would use to try and charge 10 kWh of battery like they’re suggesting.
But sure, I agree it would help if battery prices came down.
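A quick sanity check makes the mismatch concrete. The panel and battery figures are from the thread; the derate and load numbers below are my own guesses:

```python
# Back-of-the-envelope solar sizing check: 960 W of panel vs 10 kWh of battery.
panel_w = 960
battery_kwh = 10
derate = 0.75        # clouds, angle, temperature, charging losses (assumed)

# Sun-hours needed to fill the battery from empty:
hours_to_full = battery_kwh * 1000 / (panel_w * derate)   # ~13.9 h of sun

night_load_w = 600   # duty-cycled window AC plus small appliances (assumed)
hours_of_runtime = battery_kwh * 1000 / night_load_w      # ~16.7 h
```

Roughly 14 sun-hours to fill the battery is more than a day of good sun, so with only two panels the battery would rarely reach full, even though 10 kWh comfortably covers the overnight load.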
During the day when nobody's home the panels are charging the battery.
Obviously more panels are better.
The goal is to be able to run a small window AC unit and various small appliances at night. That's a tremendous quality-of-life upgrade for a huge number of people. $1000 USD would make it somewhat affordable, in the window for a viable small business/NGO opportunity. There's obviously a whole lot more (installation, labor, maintenance, etc), but material cost needs to be low for it to work.
That story doesn’t work for people with depression who otherwise have very good lives.
I grew up in a stable household with a loving family and both parents present and supportive. I’ve never had financial hardship, either as a kid depending on my parents to provide or as an adult providing for myself and family. I did very well in school, had plenty of friends, never had enemies, never got bullied or even talked bad about in social circles (so far as I know…). I have no traumatic memories.
I could go on and on, but despite having a virtually perfect life on paper, I have always struggled with depression and suicidal ideation. It wasn’t until my wife sat down and forced me to talk to a psychiatrist and start medication that those problems actually largely went away.
In other words, I don’t think there’s a metaphorical “cow” that could have helped me. It’s annoying we don’t understand what causes depression or how antidepressants help, and their side effects suck. But for some of us, it’s literally life saving in a way nothing else has ever been.
First of all, I want to write that I am glad you found something that worked so that you are able to remain here with us.
Though, I am curious about the, "otherwise have very good lives" part.
Whose definition are you using? It seems the criteria you laid out fits a "very good life" in a sociological sense -- very important, sure. You could very well have the same definition, and perhaps that is what I am trying to ask. Would you say you were satisfied in life? Despite having a good upbringing, were you (prior to medication) content or happy?
I am by no means trying to change your opinion nor invalidate your experiences. I just struggle to understand how that can be true.
As someone that has suffered with deep depressive bouts many times over, I just cannot subscribe to the idea that depression is inherently some sort of disorder of the brain. In fact, I am in the midst of another bout now. One that's lasted about 3 or so years.
To me, I have always considered emotions/states like depression and anxiety to be signals. A warning that something in one's current environment is wrong -- even if consciously not known or difficult to observe. And if anyone is curious, I have analyzed this for myself, and I believe the etiology of my issues are directly linked to my circumstances/environment.
> I don’t think there’s a metaphorical “cow” that could have helped me.
The smart-ass in me can't help but suggest that maybe medication was your cow?
To be honest, I've never really thought about it... I suppose I mean in both a sociological and a self-fulfillment way.
> Would you say you were satisfied in life? Despite having a good upbringing, were you (prior to medication) content or happy?
I would say "yes" overall. Aside from the depression (typically manifesting as a week or two of me emotionally spiraling down to deep dark places every month or so), I was very happy and satisfied. That's what makes the depression so annoying for me. It makes no sense compared to my other aspects of life.
> In fact, I am in the midst of another bout now. One that's lasted about 3 or so years.
*fist bump*
> To me, I have always considered emotions/states like depression and anxiety to be signals. A warning that something in one's current environment is wrong -- even if consciously not known or difficult to observe. And if anyone is curious, I have analyzed this for myself, and I believe the etiology of my issues are directly linked to my circumstances/environment.
I think that's a great hypothesis so long as it's not a blanket applied to everyone (which I don't think you're doing, to be clear; I mention this only because it is what motivated my original response to the other commenter).
I don't want to go into private details of family members without their permission, but I will say that given the pervasive depression in my family and mental health issues like schizophrenia and bipolar disorders (neither of which I have, thank goodness), I feel like there's something biologically... wrong (for lack of a better word?)... with us, particularly since you can easily trace this through my mother's side.
> The smart-ass in me can't help but suggest that maybe medication was your cow?
Ha fair. I interpreted the story to be about depression being a symptom of your situation (job, health, etc.) and if you just fixed that then there's no need for medication. That definitely makes sense in some (many? most?) situations. But not all, unfortunately.
Take my baseless speculation for what it's worth, but could it be that you were depressed because your life was too easy? We humans are meant to struggle through adversity. Can you really appreciate your financial security if you've never faced financial insecurity, or appreciate companionship if you've never experienced loneliness?
It’s a reasonable question but I doubt it. We weren’t affluent at all and I worked my butt off for everything. And that’s good, because I agree that if things are too easy it turns into a curse.
> I don’t think there’s a metaphorical “cow” that could have helped me.
The medication is the cow for you. In this story your support system figured out what would work best for you, which was medication, and facilitated that.
It’s a story about a doctor that serves patients in rural Cambodia. Help from the local community would look different in Borey Peng Huoth, for example.
The story in the article that is being discussed here does not say that the doctor was explicitly not a member of the community that he served. You would have to just sort of make that part up and then come up with an explanation for how the doctor even knows that story if he wasn’t part of that community.
The doctor in the story exists in pretty recent history, which you would call modernity. If for some reason you’re using “modernity” as a way to say “systemic alienation of the individual” rather than “modernity” meaning “happening in the modern world” then yes, by your definition of that word, it is indeed a story about “modernity” being to blame for poor treatment for depression.