Hacker News | skapadia's comments

"Test harness is everything, if you don't have a way of validating the work, the loop will go stray"

This is the most important part of using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general-purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.


Agents run tools in a loop.

The ability to test their work reliably is a tool; if you don't give them that, it's kinda silly to expect any kind of quality output.


Claude Opus 4.5 is by far the most capable development model. I've been using it mainly via Claude Code, and with Cursor.

I agree anticompetitive behavior is bad, but the productivity gains to be had by using Anthropic models and tools are undeniable.

Eventually the open tools and models will catch up, so I'm all for using them locally as well, especially if sensitive data or IP is involved.


I'd encourage you to try the codex family of models with the highest reasoning setting.

I can't comment on Opus in CC because I've never bitten the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription, and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).

I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.

I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".


Thanks, I'll try those out. I've used Codex CLI itself on a few small projects as well, and fired it up on a feature branch where I had it implement the same feature that Claude Code did (they didn't see each other's implementations). For that specific case, the implementation Codex produced was simpler, and better for the immediate requirements. However, Claude's more abstracted solution may have held up better to changing requirements. Codex feels more reserved than Claude Code, which can be good or bad depending on the task.


This makes a lot of sense to me.

I've heard Codex CLI called a scalpel, and this resonates. You wouldn't use a scalpel for a major carving project.

To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details.

On a side note, I've also spent a fair amount of time messing around in Codex CLI, as I have a Pro subscription. It rapidly becomes apparent that it does exactly what you tell it, even when an obvious improvement is trivial. Opus is on the other end of the spectrum: you have to be fairly explicit, instructing it not to add spurious improvements.


"To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details."

Very interesting. I'm going to try this out. Thanks!


I've tried nearly all the models, they all work best if and only if you will never handle the code ever again. They suck if you have a solution and want them to implement that solution.

I've tried explaining the implementation word for word, and it still prefers to create a whole new implementation, reimplementing some parts instead of just doing what I tell it to. The only time it works is if I actually give it the code, but at that point there's no reason to use it.

There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.


Yes, I only plan/implement on fully AI projects where it's easy for me to tell whether or not they're doing the thing I want regardless of whether or not they've rewritten the codebase.

For actual work that I bill for, I go in with instructions to make minimal changes, and then I carefully review/edit everything.

That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.


There are domains of programming (web front end) where lots of requests can be done pretty well even when you want them done a certain way. Not all, but enough to make it a great tool.


> Claude Opus 4.5 by far is the most capable development model.

At the moment I have a personal Claude Max subscription and ChatGPT Enterprise for Codex at work. Using both, I feel pretty definitively that gpt-5.2-codex is strictly superior to Opus 4.5. When I use Opus 4.5 I’m still constantly dealing with it cutting corners, misinterpreting my intentions and stopping when it isn’t actually done. When I switched to Codex for work a few months ago all of those problems went away.

I got the personal subscription this month to try out Gas Town and see how Opus 4.5 does on various tasks, and there are definitely features of CC that I miss with Codex CLI (I can’t believe they still don’t have hooks), but I’ve cancelled the subscription and won’t renew it at the end of this month unless they drop a model that really brings them up to where gpt-5.2-codex is at.


I have literally the opposite experience, and so does most of AI-pilled Twitter and the AI research community of top conferences (NeurIPS, ICLR, ICML, AAAI). Why does this FUD keep appearing on this site?

Edit: It's very true that the big 4 labs silently mess with their models and any action of that nature is extremely user hostile.


Probably because all of the major providers are constantly screwing around with their models, regardless of what they say.


It feels very close to a trade-off point.

I agree with all posts in the chain: Opus is good, Anthropic have burned good will, I would like to use other models...but Opus is too good.

What I find most frustrating is that I am not sure it is even actual model quality that is the blocker with other models. Gemini just goes off the rails sometimes with strange bugs, like writing random text continuously and burning output tokens; Grok seems to have system prompts that result in odd behaviour...no bugs, just doing weird things; Gemini Flash models seem to output massive quantities of text for no reason...it often feels like very stupid things.

Also, there are huge issues with adopting some of these open models in terms of IP. Third parties are running these models and you are just sending them all your code...with a code of conduct promise from OpenRouter?

I also don't think there needs to be a huge improvement in models. Opus feels somewhat close to the reasonable limit: useful, still outputs nonsense, misses things sometimes...there are open models that can reach the same 95th percentile but the median is just the model outputting complete nonsense and trying to wipe your file system.

The day for open models will come but it still feels so close and so far.


Exactly. We're headed for a discontinuity, not an inflection point.


Hey HN! In 2025, I've spent more time than ever conversing with AI coding agents, particularly Claude Code. These conversations are an intimate look into how we think and solve problems. Every chat with the agent contains valuable solutions, patterns, decisions, and mistakes. So being able to search, analyze, and learn from those interactions isn't just convenient, it's becoming essential.

To help me do this, I built a tool to process Claude Code conversations:

https://github.com/sujankapadia/claude-code-analytics

* Import and search your entire conversation history across projects

* Analyze sessions with any of 300+ LLMs (via OpenRouter) to extract insights and patterns (decisions made, error patterns, how you use AI agents)

* Share insights as GitHub Gists (as long as the text passes a security scan)

* View basic aggregate statistics on Claude Code usage

The tool is built with Python, Streamlit, SQLite with FTS5, OpenRouter, and Gitleaks.

I made this for myself, and sharing it in case it helps you too. Once your conversations are in a database, you can start asking questions like “What were the key technical decisions on this project?”, “How did the agent help to research and prototype this feature?”, "What steps did I take to implement this?" and “What errors does the agent commonly make?”

It’s a work in progress, and I'm planning on adding more features. Currently only tested on macOS 14.7 with Claude Code 2.0.21. If you’re curious what your Claude Code sessions may reveal, take it for a spin!
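To give a feel for what "conversations in a database" buys you, here is a minimal sketch of the kind of full-text query an FTS5 index enables. The schema and column names below are illustrative only, not the tool's actual ones, and FTS5 availability depends on how your SQLite was compiled:

```python
import sqlite3

# Hypothetical schema for indexed conversation messages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(project, role, content)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("analytics", "user", "add a gitleaks security scan before gist upload"),
        ("analytics", "assistant", "decided to use streamlit for the dashboard"),
        ("other", "user", "refactor the parser"),
    ],
)

# FTS5 MATCH supports boolean operators; `rank` orders by relevance.
rows = conn.execute(
    "SELECT project, content FROM messages WHERE messages MATCH ? ORDER BY rank",
    ("security OR scan",),
).fetchall()
print(rows)  # → [('analytics', 'add a gitleaks security scan before gist upload')]
```

Questions like "what errors does the agent commonly make?" then become full-text queries plus an LLM summarization pass over the matching messages.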


Exactly. Prompt + Tool + External Dataset (API, file, database, web page, image) is an extremely powerful capability.


But the AI coding agent can then ask you follow-up questions, consider angles you may not have, and generate other artifacts like documentation, data generation and migration scripts, tests, and CRUD APIs, all in context. If you can reliably do all that from plain pseudocode, that's way less verbose than having to write out every different representation of the same underlying concept by hand.

Sure, some of that, like CRUD APIs, you can generate via templates as well. Heck, you can even have the coding agent generate the templates and the code that will process/compile them, or generate the code that generates the templates given a set of parameters.


1000% agree.


I expect less time spent on boilerplate and documentation, and more time spent on iterating, experimenting, and increasing customer satisfaction. I also wouldn't accept "I don't know how to do that" as an answer. Instead, I'd encourage "I don't know how to do that, but I can use AI to learn faster, and also seek out someone with experience to help review my work".


Add LLM-powered chat to your app. Translate English into executable JSON commands using just TypeScript, Node, and OpenAI—no frameworks! Start extremely simple and get better at prompt engineering, before delving into things like tool calls and MCP servers.
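The comment describes a TypeScript/Node setup; the same English-to-JSON-command pattern is sketched here in Python for consistency with the rest of the thread. The prompt, command names, and stubbed model reply are all made up; in production the reply would come from a chat-completions call using the system prompt:

```python
import json

# Hypothetical system prompt constraining the model to a JSON command schema.
SYSTEM_PROMPT = """Translate the user's request into exactly one JSON object:
{"command": "<addTodo|removeTodo|listTodos>", "args": {...}}
Respond with JSON only, no prose."""

ALLOWED_COMMANDS = {"addTodo", "removeTodo", "listTodos"}

def parse_command(model_output: str) -> dict:
    """Validate the model's reply before executing anything from it."""
    cmd = json.loads(model_output)
    if cmd.get("command") not in ALLOWED_COMMANDS:
        raise ValueError(f"unknown command: {cmd.get('command')!r}")
    return cmd

# Stubbed reply standing in for an actual API response.
model_output = '{"command": "addTodo", "args": {"text": "buy milk"}}'
print(parse_command(model_output))
```

The validation step is the important part: the LLM proposes, but only whitelisted commands with parseable arguments ever execute.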


Exactly. This is precisely what I do.

