vercaemert · 2026-02-06T11:26:30 1770377190

This will be a Harvard Business case study on market share.

Claude Code was instrumental for Anthropic.

What's interesting is that people haven't heard of it/them outside of software development circles. I work on a volunteer project, a webapp basically, and even the other developers don't know the difference between Cursor and Claude Code.

vercaemert · 2026-02-06T11:10:29 1770376229

It's impressive, even if the books and the posts you're talking about were both key parts of the training data.

There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).

This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.

(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)

adastra22 · 2026-02-06T11:21:09 1770376869

I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there were actually 51 spells, and see if it pulls out the extra spell, or to modify the names of some spells, etc.

In your example, it might be the case that the model simply spits out the consensus view, rather than actually finding/constructing this information on its own.
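A minimal sketch of that kind of test, in Python. Everything here is illustrative: `plant_canary`, `found_canary`, and the made-up spell "Fictum Exemplar" are hypothetical names, and the model call itself is left out. The idea is only to plant one spell that cannot be in the training data, then check whether it shows up in the list the model extracts.

```python
import random


def plant_canary(text: str, canary_sentence: str, seed: int = 0) -> str:
    """Insert a made-up sentence after a randomly chosen paragraph of the input."""
    paragraphs = text.split("\n\n")
    rng = random.Random(seed)  # fixed seed so the modified input is reproducible
    position = rng.randint(1, len(paragraphs))  # anywhere after the first paragraph
    paragraphs.insert(position, canary_sentence)
    return "\n\n".join(paragraphs)


def found_canary(extracted_spells, canary_spell: str) -> bool:
    """Return True if the planted spell appears in the model's extracted list."""
    wanted = canary_spell.casefold()
    return any(spell.strip().casefold() == wanted for spell in extracted_spells)


# Hypothetical usage: feed `modified` to the model, then check its answer.
modified = plant_canary(
    "Chapter one.\n\nChapter two.\n\nChapter three.",
    "She raised her wand and whispered 'Fictum Exemplar'.",
)
```

If the model reports the usual list without the planted spell, that would suggest it is recalling the answer rather than reading the input.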

vercaemert · 2026-02-06T11:38:02 1770377882

Ah, that's a good point.

vercaemert · 2026-02-03T19:06:47 1770145607

I'm surprised there isn't more "hope" in this area. Even things like the GPT Pro models; surely that sort of reasoning/synthesis will eventually make its way into local models. And that's something that's already been discovered.

Just the other day I was reading a paper about ANNs whose connections aren't strictly feedforward but, rather, circular connections proliferate. It increases expressiveness at the (huge) cost of eliminating the current gradient descent algorithms. As compute gets cheaper and cheaper, these things will become feasible (greater expressiveness, after all, equates to greater intelligence).

bigfudge · 2026-02-03T21:18:22 1770153502

It seems like a lot of the benefits of SOTA models are from data though, not architecture? Won't the moat of the big 3/4 players in getting data only grow as they are integrated deeper into businesses' workflows?

vercaemert · 2026-02-03T21:28:42 1770154122

That's a good point. I'm not familiar enough with the various moats to comment.

I was just talking at a high level. If transformers are HDD technology, maybe there's SSD right around the corner that's a paradigm shift for the whole industry (but for the average user just looks like better/smarter models). It's a very new field, and it's not unrealistic that major discoveries shake things up in the next decade or less.

vercaemert · 2026-02-03T18:58:48 1770145128

I'd encourage you to try the -codex family with the highest reasoning.

I can't comment on Opus in CC because I've never bit the bullet and paid the subscription, but I have worked my way up to the $200/month Cursor subscription and the 5.2 codex models blow Opus out of the water in my experience (obviously very subjective).

I arrived at making plans with Opus and then implementing with the OpenAI model. The speed of Opus is much better for planning.

I'm willing to believe that CC/Opus is truly the overall best; I'm only commenting because you mentioned Cursor, where I'm fairly confident it's not. I'm basing my judgement on "how frequently does it do what I want the first time".

skapadia · 2026-02-03T19:59:30 1770148770

Thanks, I'll try those out. I've used Codex CLI itself on a few small projects as well, and fired it up on a feature branch where I had it implement the same feature that Claude Code did (they didn't see each other's implementations). For that specific case, the implementation Codex produced was simpler, and better for the immediate requirements. However, Claude's more abstracted solution may have held up better to changing requirements. Codex feels more reserved than Claude Code, which can be good or bad depending on the task.

vercaemert · 2026-02-04T14:55:34 1770216934

This makes a lot of sense to me.

I've heard Codex CLI called a scalpel, and this resonates. You wouldn't use a scalpel for a major carving project.

To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details.

On a side note, I've also spent a fair amount of time messing around in Codex CLI as I have a Pro subscription. It rapidly becomes apparent that it does exactly what you tell it even if an obvious improvement is trivial. Opus is on the other end of the spectrum here. You have to be fairly explicit with Opus, instructing it to not add spurious improvements.

skapadia · 2026-02-04T20:39:00 1770237540

"To come back to my earlier comment, though, my main approach makes sense in this context. I let Opus do the abstract thinking, and then OpenAI's models handle the fine details."

Very interesting. I'm going to try this out. Thanks!

eadwu · 2026-02-03T19:09:56 1770145796

I've tried nearly all the models; they all work best if and only if you will never handle the code ever again. They suck if you have a solution and want them to implement that solution.

I've tried explaining the implementation word for word and it still prefers to create a whole new implementation reimplementing some parts instead of just doing what I tell it to. The only time it works is if I actually give it the code but at that point there's no reason to use it.

There's nothing wrong with this approach if it actually had guarantees, but current models are an extremely bad fit for it.

vercaemert · 2026-02-03T19:17:58 1770146278

Yes, I only plan/implement on fully AI projects where it's easy for me to tell whether or not they're doing the thing I want regardless of whether or not they've rewritten the codebase.

For actual work that I bill for, I go in with instructions to do minimal changes, and then I carefully review/edit everything.

That being said, the "toy" fully-AI projects I work with have evolved to the point where I regularly accomplish things I never (never ever) would have without the models.

teaearlgraycold · 2026-02-03T19:14:55 1770146095

There are domains of programming (web front end) where lots of requests can be done pretty well even when you want them done a certain way. Not all, but enough to make it a great tool.

vercaemert · 2026-01-25T11:43:38 1769341418

Personally, I'm fascinated by the opening for protocol languages to become relevant.

The previous generations of AI (AI in the academic sense) like JASON, when combined with a protocol language like BSPL, seem like the easiest way to organize agent armies in ways that "guarantee" specific outcomes.

The example above is very cool, but I'm not sure how flexible it would be (and there's the obvious cost concern). But, then again, I may be going far down the overengineering route.

vercaemert · 2026-01-24T21:00:23 1769288423

I'd be interested to hear some use cases people have for large contexts on an 8B model. Other than sentiment analysis or summarization (this release implies agentic use). My experience with the general intelligence of agentic interactions is that everything is unusable before 32B for any context greater than 4k tokens.

WaalkTheEaarth · 2026-01-27T13:37:58 1769521078

I personally use an 8B model for general use on my laptop lol, it works like a charm and makes sense (most of the time at least)

vercaemert · 2026-01-23T10:38:23 1769164703

You just need a robust benchmark. As long as you understand your benchmark, you can trust the results.

We have a hard OCR problem.

It's very easy to make high-confidence benchmarks for OCR problems (just type out the ground truth by hand), so it's easy to trust the benchmark. Think accuracy and token F1. I'm talking about highly complex OCR that requires a heavyweight model.
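For anyone unfamiliar with the metrics, here is a rough Python sketch of what "accuracy and token F1" could look like against hand-typed ground truth. The function names are made up for illustration, tokens are assumed to be whitespace-separated, and "accuracy" is assumed to mean page-level exact match; a real benchmark might well define both differently.

```python
from collections import Counter


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall, ignoring token order."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a perfect match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection: each token counts as many times as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match_accuracy(predictions, truths) -> float:
    """Fraction of pages whose OCR output equals the ground truth exactly."""
    pairs = list(zip(predictions, truths))
    return sum(p.strip() == t.strip() for p, t in pairs) / len(pairs)
```

Exact match is harsh on long pages, which is why a token-level score is useful alongside it: one dropped word costs a little F1 instead of the whole page.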

Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost savings.

Some problems aren't so easily benchmarked.

vercaemert · 2026-01-18T11:10:53 1768734653

I was hoping there'd be more discussion about the model itself. I find the last couple of generations of Pro models fascinating.

Personally, I've been applying them to hard OCR problems. Many varied languages concurrently, wildly varying page structure, and poor scan quality; my dataset has all of these things. The models take 30 minutes a page, but the accuracy is basically 100% (it'll still struggle with perfectly-placed bits of mold). The next best model (Google's flagship) rests closer to 80%.

I'll be VERY intrigued to see what the next 2, 5, 10 years do to the price of this level of model.

vercaemert · 2026-01-10T01:19:32 1768007972

yeah, well said

i like what software can do, i don't like writing it

i can try to give the benefit of the doubt to people saying they don't see improvements (and assume there's just a communication breakdown)

i've personally built three poc tools that proved my ideas didn't work and then tossed the poc tools. i've had those ideas since i knew how to program, i just didn't have the time and energy to see them through.

vercaemert · 2026-01-06T18:59:27 1767725967

How do you compare Claude Code to Cursor? I'm a Cursor user quietly watching the CC parade with curiosity. Personally, I haven't been able to give up the IDE experience.

kaydub · 2026-01-07T18:09:52 1767809392

I'm so sold on the cli tools that I think IDEs are basically dead to me. I only have an IDE open so I can read the code, but most often I'm just changing configs (like switching a bool, or bumping up a limit or something like that).

Seriously, I have 3+ claude code windows open at a time. Most days I don't even look at the IDE. It's still there running in the background, but I don't need to touch it.

lizardking · 2026-01-07T01:16:28 1767748588

When I'm using Claude Code, I usually have a text editor open as well. The CC plugin works well enough to achieve most of what Cursor was doing for me in showing real-time diffs, but in my experience, the output is better and faster. YMMV

tstrimple · 2026-01-07T16:44:12 1767804252

I use CC for so much more than just writing code that I cannot imagine being constrained within an IDE. Why would I want to launch an IDE to have CC update the *arr stack on my NAS to the latest versions for example? Last week I pointed CC at some media files that weren't playing correctly on my Apple TV. It detected what the problem formats were and updated my *arr download rules to prefer other releases and then configured tdarr to re-encode problem files in my existing library.

subomi · 2026-01-07T06:00:14 1767765614

I was here a few weeks ago, but I'm now on the CC train. The challenge is that the terminal is quite counterintuitive. But if you put on the Linux terminal lens from a few years ago and start using it, it starts to make sense. The form factor of the terminal isn't intuitive for programming, but it's the ultimate.

FYI, I still use cursor for small edits and reviews.

enum · 2026-01-06T19:02:58 1767726178

I don't think I can scientifically compare the agents. As it is, you can use Opus / Codex in Cursor. The speed of Cursor composer-1 is phenomenal -- you can use it interactively for many tasks. There are also tasks that are not easy to describe in English, but you can tab through them.

smw · 2026-01-07T03:43:25 1767757405

Just FYI, these days cc has 'ide integration' too, it's not just a cli. Grab the vscode extension.