
I know it's popular comparing coding agents to slot machines right now, but the comparison doesn't entirely hold for me.

It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

(I saw "no actual evidence pointing to these improvements" with a footnote and didn't even need to click that footnote to know it was the METR thing. I wish AI holdouts would find a few more studies.)

Steve Yegge of all people published something the other day with similar conclusions to this piece - that the productivity boost from coding agents can lead to burnout, especially if companies use it to drive their employees to work in unsustainable ways: https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163


Yeah, I'm finding that there's "clock time" (hours) and "calendar time" (days/weeks/months), and pushing people to work 'more' rests on the fallacy that our productivity runs on clock time (like it does in a factory pumping out widgets) rather than calendar time (like it does in art and other creative endeavors).

Even if the LLM can crank out my requested code in an hour, I'll still need a few days to process how it feels to use. The temptation is to pull the lever 10 times in a row because it was so easy, but now I'll need a few weeks to process the changes as a human.

This is just for my own personal projects, and it makes sense that the business incentives would be even more intense. But you can't get around the fact that, no matter how brilliant your software or interface, customers are not going to start paying within a few hours.

> The temptation is to pull the lever 10 times in a row because it was so easy, but now I'll need a few weeks to process the changes as a human

Yeah I really feel that!

I recently learned the term "cognitive debt" for this from https://margaretstorey.com/blog/2026/02/09/cognitive-debt/ and I think it's a great way to capture this effect.

I can churn out features faster, but that means I don't get time to fully absorb each feature and think through its consequences and relationships to other existing or future features.


If you're really good and fast at validating and fixing code output, or you're not actually validating it beyond making sure it runs (no judgment), I can see it paying out 95% of the time.

But from what I've seen validating both my own and others' coding agent outputs, I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleagues are hooked on generating no matter the quality. Workslop is a very real phenomenon.


This matches my experience using LLMs for science. Out of curiosity, I downloaded a randomized study and the CONSORT checklist, and asked Claude Code to do a review using the checklist.

I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.


Try Codex 5.3. It's dry and very obviously AI; if you allow a bit of anthropomorphisation, it's kind of high-functioning autistic. It isn't an oracle, it'll still be wrong, but it's a powerful tool, completely different from Claude.

Does it get numbers right? One of the mistakes it made in reading the paper was swapping sets of numbers from the primary/secondary outcomes.

It does get screenshots right for me, but obviously I haven't tried it on your specific paper. I can only recommend trying it out; it also has much more generous limits in the $20 tier than Opus.

I see. To clarify, it parsed the numbers in the PDF correctly, but assigned them the wrong meaning. I was wondering if Codex is better at interpreting non-text data.

Every time someone suggests Codex I give it a shot. And every time it disappoints.

After I read your comment, I gave Codex 5.3 the task of setting up an E2E testing skeleton for one of my repos, using Playwright. It worked for probably 45 minutes and in the end failed miserably: out of the five smoke tests it created, only two passed. It gave up on the other three and said they would need “further investigation”.

I then stashed all of that code and gave the exact same task to Opus 4.5 (not even 4.6), with the same prompt. After 15 mins it was done. Then I popped Codex’s code from the stash and asked Opus to look at it to see why three of the five tests Codex wrote didn’t pass. It looked at them and found four critical issues that Codex had missed. For example, it had failed to detect that my localhost uses https, so the E2E suite’s API calls from the Vue app kept failing. Opus also found that the two passing tests were actually invalid: they checked for the existence of a div with #app and simply assumed it meant the Vue app booted successfully.
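To illustrate that last failure mode, here's a rough Playwright sketch; the https dev-server URL, port, and selectors are assumptions for illustration, not the actual test code from either model:

    import { test, expect } from '@playwright/test';

    // Weak smoke test: passes even if the Vue app never boots,
    // because an empty <div id="app"></div> is still attached.
    test('weak: #app exists', async ({ page }) => {
      await page.goto('https://localhost:5173/'); // assumed https dev server
      await expect(page.locator('#app')).toBeAttached();
    });

    // Stronger smoke test: waits for something the app only renders
    // after it has actually mounted (selector is hypothetical).
    test('stronger: app actually rendered', async ({ page }) => {
      await page.goto('https://localhost:5173/');
      await expect(page.locator('#app [data-testid="main-nav"]')).toBeVisible();
    });

The first assertion is exactly the kind of check that "passes" without proving anything; the second fails loudly if the app doesn't boot.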

This is probably the dozenth comparison I’ve done between Codex and Opus. I think there was only one scenario where Codex performed equally well. Opus is just a much better model in my experience.


Moral of the story: use both (or more) and pick the one that works - or even merge the best ideas from the generated solutions. Independent agentic harnesses support multi-model workflows.

I don't think that's the moral of the story at all. It's already challenging enough to review the output from one model. Having to review two, and then comparing and contrasting them, would more than double the cognitive load. It would also cost more.

I think it's far preferable to pick the most reliable one and use it as the primary model, and to think of the others as fallbacks for situations where it struggles.


You should always benchmark your use cases, and you obviously don't review multiple outputs; you only review the consensus.

see how perplexity does it: https://www.perplexity.ai/hub/blog/introducing-model-council
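A minimal sketch of that pattern, assuming a hypothetical callModel wrapper (this is not Perplexity's actual implementation, just the naive shape of "query several models, review only the consensus"):

    type ModelName = 'opus' | 'codex' | 'gemini';

    // Placeholder: wire this to your actual provider clients or agent harness.
    async function callModel(model: ModelName, prompt: string): Promise<string> {
      throw new Error(`callModel not implemented for ${model}`);
    }

    async function consensusAnswer(prompt: string, models: ModelName[]): Promise<string> {
      const answers = await Promise.all(models.map((m) => callModel(m, prompt)));
      // Naive majority vote on normalized answers; real systems use a judge
      // model or richer merging instead of exact string matching.
      const counts = new Map<string, number>();
      for (const a of answers) {
        const key = a.trim().toLowerCase();
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
      let best = answers[0];
      let bestVotes = 0;
      for (const [key, votes] of counts) {
        if (votes > bestVotes) {
          bestVotes = votes;
          best = answers.find((a) => a.trim().toLowerCase() === key)!;
        }
      }
      return best; // the single output a human reviews
    }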


I was going to mention Yegge's recent blog posts mirroring this phenomenon.

There's also this article on hbr.org https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies...

This is a real thing, and it looks like classic addiction.


That 95% payout only works if you already know what good looks like. The sketchy part is when you can't tell the diff between correct and almost-correct. That's where stuff goes sideways.

Being on a $200 plan is a weird motivator: seeing the unused weekly limit for Codex and the clock ticking down, and knowing I can spam GPT 5.2 Pro "for free" because I already paid for it.

It's 95% if you're using it for the stuff it's good at. People inevitably try to push it further than that (which is only natural!), and if you're operating at/beyond the capability frontier then the success rate eventually drops.

Just need to point out that the payout rate is often above 95% at online casinos. As long as it's below 100%, the house still wins.

He means a slot machine that pays you 95% of the time, not one that pays out 95% of what you put in (the return-to-player rate).
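To put toy numbers on the difference (illustrative figures only, not from any real machine):

    const attempts = 100;
    const stake = 1; // $1 per pull

    // "Pays out 95% of the time": high hit frequency, but small wins
    // can still lose money overall. Assume each win returns $0.50.
    const hitFrequencyNet = 0.95 * attempts * 0.5 - attempts * stake; // -$52.50

    // "Pays out 95% of what you put in": a 95% return-to-player rate
    // loses 5 cents per dollar on average, however often it hits.
    const rtpNet = attempts * stake * (0.95 - 1); // -$5.00

    console.log({ hitFrequencyNet, rtpNet });

The two "95%"s are independent: hit frequency says how often you win anything, RTP says how much comes back on average.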

Claude Code wasting my time with nonsense output one in twenty times seems roughly correct. The rest of the time it's hitting jackpots.


> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it

Right, but the <100% chance is actually why slot machines are addictive. If it paid out continuously, the behaviour would not persist as long. It's called the partial reinforcement extinction effect.


> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

“It’s not like a slot machine, it’s like… a slot machine… that I feel good using”

That aside, if a slot machine is doing your job correctly 95% of the time, it seems like either you aren't noticing when it's doing your job poorly, or you've shifted the way you work to only allow yourself to do work that the slot machine is good at.


> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

I think you are mistaken about what the "payout" is. There's only one reason someone is working all hours and during a party and whatnot: it's to become rich and powerful. The payout is not "more code", it's a big house, fast cars, beautiful women etc. Nobody can trick it into paying out even 1% of the time, let alone 95%.


Thanks, that Steve Yegge piece was a very good read.


