More

osti · 2026-03-27T03:32:33 1774582353

Doesn't the chat version of chatgpt or gemini also have interleaved tool calls, so do those also count as with harnesses?

WiSaGaN · 2026-03-27T06:01:50 1774591310

Harness is fine. I think people here are arguing what provided here to take the test is not harness.

osti · 2026-03-24T03:25:09 1774322709

Seems like the high compute parallel thinking models weren't even needed, both the normal 5.4 and gemini 3.1 pro solved it. Somehow Gemini 3 deepthink couldn't solve it.

osti · 2026-03-21T18:01:56 1774116116

During flights? Sounds a bit harsh.

cobbzilla · 2026-03-21T18:06:28 1774116388

Have you ever tried to sleep while the person next to you watches a movie at full volume?

furyofantares · 2026-03-21T18:15:36 1774116936

Yeah, it sucks. I agree with you, they should be brutally murdered.

nxpnsv · 2026-03-21T18:22:42 1774117362

That's too harsh, a regular murder would suffice.

sharkweek · 2026-03-21T18:33:08 1774117988

Just put them in row 24 on a Boeing 737 max and let the problem take care of itself.

halapro · 2026-03-21T18:42:01 1774118521

Just open the window

lostlogin · 2026-03-21T18:44:56 1774118696

Boeing tried this new feature.

halapro · 2026-03-21T18:46:58 1774118818

Not a bug, works as intended.

lelanthran · 2026-03-21T18:34:06 1774118046

> That's too harsh, a regular murder would suffice.

Correct. Kicking someone off during a flight and not giving them a parachute counts as a regular murder...

verdverm · 2026-03-21T18:48:57 1774118937

Requisite link to satirical study

"Parachute use to prevent death and major trauma when jumping from aircraft: randomized controlled trial"

https://www.bmj.com/content/363/bmj.k5094

anigbrowl · 2026-03-22T02:57:17 1774148237

It's not murder if they're guilty. Those planes come with doors for a reason.

rendaw · 2026-03-21T19:09:07 1774120147

For all siblings, I think parent was suggesting "while in flight". i.e. dropping them from 30k feet. Hence harsh...

quietsegfault · 2026-03-21T18:03:17 1774116197

NO TICKET

lelanthran · 2026-03-21T18:46:56 1774118816

I wonder how many people got this reference.

Anyway, for those who did not: https://www.youtube.com/watch?v=rCZ86O3PO-U

shagie · 2026-03-21T19:54:44 1774122884

Could have also gone for Dogma (which of course references that clip) https://youtu.be/PpckOsftaP4?si=DDlDY3ZK7FoUcKrn&t=41

throwaway894345 · 2026-03-21T18:13:01 1774116781

Seems like this flew right over a few heads.

widowlark · 2026-03-21T18:17:12 1774117032

and yet the joke fell right into our laps

sebastiennight · 2026-03-21T18:19:42 1774117182

United says we should tone down the sarcasm

Hamuko · 2026-03-21T18:53:49 1774119229

Harsh, but fair.

SOLAR_FIELDS · 2026-03-21T18:59:38 1774119578

Now explain why it wouldn’t also be fair to kick people off that were loudly emitting disgusting flatulence. Is it because they “might” not have control over it? Can I not claim I also “might” not have the control over my impulsive desire to listen to music or that I can’t use headphones for a medical issue?

I mean such a thing I would say equally detracts from the flying experience, so why not also kick those people off?

Edit: not sure why I’m getting downvoted, this is a legitimate question. I genuinely want to hear the justification.

DaSHacka · 2026-03-21T19:09:33 1774120173

You'd have a more convincing argument if you argued for a passenger with Tourette's or something. Bodily functions are obviously different from watching a movie at full volume, because there's never a situation where you would be involuntarily blasting the audio of your show or whatever to the whole plane.

SOLAR_FIELDS · 2026-03-21T19:12:25 1774120345

Okay, Tourette’s then. Should we kick people off for Tourette’s?

Your comment also presupposes two things: that flatulence is always involuntary and blasting music isn’t. Let’s say I have a form of Tourette’s that forces me to involuntarily blast noise and music and I have medical papers to prove it. Is it okay then?

I would absolutely support it if you could demonstrate that those two things are actually true. My point is: Who gets to decide what’s legitimately an involuntary medical issue and what isn’t, and where is the line that demarcates it? And what is the point of this exercise? It’s to prevent people from forcing everyone else to have a worse experience for their own personal gain, which flatulence is a form of that you could argue, so why is blasting music fundamentally different?

recursive · 2026-03-21T19:34:24 1774121664

We're talking about music coming from a phone. Not a person. Just turn the phone off or uninstall tiktok. Or put it in your bag.

vel0city · 2026-03-21T19:53:03 1774122783

Are you seriously making the argument blasting music or a movie or whatever is an involuntary bodily function?

SOLAR_FIELDS · 2026-03-21T20:43:55 1774125835

Yes. Because I'm asking the question who decides what is involuntary or not. Who is it? It seems like there is a presupposition here, but who is defining that?

Coming back to the Tourette's example: let's say someone starts shouting cuss words and loudly annoying everyone else "involuntarily". Do they get kicked off the plane? Why or why not? Who decides that? Does the person have to present medical evidence that they have Tourette's to not get kicked off the plane? If so, can they also present medical evidence of a condition that causes them to spontaneously press play on their mobile devices with no headphones and would that be accepted?

I'm obviously not defending the behavior of the loud-music-on-plane-players, or advocating that everyone needs to smell everyone's farts. I'm pointing out that this is something that is arbitrary and weaponizable.

anigbrowl · 2026-03-22T02:59:21 1774148361

I vote to throw you off the plane for disingenuous baitposting.

vel0city · 2026-03-21T23:14:29 1774134869

You don't understand that a phone isn't a part of the human body? Seriously? We as a society can't even come to agreement on that basic fact anymore?

If someone shoots a gun in a crowd is that too an involuntary bodily function? Is the gun not just part of their body? Are you confused by that as well? Where do we draw the limits on what is the human body? Who decides that? If I lay on the ground does the whole earth become my body?

RobotToaster · 2026-03-21T18:57:34 1774119454

Not harsh enough. They belong in the special level of hell reserved for child molesters and people who talk in the theatre.

chisel192 · 2026-03-21T18:06:34 1774116394

> During flights? Sounds a bit harsh.

Sounds harsh to you.

Let the market decide.

Vote with your wallet and fly a different airline.

saint11 · 2026-03-21T18:11:51 1774116711

But kicking someone off mid-flight at high altitude is still a bit harsh. I hope they give them parachutes at least.

dguest · 2026-03-21T19:09:01 1774120141

FUN FACT: Aviation rules require that any plane carrying a parachute must have at least one for every person on board. Hopefully the reason is obvious.

Now given that, do you really want to pay the extra cost of flying with 300 parachutes just so mr-full-volume-phone can have one?

3eb7988a1663 · 2026-03-21T19:56:22 1774122982

That is an incredibly fun fact. Does this only apply to commercial or also a little Cessna? Presumably there is no actual enforcement on the private planes.

dguest · 2026-03-22T08:47:57 1774169277

I made it too fun: what I said was at best an over-genarlization. The actual rules [1] apply to acrobatics and say that parachutes are required for everyone when non-crew passenger is on the plane:

    Unless each occupant of the aircraft is wearing an approved parachute, no pilot of a civil aircraft carrying any person (other than a crewmember) may execute any intentional [acrobatic] maneuver...

So without the passenger no one needs a parachute, with them everyone does.

It's perfectly legal for a 787 to carry a few parachutes just for the full-volume passengers.

[1]: https://faraim.org/faa/far/cfr/title-14/part-91/section-91.3...

jjmarr · 2026-03-21T19:26:28 1774121188

I've packed my own parachute for this hypothetical situation.

HPsquared · 2026-03-21T18:22:19 1774117339

Only if they paid extra at check-in.

doubled112 · 2026-03-21T18:26:09 1774117569

And you specifically have to request it. It isn’t a normal option during purchase.

vel0city · 2026-03-21T19:57:17 1774123037

Nah, with how ticketing is these days they'll bug you a dozen times to choose between the $50 basic economy disaster package that only has the mask and 50% airflow or the full package for $100 that includes another 25% airflow and a flotation device. Business execute gets you the parachute, a private life raft, and a few days of MREs for $250.

gumby271 · 2026-03-21T18:12:33 1774116753

Bet it won't happen twice though.

MPSimmons · 2026-03-21T18:35:41 1774118141

> give them parachutes at least

the first time

andrewflnr · 2026-03-21T18:29:23 1774117763

I'm going to vote with my wallet by moving United up my priority list.

integralid · 2026-03-21T18:12:54 1774116774

Either you missed the joke or I missed your sarcasm. I read GP as a joke: being literally kicked out of a flight in air is a death sentence, which is a bit harsh penalty indeed.

osti · 2026-03-21T05:31:37 1774071097

More like oppressed people by all those bs rules.

rayiner · 2026-03-21T14:46:47 1774104407

The only thing being “oppressed” are people’s animal instincts to be disorderly.

osti · 2026-03-21T18:17:36 1774117056

There's a fine line between making ppl civilized and fascism-like level of control. And I believe Japan errs on the other side too much with their ridiculous number of such rules in all areas of life. Even though I recently visited Japan, I can't really speak to how happy they are, but the stereotype is that they are not the happiest ppl out there. I believe their obedience to all such societal rules has a role in it.

osti · 2026-03-08T01:20:47 1772932847

Not true. Geekbench, especially single threaded benchmark, is probably the best we got, it has a bunch of workloads, unlike many other benchmarks like cinebench for example. And they publish all the results on their website, so you can dig into each individual workload and find the ones that apply to you.

And like the other poster mentioned, it correlates well with SPEC, so it's basically a easily accessible SPEC. These days the only benchmark I use to quickly judge some CPU is geekbench.

dkechag · 2026-03-08T10:40:29 1772966429

May I suggest the one I use (I wrote it), which also correlates well with SPEC & Geekbench 5, but also runs the benchmarks on all cores if you want to so you get both max single-thread and max multi-thread: https://github.com/dkechag/dkbench-docker . You basically run 'docker run -it --rm dkechag/dkbench'.

osti · 2026-03-09T19:23:36 1773084216

I took a look, it's not bad but it seems to contain too many micro benchmarks like regex or primes. Geekbench at least has clang which is a subscore that I always look at.

dkechag · 2026-03-10T23:42:00 1773186120

The primes one is my least favourite one indeed, I left it in just because I happened to include it in the very first version and I am thinking it just counts for 5% in the end... The regex ones are "micro" yet quite important, dkbench it's a Perl (and C)-based benchmark (reflects our main code), and the regex engine is the most highly optimized part of the language so regex speed is a good representation of text processing speed in Perl. As I said, the overall score correlates well to SPEC/Geekbench so as a suite it works well. For compiler comparisons I usually compile a language like python or perl as a test, but I did not want to add something like that, to keep it fast with many smaller benchmarks.

osti · 2026-03-21T19:07:07 1774120027

Actually yeah, I shouldn't have said that regex is a microbenchmark, it's indeed an important one.

osti · 2026-03-05T20:00:05 1772740805

It's only that one number that is for sonnet.

0123456789ABCDE · 2026-03-05T20:46:17 1772743577

except for the webarena-verified

osti · 2026-02-26T22:07:14 1772143634

Their company is called Anthropic after all.

moogly · 2026-02-26T22:18:54 1772144334

Anthslopic is more like it.

osti · 2026-02-17T23:04:12 1771369452

ByteDance never really open sourced their models though. But I agree, they will only open source when it doesn't really matter.

osti · 2026-02-12T18:35:22 1770921322

That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.

qinsignificance · 2026-02-12T19:03:03 1770922983

I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own

hsaliak · 2026-02-12T22:43:20 1770936200

this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.

esafak · 2026-02-12T19:29:27 1770924567

It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.

edoceo · 2026-02-12T18:55:28 1770922528

Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)

throawayonthe · 2026-02-12T21:10:11 1770930611

wtf kinda juniors are you interacting with

edoceo · 2026-02-12T21:59:27 1770933567

Lots of self-taught; looking for an entry level.

alsetmusic · 2026-02-13T03:54:21 1770954861

I'm self-taught and I've always understood that adjusting tests to cheat is a fail.

osti · 2026-02-10T23:10:45 1770765045

So now you don't want capitalism?

heraldgeezer · 2026-02-10T23:57:10 1770767830

Okay, lets do your job and career next! Just capitalism bro.

hackeraccount · 2026-02-11T17:48:02 1770832082

There was a debate with Mike Dukakis when one of the moderators asked if he would want the death penalty for someone who killed his wife. He gave some cold blooded answer.

The real answer was probably - I shouldn't be the one who decides what happened to the person who killed my wife.

In the same way - It shouldn't be up to me if I get fired or my job gets shipped overseas or done by an AI. If we're in a free market it's up to all the people who are buying what I'm making. If there's a cheaper way why wouldn't they take it?