I checked the current speed over the API, and so far I'm very impressed. Of course, models are usually less loaded on release day, but right now:

- Older GPT-5 Mini is about 55-60 tokens/s on API normally, 115-120 t/s when used with service_tier="priority" (2x cost).

- GPT-5.4 Mini averages about 180-190 t/s on API. Priority does nothing for it currently.

- GPT-5.4 Nano is at about 200 t/s.

To put this into perspective, Gemini 3 Flash is about 130 t/s on Gemini API and about 120 t/s on Vertex.

This is raw tokens/s for all models; it doesn't exclude reasoning tokens, but I ran the models with none/minimal reasoning effort where supported.
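
For anyone who wants to reproduce this: below is roughly how I time it, a minimal sketch using the OpenAI Python SDK. The model IDs and the reasoning_effort value are assumptions on my part; swap in whatever you're testing.

    # Stream a completion, take the output token count from the final usage
    # chunk, and divide by wall time to get raw t/s (reasoning included).
    import time
    from openai import OpenAI

    client = OpenAI()

    def measure_tps(model: str, service_tier: str = "default") -> float:
        start = time.monotonic()
        stream = client.chat.completions.create(
            model=model,  # e.g. "gpt-5.4-mini" (placeholder ID)
            messages=[{"role": "user", "content": "Write 500 words about rivers."}],
            stream=True,
            stream_options={"include_usage": True},  # last chunk carries usage
            service_tier=service_tier,
            reasoning_effort="minimal",  # where the model supports it
        )
        usage = None
        for chunk in stream:
            if chunk.usage:  # None on every chunk except the final one
                usage = chunk.usage
        elapsed = time.monotonic() - start
        return usage.completion_tokens / elapsed

    print(measure_tps("gpt-5.4-mini"))
    print(measure_tps("gpt-5.4-mini", service_tier="priority"))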

And quick price comparisons:

- Claude: Opus 4.6 is $5/$25, Sonnet 4.6 is $3/$15, Haiku 4.5 is $1/$5

- GPT: 5.4 is $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini is $0.75/$4.5, 5.4 Nano is $0.2/$1.25

- Gemini: 3.1 Pro is $2/$12 ($3/$18 for >200K context), 3 Flash is $0.5/$3, 3.1 Flash Lite is $0.25/$1.5
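
If it helps, here's the same comparison as quick arithmetic: the cost of a typical 10K-input/1K-output request at the base (<200K-context) rates listed above.

    # USD per million (input, output) tokens, from the list above.
    prices = {
        "opus-4.6":       (5.00, 25.00),
        "sonnet-4.6":     (3.00, 15.00),
        "haiku-4.5":      (1.00, 5.00),
        "gpt-5.4":        (2.50, 15.00),
        "gpt-5.4-mini":   (0.75, 4.50),
        "gpt-5.4-nano":   (0.20, 1.25),
        "gemini-3.1-pro": (2.00, 12.00),
        "gemini-3-flash": (0.50, 3.00),
    }
    for model, (inp, out) in prices.items():
        cost = 10_000 / 1e6 * inp + 1_000 / 1e6 * out
        print(f"{model}: ${cost:.4f} per 10K-in/1K-out request")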


Yeah, this speed is excellent! I'm using GPT-5 Mini for my "AI tour guide" (it simply summarizes Wikipedia articles on the fly, which my app surfaces based on geolocation), and it's always been a ~15 second wait before the streaming of a large article summary begins. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement; it really starts to feel more 'real time'.
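
For the curious, the call on my side is nothing fancy; a sketch of it is below (the model ID and prompt are placeholders). Streaming is what makes TTFT the number that matters here: the user sees the first sentence after TTFT, not after the whole summary finishes.

    import time
    from openai import OpenAI

    client = OpenAI()

    def stream_summary(article_text: str):
        start = time.monotonic()
        stream = client.chat.completions.create(
            model="gpt-5.4-mini",  # placeholder model ID
            messages=[{"role": "user",
                       "content": f"Summarize this Wikipedia article:\n\n{article_text}"}],
            stream=True,
        )
        first = True
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                if first:
                    print(f"TTFT: {time.monotonic() - start:.1f}s")
                    first = False
                yield delta  # hand tokens to the UI as they arrive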

IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model doesn't exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer.

This isn't usually an issue comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
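
To make that concrete, here's a back-of-envelope model with made-up numbers: perceived latency is roughly TTFT plus output tokens over throughput, so the higher-t/s model can still lose.

    def total_latency(ttft_s: float, output_tokens: int, tps: float) -> float:
        return ttft_s + output_tokens / tps

    # A 190 t/s model with a 4 s TTFT vs a 120 t/s model with a 0.5 s TTFT,
    # both producing a 300-token answer:
    print(total_latency(4.0, 300, 190))   # ~5.6 s
    print(total_latency(0.5, 300, 120))   # 3.0 s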


Exactly. It's really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard for newer models.

For voice agents, gpt-4.1 and gpt-4.1-mini seem to be the best low-latency models when you need to handle bigger data and more complex asks.

But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nano, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.


Curious to hear why people pick GPT and Claude over Google (when you'd sometimes think they have a natural advantage in costs, resources, business model, etc.)?

Because Claude is so much more expensive, and I rarely need the best.

gpt-5.4 is now really good even for tricky problems. We only take opus-4.6 for the unsolvable ones, or if someone else pays for it.


In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.

Have you used Gemini models for code work? Claude and Codex are miles ahead in terms of quality and how thorough they are.

I wish someone would thoroughly measure prompt processing speeds across the major providers too. Output speeds are useful, but they're already the more commonly measured of the two.

In my use case for small models, I typically generate at most 100 tokens per API call, with prompt processing taking up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and switched to Anthropic's API for this reason alone.

I've found Haiku to be pretty fast at PP, but I'd be willing to investigate another provider if they offer faster speeds.
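
One crude way to approximate it yourself (a sketch using the Anthropic SDK; the model ID is an assumption, and TTFT also includes queueing/network, so treat it as an upper bound): send a long prompt, cap the output, and time until the first streamed token arrives.

    import time
    import anthropic

    client = anthropic.Anthropic()
    long_prompt = "word " * 20_000  # very roughly ~20K tokens of filler

    start = time.monotonic()
    with client.messages.stream(
        model="claude-haiku-4-5",  # assumed model ID
        max_tokens=16,
        messages=[{"role": "user", "content": long_prompt + "\nReply with OK."}],
    ) as stream:
        for _ in stream.text_stream:
            print(f"TTFT with long prompt: {time.monotonic() - start:.2f}s")
            break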


OpenRouter has this information

I do not see prompt processing, only some kind of nebulous “throughput” that could be output or input+output, but definitely not input only.

token/sec is meaningless without the thinking level. If a model is fast but keeps rambling in its reasoning instead of jumping on the task, it can take far longer end-to-end than a low-token/sec model with low or no thinking.
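
Toy numbers to illustrate:

    # A "fast" model that burns 2000 reasoning+answer tokens vs a "slow"
    # one that answers directly in 300 tokens (made-up numbers):
    fast = 2000 / 190   # ~10.5 s end-to-end
    slow = 300 / 60     # 5.0 s end-to-end
    print(fast, slow)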

Man the lowest end pricing has been thoroughly hiked. It was convenient while it lasted.

Wow. How fast is Haiku?

I have a suspicion that this paper is basically a summary with some benchmarks, done with LLMs.

Your suspicion could have easily been cleared by reading the paper.

If you're short on time: the paper reads a bit dry but falls within the norm for academic writing. The GitHub repo [0] shows work over months in 2024 (leading up to the release of 3.13) and some rush from Dec 2025 to Jan 2026, probably to wrap things up for the release of this paper. All commits on the repo are from the author, but I didn't look through the code to check for Copilot intervention.

[0] https://github.com/Joseda8/profiler


They don't actually seem to charge more for >200K tokens on the API. Neither OpenRouter nor OpenAI's own API docs mention increased pricing for >200K context for GPT-5.4. I think the 2x limit usage for longer context is specific to using the model through a Codex subscription.

Didn't the Google v. Oracle case about Java APIs in Android (https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc.) directly disprove this?

In the end, the Supreme Court decided that the re-implementation fell under fair use; it did not answer the copyright question.

People just aren't realizing this yet because it's mostly hobby projects and companies doing it in private, but eventually everyone will realize that LLMs let almost any software be reverse engineered for cheap.

See e.g. https://banteg.xyz/posts/crimsonland/ - a single human, with the help of LLMs, reverse engineered a non-trivial game and rewrote it in another language + graphics lib in two weeks.



Seedream 5 Lite is honestly extremely disappointing: its text-to-image is way worse than 4.5's, and image editing is fine, but that's it. It's way, wayy behind NB2.


It's only for Europe; you should try a US VPN or, in the worst case, use it over Vertex AI, which allows you to generate anyone.


The OP's comment on the post is clearly Markdown-formatted; real humans don't write like that on HN.

The readme is very obviously Claude-written (or by a similar model - certainly not GPT); if you check enough vibecoded projects, you'll easily spot those readmes.

The style of the HTML page, as noted by others.

Useless comments in the source code, which humans also do, but LLMs do more often:

    // Basic random double
    static inline double rand_double() { return (double)rand() / (double)RAND_MAX; }


I did not. The HTML was generated by DeepSeek; Claude is far too expensive for that. This is only experimental code. I don't think it's worth paying Claude to test code that was already peer reviewed theoretically.

