Yeah, this speed is excellent! I'm using GPT-5 mini for my "AI tour guide" (simply summarizes Wikipedia articles for me on the fly, which are presented on my app based on geolocation), and it's always been a ~15 second wait for me before streaming of a large article summarization will begin. With GPT-5.4 it's around 2-3 seconds, and the quality seems at least as good. This is a huge UX improvement, it really starts to feel more 'real time'.
IME tok/s is only useful with the additional context of TTFT and total latency. At this point a given closed model does not exist in a vacuum but rather in a wider architecture that affects the actual performance profile for an API consumer.
This isn't usually an issue comparing models within the same provider, but it does mean cross-provider comparison using only tok/s is not apples-to-apples in terms of real-world performance.
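A rough sketch of capturing all three numbers from a single streaming response (the event list and token count would come from your own client code; the numbers below are made up for illustration):

```python
def latency_profile(events, total_tokens):
    """Compute TTFT, total latency, and decode tok/s from a list of
    (seconds_since_request, chunk_text) streaming events."""
    if not events:
        raise ValueError("no streaming events recorded")
    ttft = events[0][0]            # time to first token
    total = events[-1][0]          # total latency
    decode_time = total - ttft     # time spent emitting tokens after the first
    tok_s = total_tokens / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "total_s": total, "decode_tok_s": tok_s}

# Two providers with identical decode tok/s can feel very different:
fast_start = latency_profile([(0.4, "..."), (10.4, "...")], 500)  # 0.4s TTFT
slow_start = latency_profile([(8.0, "..."), (18.0, "...")], 500)  # 8.0s TTFT
```

Both hypothetical providers decode at 50 tok/s, but the second makes you stare at a blank screen for 8 seconds first, which is exactly what a tok/s-only comparison hides.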
Exactly. It's really frustrating that they don't advertise TTFT etc., and that it's really hard to find any info in that regard for newer models.
For voice agents, gpt-4.1 and gpt-4.1-mini seem to be the best low-latency models when you need to handle bigger data and more complex asks.
But they are a year old, and trying to figure out whether these new models (instant, chat, realtime, mini, nona, wtf) are a good upgrade is very frustrating. AFAICT they aren't; the TTFT latencies are too high.
Curious to hear why people pick GPT and Claude over Google (when sometimes you’d think they have a natural advantage on costs, resources and business model etc)?
In my workplace, it's availability. We have to use US-only models for government-compliance reasons, so we have access to Opus 4.6 and GPT 5.4, but only Gemini 2.5, which isn't in the same class as the first two.
I wish someone would thoroughly measure prompt processing speeds across the major providers too. Output speeds are useful, but they're already commonly measured.
In my use case for small models I typically generate a max of 100 tokens per API call, so prompt processing makes up the majority of the wait time from the user's perspective. I found OAI's models to be quite poor at this and switched to Anthropic's API just for this reason.
I've found Haiku to be pretty fast at PP, but I'd be willing to investigate another provider if they offer faster speeds.
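A crude back-of-envelope way to compare providers on this (my own estimate, not an official metric): treat TTFT as mostly prefill time, subtract a guessed fixed overhead, and divide by prompt size. The overhead value here is an assumption you'd want to calibrate per provider.

```python
def approx_prefill_tok_s(prompt_tokens, ttft_s, network_overhead_s=0.1):
    """Very rough prefill throughput estimate: assumes TTFT is dominated by
    prompt processing and subtracts an assumed fixed network/queueing overhead."""
    prefill_s = ttft_s - network_overhead_s
    if prefill_s <= 0:
        raise ValueError("TTFT smaller than assumed overhead; estimate invalid")
    return prompt_tokens / prefill_s

# e.g. a 4000-token prompt answered with a 2.1s TTFT:
est = approx_prefill_tok_s(4000, 2.1)  # roughly 2000 tok/s prefill
```

For short-output workloads like the 100-token calls above, this estimate matters far more than the advertised decode tok/s.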
tok/s is meaningless without the thinking level. If a model is fast but keeps rambling instead of jumping on the task, it can take far longer than a low-tok/s model with low or no thinking.
Your suspicion could have easily been cleared by reading the paper.
If you're short on time: the paper reads a bit dry, but falls within the norm for academic writing. The GitHub repo shows work over months in 2024 (leading up to the release of 3.13) and some rush from Dec 2025 to Jan 2026, probably to wrap things up for the release of this paper. All commits on the repo are from the author, but I didn't look through the code to check for Copilot intervention.
They don't actually seem to charge more for >200k tokens on the API. Neither OpenRouter nor OpenAI's own API docs mention increased pricing for >200k context on GPT-5.4. I think the 2x usage limit for higher context is specific to using the model via a subscription in Codex.
People are just not realizing this now because it's mostly hobby projects and companies doing it in private, but eventually everyone will realize that LLMs allow almost any software to be reverse engineered for cheap.
See e.g. https://banteg.xyz/posts/crimsonland/ , where a single human with the help of LLMs reverse engineered a non-trivial game and rewrote it in another language + graphics lib in 2 weeks.
Seedream 5 Lite is honestly extremely disappointing: its text-to-image is way worse than 4.5, image editing is fine, but that's it. It's way, way behind NB2.
The OP's comment on the post is clearly Markdown-formatted; real humans don't write like that on HN.
The readme is very obviously Claude-written (or by a similar model - certainly not GPT); if you check enough vibecoded projects you'll easily spot those readmes.
The style of the HTML page, as noted by others.
Useless comments in the source code, which humans also do, but LLMs do more often:
I did not. The HTML was generated by Deepseek. Claude is far too expensive for that. This is only experimental code. I don't think it's worth paying Claude to test code that was already peer reviewed theoretically.
Direct image: https://pbs.twimg.com/media/HDoN4PhasAAinj_?format=png&name=...