nowittyusername's comments | Hacker News

How does the whole KV cache situation work for diffusion models? Are there latency and compute/cost savings from caching? Is the curve similar to the autoregressive caching options? Or maybe such things don't apply at all, and you can just mess with the system prompt and dynamically change it every turn because there are no savings to be had? Or maybe you can make dynamic changes to the head but still get cache savings because of the diffusion-based architecture?... so many ideas...

There are many ways to do it, but the simplest approach is block diffusion: https://m-arriola.com/bd3lms/

There are also more advanced approaches, for example FlexMDM, which essentially predicts the length of the "canvas" as it "paints" tokens onto it.


Nice, I'm excited to try this for my voice agent. At worst it could be used to power the human-facing agent for latency reduction.

Would love to hear about your experience. Send us an email.

As we approach the singularity, things will get noisier and make less and less sense, because rapid change can look like chaos from inside the system. I recommend folks just take a deep breath and take a look around. Regardless of your stance on whether the singularity is real, or whether AI will revolutionize everything or not, forget all that noise. Just look around and ask yourself: do things seem more or less chaotic? Are you better or worse at predicting what is going to happen? How far out can your predictions land now versus, say, 10 or 20 years ago? Conflicting signals are exactly how all of this looks: one account says it's the end of the world, another says nothing ever changes and everything is the same as it always was...

Do you have any resources or YouTube videos that might help someone understand the LCM context management a bit better? I think there's something to this, but I'm having trouble wrapping my head around it. I learn well with analogies, and I'm trying to really grok the concept here. If there are other ways you could explain it, that would be appreciated. Mind you, I have built my own agents from scratch, so I'm not a total novice in these areas; my agents already manage context with sub-agents and multi-layered conversational histories, with RAG thrown in there. But I don't want to make wrong assumptions about your implementation and miss the nuanced, important bits. Regardless, I'll try my best to reread the article and hash it out on my own. Thanks for the paper.

Hi NWU,

We don't have any other materials yet, but let's see if this lands for you. I can run you through a couple simpler versions of the system, why they don't work, and how that informs our ultimate design.

The most basic part of the system is "two layers". Layer 1 is the "ground truth" of the conversation - the whole text the user sees. Layer 2 is what the model sees, i.e., the active context window.

In a perfect world, those would be the same thing. But, as you know, context lengths aren't long enough for that, so we can't fit everything from Layer 1 into Layer 2.

So instead we keep a "pointer" to the appropriate part of Layer 1 in Layer 2. That pointer takes the form of a summary. But it's not a summary designed to contain all information. It's more like a "label" that makes sure the model knows where to look.

The naive version of the system would allow the main model to expand Layer 2 summaries by importing all of the underlying data from Layer 1. But this doesn't work well, because then you just end up re-filling the Layer 2 context window.

So instead you let the main model clone itself, the clone expands the summary in its context (and can do this for multiple summaries, transforming each into the original uncompressed text), and then the clone returns whatever information the main thread requires.

Where this system would not fully match the capabilities of RLMs is that, by writing a script that calls itself e.g. thousands of times, an RLM has the ability to make many more recursive tool calls than can fit in a context window. So we fix that using operator-level recursion, i.e., we give the LLM a tool, map, that executes arbitrary recursion, without the LLM having to write a custom script to accomplish that.
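To make the two-layer scheme concrete, here is a minimal toy sketch of the idea as I described it above. All names are illustrative, not LCM's actual API, and the "summarizer" is a stand-in for whatever produces the label-style pointers:

```python
# Toy sketch of the two-layer context scheme (hypothetical names).
# Layer 1 holds full ground-truth segments; Layer 2 holds short
# summary "pointers". A cloned context expands pointers on demand
# so the main context window never re-fills with raw text.

class TwoLayerContext:
    def __init__(self, summarize):
        self.layer1 = []          # full ground-truth segments
        self.layer2 = []          # (segment_id, summary) pointers
        self.summarize = summarize

    def append(self, text):
        """Record a new segment in Layer 1 and its pointer in Layer 2."""
        seg_id = len(self.layer1)
        self.layer1.append(text)
        self.layer2.append((seg_id, self.summarize(text)))
        return seg_id

    def expand(self, seg_ids):
        """What the clone does: swap summaries back for the full
        underlying text, then return only what the main thread needs."""
        return [self.layer1[i] for i in seg_ids]

# Trivial "summarizer" for illustration: first five words as a label.
ctx = TwoLayerContext(lambda t: " ".join(t.split()[:5]) + " ...")
ctx.append("The user asked about quarterly revenue numbers for the EU region.")
ctx.append("The assistant produced a long table of figures and caveats.")
print(ctx.layer2[0][1])    # label-style pointer, not a full summary
print(ctx.expand([0])[0])  # clone recovers the original uncompressed text
```

The key property is that `expand` happens in the clone's context, so the main thread only ever receives the clone's filtered answer, not the raw Layer 1 text.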

Hope this helps!

- Clint


I am in the process of integrating LCM into my own personal assistant agent as its context management system. The main human-facing agent will not be a coding agent, so I'll be modifying the system prompt and some other things quite heavily, but the core concepts of the system will be the backbone. Now that I am playing around with it, I am hoping you can answer some questions. I notice that the system prompt of the agent mutates, as local time is injected into the system prompt itself. If that's what's happening, you are destroying any hope of caching from the provider, are you not? Am I reading this correctly, or was this a deliberate choice for some reason, instead of appending it at the end of the user's turn as system metadata, which would preserve the head? Thanks.
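For anyone following along, here is a minimal sketch of the caching concern, assuming a chat-completions-style message list (the function names and message shapes are illustrative, not LCM's actual code). Providers cache the longest byte-identical prefix of a request, so anything that mutates near the head invalidates the cache every turn:

```python
# Sketch of prefix-cache-friendly vs. cache-busting timestamp placement.
# Message structure is the common {"role": ..., "content": ...} shape.

import datetime

SYSTEM_PROMPT = "You are a helpful assistant."  # stable -> cacheable

def build_messages_cache_busting(history, user_msg):
    now = datetime.datetime.now().isoformat(timespec="minutes")
    # Timestamp baked into the head: the cached prefix dies each turn.
    return [{"role": "system", "content": f"{SYSTEM_PROMPT}\nLocal time: {now}"}] \
        + history + [{"role": "user", "content": user_msg}]

def build_messages_cache_friendly(history, user_msg):
    now = datetime.datetime.now().isoformat(timespec="minutes")
    # Timestamp rides along at the tail as turn metadata; the system
    # prompt and all prior turns stay byte-identical across requests.
    return [{"role": "system", "content": SYSTEM_PROMPT}] \
        + history \
        + [{"role": "user", "content": f"[local time: {now}]\n{user_msg}"}]
```

With the second form, only the final user turn changes between requests, so the provider can reuse the KV cache for everything before it.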

Thanks for the reply. That does help.

Good article, and I agree with everything in there. For my own voice agent I decided to make him push-to-talk by default, as the problems with the model accurately guessing the end of utterance are just too great. I think it can be solved in the future, but I haven't seen a really good example of it being done with modern-day tech, including this lab's. Fundamentally it all comes down to the fact that different humans have different ways of speaking, and the human listening to them updates their own internal model of the speech pattern, adjusting it after a couple of interactions and arriving at the proper way of speaking with that person. Something very similar will need to be done, at very low latencies, for it to succeed in the audio ML world. But I don't think we have anything like that yet. It seems the best you can currently do is tune the model on a generic speech pattern that you expect to fit a large percentage of the human population, and that's about it; anyone who falls outside of that will feel the pain of getting interrupted every time.

Check out Sparrow-0. The demo shows an impressive ability to predict when the speaker has finished talking:

https://www.tavus.io/post/sparrow-0-advancing-conversational...


Thanks, I'll read it now.

It feels like this is one of those areas where the last 10% of polish will take 90% of the effort

The 80/20 rule always wins

I wholeheartedly agree. In an age of talking heads, you will not hear from the people actually doing the thing, because they're too busy doing the thing versus talking about it. Now excuse me, ima go back to doing the thing.


I have been playing around with over 10 stt systems in the last 25 days, and it's really weird to read this article, as my experience is the opposite. Stt models are amazing today. They are stupid fast, sound great, and are very simple to implement, as Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, "supertonic", was exactly the model I would have recommended if people wanted to see how amazing the tech has become. The model is tiny, runs 55x real time on any potato, and sounds amazing. Also, I think he is implementing his models wrong. He mentions that some models don't have streaming and you have to wait for the whole chunk to be processed. But that's not a limit in any meaningful way, as you can define the chunk. You can simply make the first n characters within the first sentence the chunk, process that first, and play it immediately while the rest of the text is being processed. TTFS and TTFA on all modern-day models are well below 0.5s, and for supertonic it was 0.05s in my tests.....
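The chunking trick above can be sketched in a few lines. This is a hypothetical sketch, not any particular model's API: `synthesize` and `play` stand in for a non-streaming TTS call and an audio sink. The point is simply to cut the text so the first chunk is tiny and starts playing while the rest renders:

```python
# Sketch of first-chunk splitting to fake streaming on a
# non-streaming TTS model. `synthesize`/`play` are placeholders.

import re

def first_chunk_split(text, max_chars=60):
    """Split off the first sentence (or first max_chars) as chunk one."""
    m = re.search(r"[.!?]\s", text[:max_chars + 1])
    cut = m.end() if m else min(max_chars, len(text))
    return text[:cut].strip(), text[cut:].strip()

def speak(text, synthesize, play):
    head, tail = first_chunk_split(text)
    play(synthesize(head))          # time-to-first-audio ~ one short sentence
    if tail:
        play(synthesize(tail))      # renders while the head is playing

head, tail = first_chunk_split("Hi there. Here is a much longer answer that keeps going.")
print(head)  # "Hi there."
print(tail)  # the remainder, synthesized while the head plays
```

A real pipeline would run the second `synthesize` on a background thread or task queue so it overlaps with playback, but the latency win comes from the split itself.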


What screenreaders are you using to test the models with?


What's your experience at high speeds, with garbled speech artifacts and pronunciation accuracy?


With supertonic, or overall? If overall, most do pretty well, though some are funky. Suprano was so bad no matter what I did that I had to rule it out of my top contenders for anything. Supertonic was close to my number-one choice for my agentic pipeline, as it was so insanely fast and the quality was great, but it didn't have the other bells and whistles like some other models, so I held it off for CPU-only projects in the future. If you are going to use it on a GPU, I would suggest Chatterbox or Pocket TTS. Chatterbox is my top contender as of now because it sounds amazing, has cloning, and I got it down to 0.26 TTFA/TTSA once I quantized it and implemented pipecat into it. Pocket TTS is probably my second choice for similar reasons.


>Also I think he is implementing his models wrong.

This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive, and anything that doesn't publish how it was fully implemented is suspect. That goes for both affirmative and negative findings.

It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas occurring, but quite often the implementation and testing of those ideas left much to be desired.


Minor nitpick, but you mean "tts" not "stt" both times.

Is supertonic the best sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?


Yes, sorry, I mixed those up. Supertonic is not the best-sounding in my tests. It was by far the fastest, and its audio quality for something so fast was decent. If you want something that sounds better AND is also extremely fast, Pocket TTS is the choice: amazing quality and also crazy fast on both GPU and CPU. If you care mainly about quality, Chatterbox in my tests was the best fit, but it's slower than the others. Qwen3 TTS was also great, but it's unusable as a real-time agentic voice, as it's too slow. They haven't released the code for streaming yet; once they release that, it will be my top contender.


Thanks!


Just found this video ... it looks to sound and work -very- well. (RasPI & Onyx)

https://www.youtube.com/watch?v=bZ3I76-oJsc


Are you using them at 1000 wpm?


Supertonic is probably way faster than that; I wouldn't be surprised if, measured, it came out to something like 14k wpm. On my 4090 I was getting about 175x real time, while on CPU only it was 55x real time. I stopped optimizing it, but I'm sure it could be pushed further. Anyway, you should check out their repo and test it yourself; it's crazy what that team accomplished!


Audio synthesis speed is one thing, but is the output _intelligible to a human_ at 1,000wpm? That's the sort of thing Eloquence is being used for, according to the article.


TTS has no intelligence, bud. It's only something that transforms text to audio, and that is all we are talking about here. Neither the article nor anyone else was discussing the whole stt > llm > tts pipeline.



Did you even read the article, bud?


It's an issue caused by many factors, mostly related to the way our large-scale societies are structured and run, but I believe it will be solved very soon... by AI. First, a disclaimer: I am not advocating one way or the other for this, just spelling out what I see on the horizon. Very soon AI systems will become a lot more sophisticated than your average chatbot. We will interact with them naturally through voice, and they will become more capable of expressing the various nuances of human speech, conversation cadence, etc... This is where humans will find solace. In fact, I believe AI will be a human's best friend, lover, parent, child, etc., as technology progresses and these things get embodied and so on. This year alone I expect the start of mass adoption of voice agents. But yeah, that's the way I see things playing out. If I am right and things go this way, and you are interacting with these things, the smart move is to make sure you own the full stack 100% and not use the API-related nonsense that will eventually brainwash you for this or that reason. If you are going to dig a hole, at least dig one that doesn't have the obvious traps in it.


Thanks for the heads up; this looks really interesting, and the claimed speed is nuts.


This is perfect for me. I just started working on the voice-related stuff for my agent framework, and this will be of real use. Thanks.

