Hacker News: nshm's comments

> Running a family was a brutal two-person job -- and the kids had to dive in to help out the second they could lift something heavier than a couple pounds.

Orphans did struggle, but most families were not just two people; families were big and supported by the community.


You can check the whale sound recognition project: https://arxiv.org/abs/2104.08614


And moreover, you cannot tune those models for practical applications. The model is originally trained on very clean data, so the lower layers are also not very stable for diverse inputs. To fine-tune, you have to update the whole model, not just the upper layers.
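For contrast, here is a minimal PyTorch sketch of what "upper layers only" fine-tuning looks like; the two-block model is a made-up stand-in for a pretrained encoder, not any real architecture:

```python
import torch
from torch import nn

# Hypothetical two-block model: a "lower" pretrained feature extractor
# and an "upper" task head. Shapes are arbitrary.
model = nn.Sequential(
    nn.Linear(16, 16),  # lower layers (pretrained)
    nn.Linear(16, 4),   # upper layers (task head)
)

# Upper-layers-only fine-tuning: freeze the lower block so the
# optimizer never updates its parameters.
for p in model[0].parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # only the head's weight and bias remain trainable
```

Full fine-tuning is the same setup without the freezing loop; the point of the comment is that when the lower layers are unstable on noisy input, only the full version helps.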


In section 5.7.5, they fine-tune for "11 low-resource languages, with between 5-10 hours of training data and at least 1 hour of validation splits." "CTC fine-tuning takes ≈1 hour of walltime on 32 GPUs for the 300M scale." If that's too expensive, you also have the option of supplying additional context for the LLM-based model (section 5.5).

As for "very clean data," see section 5.7.4: "Omnilingual + OMSF ASR was intentionally curated to represent naturalistic (i.e., often noisy) audio conditions, diverse speaker identities, and spontaneous, expressive speech."


This model is actually expected to be bad for popular languages. Just like the previous MMS, it is not accurate at all; it wins by supporting something rare well, but it never had good ASR accuracy even for Swedish etc. It is more a research thing than a real tool, unlike Whisper.


It is actually useless. It is very slow, the quality is suboptimal, and it is just the speech generation component. See the discussion here:

https://github.com/SesameAILabs/csm/issues/80


No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than an MLM. The original post discusses this, but with more words and points than necessary.


GPTs also get gradients from all tokens, while BERT only gets them from the 15% of masked tokens. GPT training is more effective per sequence.
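A back-of-the-envelope sketch of that gradient coverage; the 15% mask rate is BERT's standard setting, and the sequence length is arbitrary:

```python
# Toy illustration (not a real trainer): count how many token positions
# contribute a loss/gradient signal per training sequence.

def supervised_positions_clm(seq_len: int) -> int:
    # Causal LM (GPT-style): every token except the first is a target.
    return seq_len - 1

def supervised_positions_mlm(seq_len: int, mask_rate: float = 0.15) -> int:
    # Masked LM (BERT-style): only the masked subset is a target.
    return int(seq_len * mask_rate)

n = 512
print(supervised_positions_clm(n))  # 511 supervised positions
print(supervised_positions_mlm(n))  # 76 supervised positions
```

So per 512-token sequence, the causal objective gets roughly 6-7x more supervised positions than the masked one.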


Call it CLM vs. MLM, not LLM vs. MLM. Soon LMLMs will exist, which will be LLMs too...


It is actually pretty straightforward why those models "reason" or, to be more exact, can operate on complex concepts. By processing huge amounts of text, they build an internal representation where those concepts are represented as simple nodes (neurons or groups). So they really do distill knowledge. Alternatively, you can think of it as a very good principal component analysis that extracts many important aspects, or as a semantic graph built automatically.

Once knowledge is distilled you can build on top of it easily by merging concepts for example.

So no secret here.
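To make the PCA analogy concrete, here is a toy sketch on a made-up context-by-word co-occurrence matrix (the numbers and word groupings are invented purely for illustration):

```python
import numpy as np

# Rows are contexts, columns are words, entries are co-occurrence counts.
# Columns 0-1 stand for "cat"/"dog", columns 2-3 for "car"/"truck".
counts = np.array([
    [4.0, 3.0, 0.0, 0.0],  # contexts about pets
    [5.0, 4.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],  # contexts about vehicles
    [0.0, 1.0, 3.0, 4.0],
])

centered = counts - counts.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # first principal direction over the word axis

# Words from the same "concept" load with the same sign on pc1,
# words from different concepts with opposite signs.
same_pet = pc1[0] * pc1[1] > 0       # "cat" vs "dog"
same_vehicle = pc1[2] * pc1[3] > 0   # "car" vs "truck"
opposite = pc1[0] * pc1[2] < 0       # "cat" vs "car"
print(same_pet, same_vehicle, opposite)
```

The single principal direction here plays the role of the "node" in the comment: one extracted axis that groups words appearing in similar contexts.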


Do they distill knowledge, or distill the relationships between words (that describe knowledge)?

I know it seems like dancing on the head of a pin, but ...


Well, the internal representation is tokens, not words, so... the pin is even smaller?

They distill relationships between tokens. Multiple tokens together make up a word, and multiple words together make up a label for something we recognize as a "concept".
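As a toy illustration of tokens vs. words (the vocabulary and the greedy longest-match rule here are made up; real BPE tokenizers are trained and work differently):

```python
def tokenize(word: str, vocab: set) -> list:
    # Greedy longest-prefix match against a tiny made-up subword vocab.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to single characters
            i += 1
    return pieces

print(tokenize("unbelievable", {"un", "believ", "able", "cat"}))
# → ['un', 'believ', 'able']
```

One word, three tokens: the model's relationships live at the subword level, and "words" and "concepts" are layered on top of that.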

These "concepts" are not just a label, though: they are a region of the latent space inside the neural network which happens to contain those words in sequence (along with other labels that mean similar things).

A simple demonstration of this is how easily multi-modal neural networks build cross-modal representations of the same thing: "cats" end up in the same place in both image and word form, but more complex concepts ("beautiful country fields with a foreboding thunderstorm forming") will also align well between words and images.


> Do they distill knowledge or distill the relationship between words (that describe knowledge)

Do we know that there's a difference between the two? Maybe this distinction is just a god of the gaps.


There is also a glitch in "dialogue"


Is anyone else thinking he doesn't look very healthy? It's strange that he is kind of slow in the video where he enters the room. Maybe some biohacking.


Err, I deeply respect the Amazon TTS team, but this paper and synthesis are... You publish a paper in 2024 and include YourTTS in your baselines to look better. Come on! XTTS2 is around!

The voice sounds robotic and plain. Most likely there are a lot of audiobooks in the training data and less conversational speech. And dropping diffusion was not a great idea: the voice is not crystal clear anymore; it is more like a telephony recording.


xtts2 is great, but it looks like this model is probably more consistent in its output and has a better grasp of meaning in long texts.

