Then you don't understand Machine Learning in any real way. Literally the 3rd or 4th thing you learn about ML is that for any given problem, there is an ideal model size. Just making the model bigger doesn't work because of something called the curse of dimensionality. This is something we have discovered about every single problem and type of learning algorithm used in ML. For LLMs, we probably moved past the ideal model size about 18 months ago. From the POV of someone who actually learned ML in school (from the person who coined the term), I see no real reason to think that AGI will happen based on the current techniques. Maybe someday. Probably not anytime soon.
PS The first thing you learn about ML is to compare your models to random to make sure the model didn't degenerate during training.
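That sanity check is easy to sketch. Here is a minimal, illustrative version assuming a classifier exposed as a `predict` callable and a labeled test set; `baseline_check` and its parameters are my own names, not a standard API:

```python
def baseline_check(predict, X_test, y_test, n_classes, margin=0.05):
    """Compare a trained model's accuracy against the random-guess baseline.

    A model whose accuracy sits at or below 1/n_classes (plus a small
    margin) has likely degenerated during training.
    """
    correct = sum(predict(x) == y for x, y in zip(X_test, y_test))
    accuracy = correct / len(y_test)
    random_baseline = 1.0 / n_classes
    return accuracy, accuracy > random_baseline + margin

# Illustrative usage: a degenerate model that always predicts class 0,
# evaluated on a balanced 4-class test set.
X = list(range(100))
y = [i % 4 for i in X]
acc, ok = baseline_check(lambda x: 0, X, y, n_classes=4)
print(acc, ok)  # 0.25 False -- no better than random guessing
```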
Doesn't sound like you paid all that much attention when learning ML. The curse of dimensionality doesn't say that every problem has an ideal model size; it says that the amount of data needed to train scales with the size of the feature space.
So if you take an LLM, you can make the network much larger, but if you don't increase the size of the input token vocabulary you aren't even subject to the curse of dimensionality.
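A quick back-of-the-envelope illustration of that scaling, under a toy assumption that you want a fixed grid resolution per input axis (the numbers are purely illustrative):

```python
def cells_to_cover(dim, bins_per_axis=10):
    """Grid cells needed to cover [0,1]^dim at a fixed per-axis resolution.

    To keep the same density of training examples per cell, the amount of
    data must grow like bins_per_axis ** dim -- exponential in the number
    of input features, not in the number of model parameters.
    """
    return bins_per_axis ** dim

for d in (1, 2, 3, 10):
    print(d, cells_to_cover(d))
# dim 1 needs 10 cells, dim 3 needs 1000, dim 10 already needs 10 billion
```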
Beyond that, there's a principle in ML theory that larger models are almost always better: the number of params in the model is the dimensionality of the space in which you're running gradient descent, and with every added dimension, local optima become rarer.
> Literally the 3rd or 4th thing you learn about ML is that for any given problem, there is an ideal model size.
From my understanding, this is now outdated. The deep double descent research showed that although performance drops past a certain point as you increase model size, if you keep increasing it there is another threshold where it paradoxically starts improving again. From that point onwards, increasing the parameter count only further improves performance.
That isn't what that research says at all. What that research says is that running the same training data through the model multiple times improves training. There is still an ideal model size, though; it is just impacted by the total volume of training data.
https://arxiv.org/pdf/1912.02292
"We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better."
That is the first sentence of the abstract. The first graph shown in the paper backs it up.
Looking into it further, it seems that typical LLMs are in the first-descent regime anyway, so my original point isn't too relevant for them. It also looks like the second descent doesn't always reach a lower loss than the first; it appears to depend on other factors as well.
Um, what? Are you interpreting scaling to mean adding parameters and nothing else?
I'm not entirely sure where you get your confidence that we've passed the ideal model size, but at least that's a clear prediction, so you should be able to tell if and when you are proven wrong.
Just for the record, do you care to put an actual number on something we won't go past?
[edit]
Vibe check on user comes out as
Contrarian 45%
Pedantic 35%
Skeptical 15%
Direct 5%
>When I last checked, of over 10k posts, it only uses a few dozen to calculate that score, so it is about as reliable as dowsing.
A few samples are sufficient when the signal is strong enough. The time-spent pie chart definitely reflects more of what the user has been doing recently.
Overall, not everybody comes out the same. Pedantry is strong, which I'm not really surprised about for a forum like this, but some users have personality traits of sufficient magnitude that you can guess what the result will be.
I ran the same check on the last 10 users who posted comments on HN.
Obviously this won't be a representative sample of HN because it will vary by time of day and topics under discussion. It's sufficient to show that the community is not entirely homogeneous.