No: soon the wide wild world itself becomes training data. And for much more than just an LLM. LLM plus reinforcement learning: this is where the capacity of our in silico children will engender much parental anxiety.
However, I think the most cost-effective way to train for the real world is to train in a simulated physical world first. I would assume that Boston Dynamics does exactly that, and I would expect integrated vision-action-language models to be trained that way first too.
That's how everyone in robotics is doing it these days.
You take a bunch of mo-cap data and simulate it with your robot body. Then you do as much testing as you can with the real robot and feed the behavior back into the model for fine-tuning.
Unitree gives an example of the simulation versus what the robot can do in their latest video.
It is a limiting factor, due to diminishing returns. A model trained on double the data will be 10% better, if that!
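To put rough numbers on that, here is a back-of-the-envelope sketch. It takes the "10% better per doubling" figure at face value and assumes the gains compound multiplicatively; both are assumptions for illustration, not a measured scaling law, and the function names are made up for this example.

```python
import math

def doublings_needed(target_gain, gain_per_doubling=0.10):
    """Number of data doublings needed to reach `target_gain` (e.g. 10 for "10x better"),
    ASSUMING each doubling multiplies quality by (1 + gain_per_doubling)."""
    return math.log(target_gain) / math.log(1.0 + gain_per_doubling)

def data_multiplier(target_gain, gain_per_doubling=0.10):
    """How many times more data that many doublings corresponds to."""
    return 2.0 ** doublings_needed(target_gain, gain_per_doubling)

if __name__ == "__main__":
    for target in (2, 10, 100):
        print(
            f"{target}x better: "
            f"{doublings_needed(target):.1f} doublings, "
            f"about {data_multiplier(target):.3g}x the data"
        )
```

Under those assumptions, a 10x improvement takes a bit over 24 doublings, which is more than ten million times the original data. The exact figure matters much less than the shape of the curve.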
When it comes to multi-modality, training data is not limited, because there are so many different combinations of language, images, video, sound, etc. Microsoft did some research on that, teaching spatial recognition to an LLM using synthetic images, with good results. [1]
When someone states that there is not enough training data, they usually mean code, mathematics, physics, logical reasoning, etc. On the open internet right now, there is not enough code to make a model 10x better, 100x better, and so on.
Synthetic data will be produced, of course; scarcity of data is the least worrying scarcity of all.
> video generation also seemed kind of stagnant before Sora
I take the opposite view. I don't think video generation was stagnating at all; in fact, it was probably the area of generative AI seeing the biggest active strides. I'm highly optimistic about the future trajectory of image and video models.
By contrast, text generation has not improved significantly, in my opinion, for more than a year now, and even the improvement we saw back then was relatively marginal compared to GPT-3.5 (that is, for most day-to-day use cases we didn't really go from "this model can't do this task" to "this model can now do this task". It was more just "this model does these pre-existing tasks, in somewhat more detail".)
If OpenAI really is secretly cooking up some huge reasoning improvements for their text models, I'll eat my hat. But for now I'm skeptical.
> By contrast, text generation has not improved significantly, in my opinion, for more than a year now
With less than $800 worth of hardware, including everything but the monitor, you can run an open-weight model more powerful than GPT-3.5 locally, at around 6-7 tokens/s [0]. I would say that is a huge improvement.
It doesn't seem that way to me. But even if it did, video generation also seemed kind of stagnant before Sora.
In general, I think The Bitter Lesson is the biggest factor at play here, and compute power is not stagnating.