There's of course also the issue that an increasing fraction of web content reading is being done by AI agents. I wonder what the Pareto front here is.
No one has successfully rebutted that paper about model collapse, the degradation that happens when models train on their own output over successive generations. It's just a matter of time before we find out whether it was right.
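For what it's worth, the mechanism that paper describes can be sketched in its simplest form without any LLM at all: fit a Gaussian, sample from the fit, refit on those samples, repeat. This is only a toy analogy (the sample size and generation count below are arbitrary choices, not anything from the paper), but it shows how estimation noise compounds until the learned distribution degenerates:

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse(generations=2000, n_samples=100):
    """Repeatedly refit a Gaussian on samples drawn from the previous fit."""
    mu, sigma = 0.0, 1.0           # generation 0: the "real" data distribution
    variances = [sigma ** 2]
    for _ in range(generations):
        data = rng.normal(mu, sigma, n_samples)  # sample from the current model
        mu, sigma = data.mean(), data.std()      # refit (MLE) on its own output
        variances.append(sigma ** 2)
    return variances

vs = collapse()
print(f"variance at gen 0: {vs[0]:.3f}, at gen 2000: {vs[-1]:.2e}")
```

With a finite sample at every step, the fitted variance follows a multiplicative random walk with downward drift, so the distribution narrows toward a point over generations. Whether anything like this survives contact with real training pipelines (fresh human data mixed in, filtering, RLHF) is exactly what the thread is arguing about.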
There are dozens, if not hundreds, of papers (from major research labs) showing that training LLMs on synthetic data actually improves their performance. For instance, much of the RLHF done by MS/Facebook likely used data generated by an LLM. DeepSeek has also had similar accusations thrown their way.
I believe the paper you're referencing was narrowly discussing text-to-image models and didn't incorporate prompt engineering or good old-fashioned search to improve the quality of synthetic data.
it's been a while though, so i could be wrong. effectively i'm saying it's not quite that simple, and it isn't necessarily some unsolvable doomsday clock for all LLMs.
This is going to reshape large portions of our text-based communication networks.