Analysis shows that indiscriminately training generative artificial intelligence on real and generated content, usually done by scraping data from the Internet, can lead to a collapse in the ability of the models to generate diverse high-quality output.
Holy shit are you telling me…
Garbage In…
= Garbage Out?
No, that can’t be it, throw billions and billions of dollars at this instead of, I don’t know, housing the homeless.
You realize that those “billions of dollars” have actually resulted in a solution to this? “Model collapse” has been known about for a long time and further research figured out how to avoid it. Modern LLMs actually turn out better when they’re trained on well-crafted and well-curated synthetic data.
Honestly, everyone seems to assume that machine learning researchers are simpletons who’ve never used a photocopier before.
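For anyone curious what "model collapse" actually looks like statistically, here's a minimal toy sketch (mine, not from the article above): fit a Gaussian to samples, then repeatedly generate new samples from the fitted model and refit on those generated samples alone. Each refit is a noisy, slightly biased estimate, so the variance drifts toward zero over generations and the model loses diversity — exactly the failure mode that curating real and synthetic data is meant to avoid. The function name and parameters here are illustrative, not from any real training pipeline.

```python
import numpy as np

def recursive_self_training(n_samples=50, n_generations=2000, seed=0):
    """Toy model collapse: repeatedly fit a Gaussian to samples drawn
    from the PREVIOUS generation's fitted Gaussian (no real data mixed
    back in). The fitted standard deviation shrinks toward zero."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real data" distribution
    sigmas = [sigma]
    for _ in range(n_generations):
        # generate purely synthetic data from the current model
        synthetic = rng.normal(mu, sigma, size=n_samples)
        # refit on synthetic data only (MLE of mean and std)
        mu, sigma = synthetic.mean(), synthetic.std()
        sigmas.append(sigma)
    return sigmas

sigmas = recursive_self_training()
print(f"generation    0: sigma = {sigmas[0]:.4f}")
print(f"generation 2000: sigma = {sigmas[-1]:.6f}")
```

Mixing a fixed fraction of real data back into each generation's training set (or curating the synthetic samples) breaks this feedback loop, which is the gist of the "solution" the reply above is pointing at.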