This is hardly surprising. It’s immediately noticeable in images, but we’ll have to be very careful with other forms of output, as the decline could be subtle enough to go unnoticed at first. There’s a very real risk of poisoning our sources of data by allowing AI to write back to them without oversight. And given that the sources of training data seem to be sites like Reddit and Twitter, this is a real concern.
Not really. The problem is the output regressing to the mean, which progressively thins out the edges of the normal distribution curve.
Merely mixing AI-generated and human-generated data in what’s fed back would avoid this outcome. Arguably, as long as the AI output was generally rated better than the mean human output, recursive mixed training iterations would even improve over the original models.
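To make that intuition concrete, here’s a toy simulation (my own sketch, not from either paper; the `loop` function and its parameters are made up for illustration). Each “model generation” fits a Gaussian by maximum likelihood to samples drawn from the previous generation’s fit. Training purely on synthetic samples compounds the MLE’s slight variance underestimate and erodes the tails, while mixing in fresh “human” data each round anchors the distribution.

```python
import random
import statistics

def loop(fresh_fraction, generations=300, n=50, seed=0):
    """Toy recursive-training loop: each generation fits a Gaussian (MLE)
    to n samples, where `fresh_fraction` of them come fresh from the
    'human' distribution N(0, 1) and the rest are drawn from the
    previous generation's fitted model."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_fresh = int(n * fresh_fraction)
    for _ in range(generations):
        samples = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]        # fresh human data
        samples += [rng.gauss(mu, sigma) for _ in range(n - n_fresh)]  # synthetic data
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples, mu)  # MLE std dev (divides by n)
    return sigma

print(loop(fresh_fraction=0.0))  # pure synthetic: the spread collapses toward 0
print(loop(fresh_fraction=0.5))  # half fresh data: the spread stays near 1.0
```

The pure-synthetic loop shrinks the fitted standard deviation a little every round and the shrinkage compounds across generations; the mixed loop settles near the original spread because the fresh samples keep pulling the fit back toward the true distribution.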
The problem in both this and the Stanford paper was exclusively training on new AI-generated output over and over, which concentrated the distribution around median tokens or diffusion outputs and dropped the edges, until you ended up with overfitted output lacking discrimination.
The real takeaway here isn’t “oh noz, we can’t feed AI output back into AI training” but rather “humans in both generator and discriminator roles will be critical in future AI training.”
There’s been a troubling recent trend of binaryisms in the ML field as hype and attention have increased, and it’s important to be careful not to extrapolate a finding of narrow scope into an overly broad interpretation.
So yes, don’t go training recursively on only synthetic data over and over. But even something as simple as using human upvotes or downvotes on the generations to decide whether or not to feed them back in (i.e. human discriminator, AI generator) would largely avoid the outcomes here.
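A stylized version of that discriminator idea (again my own toy sketch, not from the paper; `curated_loop` and the envelope constant are invented for illustration): model an idealized “human curator” as rejection sampling, accepting a synthetic sample with probability proportional to how likely it is under the true data distribution relative to the model’s. With that filter in the loop, the fitted distribution no longer collapses.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def curated_loop(generations=300, n=50, seed=0):
    """Recursive loop where an idealized 'human discriminator' filters each
    synthetic sample by rejection sampling against the true distribution
    N(0, 1): accept x with probability proportional to
    p_true(x) / p_model(x). Accepted samples are distributed roughly like
    the true data, so the loop stays stable instead of collapsing."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        samples = []
        while len(samples) < n:
            x = rng.gauss(mu, sigma)
            ratio = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu, sigma)
            if rng.random() < min(1.0, ratio / 3.0):  # 3.0 is a rough envelope bound
                samples.append(x)
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples, mu)  # MLE std dev (divides by n)
    return sigma

print(curated_loop())  # the spread hovers near the original 1.0
```

This is obviously far more forgiving than real upvotes, but it illustrates the mechanism: a discriminator that preferentially keeps samples resembling the true distribution stops the mean-regression feedback even when every training sample is machine-generated.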
Which means that human selection of the ‘best’ output from several samples before initial sharing, and human rating of shared outputs before broader distribution, is already cleaning up AI generations online enough that fears of ‘poisoning’ the data, as suggested here and in the Stanford study, are almost certainly overblown.
Edit: Section 5 of the paper even addresses some of this:
One might suspect that a complementary perspective to the previous observation—that fresh new data mitigates the MAD generative process—is that synthetic data hurts a fresh data loop generative process. However, the truth appears to be more nuanced. What we find instead is that when we mix synthetic data trained on previous generations and fresh new data, there is a regime where modest amounts of synthetic data actually boost performance, but when synthetic data exceeds some critical threshold, the models suffer.