the actual license text part being questioned.
Data Information: Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system. Data Information shall be made available under OSI-approved terms.
In particular, this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected, the labeling procedures, and data processing and filtering methodologies; (2) a listing of all publicly available training data and where to obtain it; and (3) a listing of all training data obtainable from third parties and where to obtain it, including for fee.
(The rest of the license goes on to talk about weights, etc.)
I agree with you somewhat. I’m glad that each source does need to be listed and described. I’m less thrilled to see “unshareable” data and data that costs money in there, since I think these could effectively make a model impossible for a “skilled person” to retrain.
It’s a cheap way to make an AI license without making all the training data open source (while dodging the legalities of doing so).
Right. The other thing I considered is that you could just create a company, “buy” the data from it for a ridiculous amount of money, and then be under less of an obligation to detail the data. Similarly, you could deem the data unshareable and fudge the provenance.
Like locks, it will only keep honest people honest.