Wow, that’s a lot to digest given the many perspectives participants in this thread have already contributed. Contributing to a compendium of datasets is a good idea; however, whether a machine learning model ends up being trained on signal or on noise is not simply a matter of the volume or quality of the data available. I would propose it is more a problem of “missing feature values”.
The paper “Data Preprocessing For Supervised Learning”[1] illustrates this problem:
> In many applications learners are faced with imbalanced data sets, which can cause the learners to be biased towards one class. This bias is the result of one class being heavily under-represented in the training data compared to other classes… Classes containing few examples can be largely ignored by learning algorithms because the cost of performing well on the overrepresented class outweighs the cost of doing poorly on the smaller class. Another factor contributing to bias is overfitting. Overfitting occurs when a learning algorithm creates a hypothesis that performs well over the training data but not over unseen data. This can occur on an underrepresented class because the learning algorithm creates a hypothesis that can easily fit a small number of examples, but fits them too specifically.
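To make the quoted failure mode concrete, here is a minimal sketch (a synthetic scikit-learn data set stands in for real data, so every name and number below is illustrative, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# A 95:5 class ratio to mimic a heavily under-represented class.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

# Accuracy looks strong because the majority class dominates the metric,
# while recall on the minority class exposes the bias the paper describes.
print("accuracy:       ", accuracy_score(y_test, pred))
print("minority recall:", recall_score(y_test, pred, pos_label=1))
```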
In real-world application scenarios, data sets are rarely balanced enough to give any machine learning model the correct signals for a “proper” fit. That is why thorough data mining for feature engineering is often more important than the fitting of the machine learning model itself. “Data Preprocessing For Supervised Learning” also lists several conventional preprocessing techniques for tackling this issue, one of which is sketched below. In the context of data mining techniques, it is also important to note that, without a rigorous empirical grounding, data mining can often be equivalent to “data snooping”[2]: for example, an amateur practitioner who cannot differentiate between spurious and explainable results may use specific data points for inference or model selection without any prior knowledge of what those data points actually signify.
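As for the conventional techniques the paper lists, here is a minimal sketch of one of them, random oversampling of the minority class; it continues the synthetic example above, and the choice of sampler and estimator is my assumption rather than the paper’s prescription:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.utils import resample

# X_train, y_train, X_test, y_test come from the previous sketch.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]

# Resample the minority class with replacement until the classes match.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
X_balanced = np.vstack([majority, minority_upsampled])
y_balanced = np.concatenate(
    [np.zeros(len(majority)), np.ones(len(minority_upsampled))]
)

clf_balanced = LogisticRegression(max_iter=1000).fit(X_balanced, y_balanced)
# Minority recall typically improves, usually at some cost in overall accuracy.
print("minority recall:", recall_score(y_test, clf_balanced.predict(X_test), pos_label=1))
```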
“Data snooping” is widely acknowledged to be a dangerous practice because it alters the asymptotic distribution of test statistics under the null hypothesis in statistical studies[2]. However, since any end performance metric of a model can only be determined by the incumbent “testing data set” and the feature labels represented in it, I do not think labeling someone’s contributed data as “good” or “bad” should be the subject of discussion.
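To show how easily a seemingly honest “train test split” can be undermined, here is a minimal sketch of data snooping via feature selection: the features are pure noise, yet selecting them on the full data set (test rows included) before splitting still yields impressive-looking test scores. All names and sizes here are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))  # pure noise features
y = rng.integers(0, 2, size=200)  # random labels: there is nothing to learn

# Snooped: the feature selector has already seen the test rows.
X_snooped = SelectKBest(f_classif, k=20).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_snooped, y, random_state=0)
print("snooped:", LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte))

# Honest: selection is fit on the training split only, inside a pipeline.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
honest = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
print("honest: ", honest.fit(Xtr, ytr).score(Xte, yte))
```

The snooped score sits well above chance on data with no signal at all, which is exactly the kind of spurious result an amateur practitioner can mistake for a finding.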
Andrew Ng has recently started to promote the idea of smart-sized, “data-centric” AI rather than the conventional “big data” approach; here’s a related article from spectrum.ieee.org: “Andrew Ng: Unbiggen AI - IEEE Spectrum”. In light of this, the implication may simply be that more empirical studies need to be performed by researchers, rather than having the lot of commercial practitioners throw a “train test split” at an algorithm and wonder why the model is not outputting “appropriate” estimates. I would also suggest that these statistical considerations be integrated into the heuristic aggregations of this framework’s smart contract.
[1] Kotsiantis, S. B., et al. “Data Preprocessing for Supervised Learning.” CiteSeerX, 2021, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.8413&rep=rep1&type=pdf.
[2] White, Halbert. “A Reality Check for Data Snooping.” Econometrica, vol. 68, no. 5, 2000, pp. 1097–1126. JSTOR.