Research Summary: Leveraging Blockchain for Greater Accessibility of Machine Learning Models

It’s really great to see the insightful discussion here. It seems like most questions have been answered well or at least addressed, but I’m happy to clarify more. We also have an FAQ here: GitHub - microsoft/0xDeCA10B: Sharing Updatable Models (SUM) on Blockchain.

A key assumption in the system is that users should monitor a proxy for the model’s accuracy; hence the plot showing accuracy over time: 0xDeCA10B/simulation at main · microsoft/0xDeCA10B · GitHub. It’s crucial for users to track the model’s performance before contributing to it, just as you would check previous trade prices, volume, and other metrics for a stock before buying it. And just as stocks have many sites and analysts, we envision that shared models can have many competing monitoring services to let users know if the model is corrupt and what type of data it works best with.
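To make that monitoring idea concrete, here is a minimal sketch of how a user or a third-party monitoring service might track a rolling accuracy proxy and decide when it looks safe to contribute. This is not the actual 0xDeCA10B API; the `model.predict` interface, the trusted hold-out set, and the threshold are all assumptions for illustration.

```python
# A rough sketch (not the 0xDeCA10B API): track a rolling accuracy proxy for a
# shared model against a trusted hold-out set before contributing data to it.
from collections import deque

class AccuracyMonitor:
    def __init__(self, window=10, threshold=0.85):
        self.history = deque(maxlen=window)  # most recent accuracy measurements
        self.threshold = threshold           # minimum acceptable accuracy

    def record(self, model, x_holdout, y_holdout):
        # `model.predict` is an assumed interface; any inference call works here.
        predictions = model.predict(x_holdout)
        accuracy = sum(p == y for p, y in zip(predictions, y_holdout)) / len(y_holdout)
        self.history.append(accuracy)
        return accuracy

    def safe_to_contribute(self):
        # Require a full window of measurements, all above the threshold.
        return len(self.history) == self.history.maxlen and min(self.history) >= self.threshold
```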

We want to emphasize that the research is still exploratory. We proposed a framework, a baseline, that we hope others will expand. We showed that it’s possible to easily share updatable models in a decentralized way; whether you should, and how, depends on each specific scenario. The simulation tools should help you determine whether your model is ready to be decentralized.

2 Likes

Thank you so kindly for taking the time to respond to the thread! In an attempt to get more discussion going about this topic, I took the liberty of responding to the questions that had been posed, not knowing that you would join the thread. Please let me know if I did not represent anything accurately or if there is anything that could be better explained, as I will defer to you as the author.

3 Likes

I saw some of your answers and I think you did a great job clarifying and referring to ideas from the original paper. Thanks!

2 Likes

Thank you both, Larry and Justin, for providing such elaborate explanations for each query posted.
Just to summarize, I believe we have multiple questions posted on “good”, “bad”, and the rather notorious “ambiguous” data. Perhaps sentiment analysis doesn’t suffer much from ambiguity (although I would disagree), but an image classification problem pertaining to a specific scientific domain, such as classification of different kinds of steels, might (my research domain, which keeps me up at night 🙂). Since research work is all about finding and explaining edge/ambiguous cases, maybe we don’t want to classify them as “bad”. I would love to hear your thoughts on what you would add to the existing model (a qualitative description would suffice). I’m asking because I want to pursue this in the next quarter, and any comments would greatly help. I’m open to collaborating as well. Thanks again, Amit

5 Likes

Wow, that’s a lot to digest with the many perspectives participants in this thread have already contributed. Contributing to a compendium of datasets is a good idea; however, whether a machine learning model ends up being trained on more signal than noise is not simply an issue of the volume or quality of the data available. I would propose that it is more a problem of “missing feature values”.

The paper “Data Preprocessing For Supervised Learning”[1] illustrates this problem:

In many applications learners are faced with imbalanced data sets, which can cause the learners to be biased towards one class. This bias is the result of one class being heavily under-represented in the training data compared to other classes… Classes containing few examples can be largely ignored by learning algorithms because the cost of performing well on the overrepresented class outweighs the cost of doing poorly on the smaller class. Another factor contributing to bias is overfitting. Overfitting occurs when a learning algorithm creates a hypothesis that fits the training data too closely and performs poorly over unseen data. This can occur on an underrepresented class because the learning algorithm creates a hypothesis that can easily fit a small number of examples, but fits them too specifically.

In real-world application scenarios, data sets are rarely balanced in a way that gives the correct signals for a machine learning model to pick up on for a “proper” fit. That is why thorough data mining for feature engineering is oftentimes more important than the fitting of the machine learning model itself. “Data Preprocessing For Supervised Learning” also lists several conventional data preprocessing techniques to tackle this issue. In the context of data mining techniques, it is also important to note that without a rigorous empirical background, data mining can often amount to “data snooping”[2]: for example, an amateur practitioner who cannot differentiate between spurious and explainable results may use specific data points for inference or model selection without any prior knowledge of what those data points actually signify.
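As a concrete illustration of the kind of preprocessing surveyed in [1], here is a small sketch of two conventional ways to counteract class imbalance before fitting. It assumes scikit-learn and a toy data set, neither of which comes from the discussion above.

```python
# A minimal sketch, assuming scikit-learn: two conventional counters to class imbalance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)
# Toy imbalanced data set: 950 negatives, 50 positives (illustrative only).
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

# Option 1: re-weight classes so mistakes on the rare class cost more.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
clf = LogisticRegression(class_weight={0: weights[0], 1: weights[1]}).fit(X, y)

# Option 2: randomly oversample the minority class to rebalance the training set.
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=900, replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
```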

“Data snooping” is widely acknowledged to be a dangerous practice because it alters the asymptotic distribution of test statistics under the null hypothesis in statistical studies [2]. However, since any end performance metric of a model can only be determined by the incumbent “testing data set” and the feature labels represented in it, I do not think labeling whether someone has contributed “good” or “bad” data should be the subject of discussion.
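To keep the snooping point concrete, here is a minimal sketch of the discipline that avoids it: every preprocessing and model-selection choice is made inside cross-validation on the training split, and the held-out test split is scored exactly once. The data set and hyper-parameter grid are arbitrary assumptions, not anything taken from the framework itself.

```python
# A minimal sketch, assuming scikit-learn: keep the test split out of all
# model-selection decisions, then report the final score once.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# All scaling and hyper-parameter choices happen inside cross-validation
# on the training split only.
pipeline = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipeline, {"svc__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# The held-out test set is touched exactly once, for the final report.
print("test accuracy:", search.score(X_test, y_test))
```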

Andrew Ng has recently started to promote the idea of smart-sized, “data-centric” AI rather than the conventional “big data” approach; here’s a related article from spectrum.ieee.org: Andrew Ng: Unbiggen AI - IEEE Spectrum. In light of this, the implication may be that more empirical studies need to be performed by researchers, rather than having commercial practitioners throw a “train test split” at an algorithm and wonder why the model is not outputting “appropriate” estimates. I would also suggest that these statistical considerations be integrated into the heuristic aggregation logic of this framework’s smart contracts.

[1] Kotsiantis, S. B., et al. “Data Preprocessing for Supervised Learning.” CiteSeerX, accessed 2021, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.8413&rep=rep1&type=pdf.

[2] White, Halbert. “A Reality Check for Data Snooping.” Econometrica, vol. 68, no. 5, 2000, pp. 1097–1126. JSTOR.

5 Likes