Research Summary: Leveraging Blockchain for Greater Accessibility of Machine Learning Models

Thanks for trying to answer my questions. I appreciate that you took it seriously. Your discussion with xiaotoshi is also insightful. Here’s my follow-up:


It is claimed that when the training set is not public, an attack is not possible. That is not obvious to me: to my knowledge, access to the training data is not strictly necessary for a successful attack.

Such black-box attacks have been demonstrated since 2016.

The authors successfully attacked models hosted by Amazon and Google (without knowing what they were trained on), demonstrating how vulnerable seemingly powerful models can be.
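To make this concrete, here is a minimal sketch of one way such black-box attacks can work: label attacker-chosen inputs by querying the remote model, train a local substitute on those answers, then craft adversarial examples against the substitute and check that they transfer. Everything here is a synthetic stand-in; `query_target` plays the role of a hosted prediction API we cannot inspect, and no training data is ever accessed.

```python
# Sketch of a substitute-model attack: no access to the target's training set,
# only its predictions. All data here is synthetic for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Pretend this is a remote, opaque service: we only ever see its label outputs.
X_secret, y_secret = make_classification(n_samples=2000, n_features=10, random_state=0)
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
target.fit(X_secret, y_secret)

def query_target(x):
    """Black-box oracle: returns labels only, like a prediction API."""
    return target.predict(x)

# 1. Label attacker-chosen synthetic inputs by querying the oracle.
X_probe = rng.normal(size=(500, 10))
y_probe = query_target(X_probe)

# 2. Train a local substitute on the oracle's answers.
substitute = LogisticRegression(max_iter=1000).fit(X_probe, y_probe)

# 3. Perturb inputs against the substitute (an FGSM-style step for a linear
#    model: move each feature against the decision boundary) and see how
#    often the perturbation transfers to the real target.
w = substitute.coef_[0]
x = X_probe[:100]
x_adv = x - 0.5 * np.sign(w) * np.sign(substitute.decision_function(x))[:, None]
flipped = (query_target(x) != query_target(x_adv)).mean()
print(f"target labels flipped on {flipped:.0%} of perturbed inputs")
```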

As for the notion that an attacker wouldn’t have enough funds, this is also addressed by another paper:

Our results are alarming: even on the state-of-the-art systems trained with massive parallel data (tens of millions), the attacks are still successful (over 50% success rate) under surprisingly low poisoning budgets (e.g., 0.006%).

This shows that data poisoning is achievable while controlling only a tiny fraction of the whole data pool.
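The back-of-the-envelope arithmetic makes the quoted budget concrete (the 20-million figure below is an assumed stand-in for the paper’s “tens of millions”):

```python
# 0.006% poisoning budget over an assumed corpus of 20 million parallel sentences.
corpus_size = 20_000_000          # assumed corpus size ("tens of millions")
budget = 0.006 / 100              # 0.006% budget from the quote
poisoned = int(corpus_size * budget)
print(poisoned)                   # -> 1200 poisoned examples suffice
```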

Worse, this protocol not only makes attacks easier by letting anyone deposit data; it also provides a strong incentive to trick the model, since attackers can profit not just from the normal reward rate but also from the fees of other contributors.
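A hedged sketch of that incentive, with entirely hypothetical numbers (none of these parameters come from the protocol itself): if an accepted submission earns the normal reward plus a share of other contributors’ forfeited fees, poisoning can pay for itself in expectation.

```python
# All parameters below are made up purely for illustration.
deposit_per_item = 1.0    # hypothetical stake per submission
reward_rate = 0.05        # hypothetical normal reward per accepted item
claimed_fees = 3.0        # hypothetical fees captured from honest contributors
accept_prob = 0.5         # chance the model deems a poisoned item "correct"

expected_gain = accept_prob * (reward_rate + claimed_fees)
expected_loss = (1 - accept_prob) * deposit_per_item
print(expected_gain - expected_loss)  # > 0: poisoning profitable in expectation
```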

Adversarial machine learning attacks are not the only threat in this protocol.

Another problem emerges from the risk of over-fitting.

For those reading this comment who don’t know what over-fitting is in the AI/ML/DL context, here’s an analogy to put it plainly:

You let the students (the AI model) practice on the same test bank (the dataset) over and over again (train for many rounds).

Humans gradually improve as they practice more, but machines behave a little differently: a model trained too long on the same dataset starts to draw the wrong inferences, memorizing the examples instead of learning the underlying pattern.
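Here is a minimal, self-contained illustration of the effect. The data is synthetic, and polynomial degree stands in for how aggressively the model memorizes: as flexibility grows, error on the practice set falls toward zero while error on unseen data blows up.

```python
# Over-fitting in miniature: fit noisy samples of sin(3x) with polynomials.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 15)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, 15)   # noisy "test bank"
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test)                               # unseen questions

for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```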

Take this example: when the machine is asked what color the flower is (and shown a picture of a sunflower), it can correctly answer “yellow”.

Although the answer looks correct on the surface, explainable-AI analysis found that the machine merely looked for the word “flower” and answered “yellow”, without taking the rest of the question (or the image) into account.

The model mislearns when it is trained on similar data over and over.
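A toy, text-only reconstruction of that “flower ⇒ yellow” shortcut (the data is made up, and real visual question answering systems also take the image; this sketch just shows the mechanism): because every flower question in the tiny training set happens to have the answer “yellow”, a bag-of-words classifier keys on the token “flower” and confidently answers “yellow” even for a red rose.

```python
# Shortcut learning in miniature: the classifier latches onto "flower".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_questions = [
    "what color is this flower",   # sunflower photo -> yellow
    "what color is that flower",   # dandelion photo -> yellow
    "what color is the car",       # taxi photo      -> yellow
    "what color is the sky",       #                 -> blue
]
train_answers = ["yellow", "yellow", "yellow", "blue"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_questions, train_answers)

# A red rose is still a "flower", so the shortcut fires:
print(model.predict(["what color is this red flower"]))  # -> ['yellow']
```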

So if the reward is based on whether the model deems the data correct, the rational move for someone seeking minimal cost and maximum return is to submit the same type of data repeatedly with only small modifications, as sketched below.
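A sketch of that cheapest rational strategy, where `submit` is a hypothetical stand-in for the protocol’s deposit call: take one data point the model already accepts and resubmit slight variations of it.

```python
# Near-duplicate farming: one accepted example, resubmitted with tiny noise.
import numpy as np

rng = np.random.default_rng(0)
base_point = np.array([0.7, 0.2, 0.9])   # an example the model deems "correct"

def submit(x):
    """Hypothetical protocol call: deposit data, collect reward if accepted."""
    print("submitted:", np.round(x, 3))

for _ in range(5):
    submit(base_point + rng.normal(0, 0.01, size=3))  # tiny perturbations
```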

That would be far from ideal. This is exactly why high-quality data is so important for performance, yet the protocol goes for quantity over quality.
