Training InstructGPT on a web3 stack: challenges and opportunities

This will be a multi-part series. In this first article, I will give an introduction to:

  • The GPT model
  • The advantages and challenges of training this category of models in the web3 ecosystem
  • The verticals of services that can be used to train these models


GPT models were first introduced by OpenAI in 2018 in the paper “Improving Language Understanding by Generative Pre-Training”, which discussed the challenge of applying large-scale supervised deep learning models to tasks where labeled data is a scarce resource (for instance, on-chain data about NFTs, smart contracts, etc.). The paper proposed two stages of training:

  • A generative pre-training stage (unsupervised), which trains the model on unlabeled text so that it learns the general characteristics of the language.
  • A discriminative fine-tuning stage (supervised), in which the parameters are adapted to a specific task.
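
The difference between the two stages can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: it assumes simple whitespace tokenization (real GPT models use subword tokenizers), and the function name `next_token_pairs`, the sample sentence, and the labeled examples are all made up for this post.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) training pairs from unlabeled text.

    Pre-training needs no human labels: the target for each position
    is simply the token that actually comes next in the text.
    """
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Stage 1 (unsupervised): raw text supplies its own targets.
raw_text = "the contract transfers the token to the buyer".split()
pretraining_pairs = next_token_pairs(raw_text, context_size=3)

# Stage 2 (supervised): a small, human-labeled dataset for one task.
# These examples and labels are hypothetical.
finetuning_examples = [
    ("transfer 1 ETH to the buyer", "payment"),
    ("vote yes on proposal 12", "governance"),
]

for context, target in pretraining_pairs[:3]:
    print(context, "->", target)
```

The point of the sketch is that stage 1 scales with however much raw text is available, while stage 2 depends on how many labeled examples humans can produce.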

The core components of the model are built from a type of deep learning architecture called the transformer, which at each layer reweights every part of the input according to how significant it is in the context of the input as a whole.
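
To make the reweighting idea concrete, here is a rough sketch of scaled dot-product self-attention, the mechanism at the heart of a transformer layer, in plain Python. It is a simplification under stated assumptions: a real transformer first passes the token vectors through learned projections to get queries, keys, and values, and uses many attention heads, all of which are omitted here. The toy vectors are hand-picked for illustration. The `causal` flag mimics the GPT setting in which a token can only look at itself and earlier tokens.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=True):
    """Scaled dot-product attention over a short sequence.

    Each argument is a list of vectors, one per token. Every output
    vector is a weighted average of the visible value vectors, and the
    weights reflect how well that token's query matches each key.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        last = i + 1 if causal else len(keys)  # tokens visible to token i
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys[:last]
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values[:last]))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three tokens with hand-picked 2-d vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` shows how much one token draws on the tokens it can see, which is the "significance in context" described above.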

The data used to train these models are generally diverse. The first GPT was pre-trained on BookCorpus (around 7,000 books, including fiction) in order to learn a general model of the language, and later versions added much larger corpora, including Wikipedia, in order to generate more comprehensive content.

Since the GPT model took the world by storm in 2019, there have been gradual improvements in its ability to generate results from only a few examples supplied by humans. Still, the gains in performance were largely attributed to the huge increase in model parameters (from 1.5 billion to 175 billion) and in training data (around 500 billion tokens for GPT-3, drawn largely from text crawled from the web).

But this performance came with two drawbacks:

  1. The compute costs of running the model to generate meaningful results are huge.
  2. The results were often not precise enough with respect to the intention of the user.

Thus, innovation was needed to reduce the size overhead and tackle the above issues. This was first addressed by the OpenAI paper introducing InstructGPT, which essentially used reinforcement learning from human feedback (RLHF): human judgments of the model's responses are turned into a reward signal, and the model's predictions for a given conversation are then optimized against that signal using the principles of reinforcement learning.
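
One piece of this setup that is easy to illustrate is how human comparisons become a training signal. In RLHF, a reward model is typically trained on pairs of responses where a labeler preferred one over the other, using a pairwise loss of the form -log(sigmoid(r_preferred - r_rejected)). The sketch below computes that loss for hypothetical reward scores. The function name and the numbers are assumptions for illustration, and the later reinforcement learning step is not shown.

```python
import math


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise comparison loss: -log(sigmoid(margin)).

    The loss is small when the reward model scores the labeler's
    preferred response higher than the rejected one, and large when
    it ranks them the wrong way round.
    """
    margin = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two responses to the same prompt.
loss_agree = preference_loss(2.0, 0.5)     # model agrees with the labeler
loss_disagree = preference_loss(0.5, 2.0)  # model ranks the wrong way

print(f"agrees with labeler:    {loss_agree:.3f}")
print(f"disagrees with labeler: {loss_disagree:.3f}")
```

Minimizing this loss over many labeled comparisons pushes the reward model to score responses the way the labelers would, which is what later lets it stand in for human feedback during optimization.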

In this case, first, the model is prompted and then fine-tuned by the labeler to define the

Web3 and InstructGPT, an enormous opportunity:

Compared with traditional web2, web3 is still a nascent ecosystem, but it has already generated a lot of interesting public information to research, such as:

  • On-chain transaction data with sparse reference information.
  • Textual metadata from dApps relating to governance, etc.
  • Details about staking and investment strategies in DeFi.
  • Databases of smart contracts and their hacks, and security reports.
  • Data marketplaces for the open dataset (like being the )
  • etc.

The biggest advantage of web3 is that its data is verified by consensus. This can help address an issue that current models face with datasets maintained in siloed databases, primarily by reducing the biases in those datasets.

However, GPT models are compute-intensive and need good-quality labeled data, and managing the training and compute framework to serve the application at scale brings massive storage and bandwidth costs.

The above challenges can be addressed by web3-based applications, thanks to the availability of various intermediaries like:

  1. Decentralised storage like IPFS and Filecoin: these networks have scaled immensely to meet the demands of the ever-growing analysis of the web3 data points described above. Since its inception in 2018, Filecoin has been able to provide more resilient storage for big organizations like CERN to store their particle research experiments, along with immutable proofs of data redundancy at high efficiency thanks to zero-knowledge proofs.

  2. Decentralised compute: Project Bacalhau is an initiative to run massively distributed compute jobs, packaged for Docker or WASM runtimes, on top of decentralized storage like IPFS and Filecoin mentioned above. Given that GPT-2 and other models can indeed be containerized, there seems to be a possibility of eventually running the

  3. DeFi for incentivizing the labelers and the hosting of the infrastructure: web2 currently has large-scale services that provide labeling for the datasets these models need, but they are plagued by a lack of transparency about the labeling data, along with below-minimum pay and mistreatment of workers (source). DeFi can address this by incentivizing labelers to participate in the various categories of web3 data labeling in exchange for assured returns.

  4. For instance, the labelers' payments could be staked so that they appreciate over time based on the tasks the labelers complete.

  5. Labelers could also be given NFT badges certifying their labeling level and the datasets they have worked on.
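
As a thought experiment, the staking and badge ideas in the last points could be modelled roughly like this. It is an off-chain toy in plain Python, not a smart contract, and everything in it is hypothetical: the 1% growth per accepted task, the badge thresholds, and the names `LabelerAccount` and `record_accepted_task` are invented here only to show the shape of such a scheme.

```python
from dataclasses import dataclass, field

# Hypothetical parameters; a real protocol would set these by governance.
BONUS_RATE_PER_TASK = 0.01  # staked payment grows 1% per accepted task
BADGE_TIERS = [(100, "gold"), (25, "silver"), (5, "bronze")]  # min accepted tasks


@dataclass
class LabelerAccount:
    staked_payment: float
    accepted_tasks: int = 0
    datasets: set = field(default_factory=set)

    def record_accepted_task(self, dataset):
        """Credit one labeling task that passed review."""
        self.accepted_tasks += 1
        self.datasets.add(dataset)
        self.staked_payment *= 1 + BONUS_RATE_PER_TASK

    def badge(self):
        """Return the highest badge tier earned so far, or None."""
        for min_tasks, name in BADGE_TIERS:
            if self.accepted_tasks >= min_tasks:
                return name
        return None


account = LabelerAccount(staked_payment=100.0)
for _ in range(5):
    account.record_accepted_task("governance-proposals")

print(round(account.staked_payment, 2), account.badge(), account.datasets)
```

In an actual deployment the stake would live in a contract and the badge would be minted as an NFT, but the accounting logic would follow the same pattern: rewards tied to reviewed work, and a public record of what each labeler has contributed.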

This is a high-level introductory view of the “most discussed” topic in the deep learning space, and of the various approaches that web3 can take in order to start building a “web3 native” AI model ecosystem.

Thanks in advance for the feedback about my first post on the SCRF.