Research Summary: An Empirical Study of Blockchain Repositories in GitHub

TLDR

  • The authors analyze 3,664 blockchain repositories to glean data-backed insights for ecosystem development.
  • One key insight is that although the blockchain repositories are open source, there is limited autonomy because organizations control more than 50% of repositories.
  • Findings from the research may seed further research on tools for scaling the blockchain ecosystem, improving autonomy, and advancing quality assurance.

Core Research Question

What data-backed lessons can the ecosystem glean from an empirical study of blockchain repositories on GitHub?

Citation

Das, Ajoy, et al. “An Empirical Study of Blockchain Repositories in GitHub.” The International Conference on Evaluation and Assessment in Software Engineering 2022, 13 June 2022, arXiv:2205.08087.

Background

  • Empirical Analysis: A method of research built on the study and interpretation of real-world data.
  • GitHub: A platform used for collaboration by developers and anyone writing code or content requiring version control.
  • Repository (“repo”): A central place on GitHub, sometimes called a directory or storage, used to host and maintain all of a project's files, from code to images, in an organized way.
  • Application: A computer program (mobile or web) that carries out specific tasks. Examples: MetaMask, Coinbase, Binance.
  • Software: A set of computer instructions for hardware.
  • SDK: A Software Development Kit (SDK) is a set of tools and programs that developers use to create applications for a specific hardware platform or operating system.
  • Commit: A recorded change made to a file in a branch on GitHub.
  • Issues: A way to propose and track changes on GitHub.
  • Open Source Software (OSS): Software released under a copyright license that allows anyone to use it as they wish, including modifying or even monetizing it. OSS usually emerges from collaboration.
  • Pull Requests: A way to propose changes that could be merged with a branch or the main branch in a repository on GitHub.

Summary

  • In 2014, Marc Andreessen predicted greater institutional involvement in blockchain, including more funding for non-financial use cases. By 2016, funding for blockchain had overtaken bitcoin funding. This helped to expand the technology’s use cases beyond cryptocurrencies.
  • The majority of blockchain-based solutions are open-sourced, with their repositories (repo hereafter) hosted on GitHub.
  • Prior research on blockchain repos was industry-specific. The authors claim this is the first research to conduct an empirical study of the general state and interactions of GitHub blockchain repos.

Method

  • Preliminary Steps
    • The authors use a set of 86 keywords to search GitHub’s database of over 200 million repos (as of July 2021) to return 802,000 repos that are potentially related to blockchain.
    • Using metrics such as size, popularity, activity, data availability, and content, the authors filtered the 802,000 repos down to 5,200.
    • Discarding non-source-code and irrelevant repositories left the authors with the 3,664 repos on which this study is based.
  • Phase 1
    • The authors manually label the 3,664 blockchain repos studied under three broad categories: Tools, Applications (crypto), and Applications (others) for non-crypto applications.
    • An automated approach labeled only 25% of the dataset, so the authors fell back on manual categorization with the help of three human coders.
  • Phase 2
    • The authors investigate these categories using GitHub repo-based metrics to observe the status quo and interactions.
    • The replication package is available on GitHub at disa-lab/BlockchainEmpiricalEASE2022.
    • For a comparative analysis of blockchain vs non-blockchain repos, the authors randomly picked an equal number (3,664) of non-blockchain repos.
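The selection steps above (keyword search, metric-based filtering, discarding non-source-code repos, then drawing an equal-sized comparison set) can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the `Repo` fields, the three keywords, and the star and commit thresholds are hypothetical, since the summary does not list the 86 keywords or the exact cut-offs.

```python
import random
from dataclasses import dataclass


@dataclass
class Repo:
    # Hypothetical record; real GitHub metadata has many more fields.
    name: str
    description: str
    stars: int
    commits: int
    is_source_code: bool


# Assumed values for illustration only (the study used 86 keywords).
KEYWORDS = ("blockchain", "ethereum", "bitcoin")
MIN_STARS = 10
MIN_COMMITS = 50


def keyword_match(repo, keywords=KEYWORDS):
    """Return True if any keyword appears in the repo name or description."""
    text = f"{repo.name} {repo.description}".lower()
    return any(k in text for k in keywords)


def passes_metrics(repo):
    """Popularity and activity filter with assumed thresholds."""
    return repo.stars >= MIN_STARS and repo.commits >= MIN_COMMITS


def select_blockchain_repos(repos):
    """Keyword search, then metric filter, then drop non-source-code repos."""
    candidates = [r for r in repos if keyword_match(r)]
    filtered = [r for r in candidates if passes_metrics(r)]
    return [r for r in filtered if r.is_source_code]


def sample_comparison_set(repos, blockchain_repos, seed=0):
    """Randomly pick as many non-blockchain repos as there are blockchain repos.

    Raises ValueError if the pool of non-blockchain repos is too small.
    """
    chosen = {r.name for r in blockchain_repos}
    pool = [r for r in repos if r.name not in chosen and not keyword_match(r)]
    return random.Random(seed).sample(pool, k=len(blockchain_repos))


repos = [
    Repo("eth-wallet", "Ethereum wallet", 100, 200, True),
    Repo("btc-notes", "bitcoin reading list", 500, 80, False),
    Repo("tiny-chain", "toy blockchain", 1, 2, True),
    Repo("webapp", "a todo app", 50, 60, True),
    Repo("parser", "json parser", 5, 500, True),
]
selected = select_blockchain_repos(repos)
comparison = sample_comparison_set(repos, selected)
```

In this toy data only `eth-wallet` survives all three steps, and the comparison set is one randomly chosen repo from the non-blockchain pool.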

Results

  • The authors found that the Ethereum and Bitcoin blockchain platforms account for the largest number of projects.
  • Blockchain repos that fall under the “tools” category had the highest number of activities.
  • In terms of contributions, organizations contribute more to blockchain repos than individuals do.
  • Blockchain repos in the tools category show a higher degree of collaboration than repos in other categories.
  • On autonomy, 40% of non-blockchain repos have autonomous users, higher than the 33% of blockchain repos.
  • Most blockchain repo contributors are not autonomous, meaning there are restrictions when attempting to make changes or updates.
  • Using metrics such as stars and forks, the authors found that Ethereum (31.7%) and Bitcoin (15.9%) are the most popular blockchain platforms with multiple projects.
  • Using metrics such as commits, issues, and pull requests, the authors found that blockchain repos in the tools category show more activity than in other categories.
  • A look at the ownership data on GitHub showed that 58.6% of the blockchain repos the authors studied are owned by organizations, while the rest belong to individuals.
  • Using interaction types such as code contribution, maintenance, process, review and discussion, the authors found that blockchain repos in the tools category had a higher degree of collaboration.
  • Among the five interaction types mentioned in the study, the authors observe that commit contributions are higher than issues or pull requests in each of the three blockchain repo categories.
  • To understand the degree of autonomy, the authors divide users into (1) Maintainers, (2) Autonomous contributors, and (3) Dependent contributors.
  • The authors then found that the application (crypto) category has a higher proportion of maintainers than other categories. Also, across each category, autonomous contributors have the lowest proportion.
  • This means that blockchain repos have less autonomy because of the restrictions associated with approving bug fixes and network updates.
  • According to the authors' findings, more blockchain repos than non-blockchain repos get archived. In the period between 2017 and 2018 in particular, a higher number of blockchain projects were both created and archived. The Bitcoin peak period of January 2018 might be responsible for this.
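To make the three-way split of users more concrete, here is a minimal sketch of how contributors could be bucketed and their proportions computed per category. The classification rule is an assumption made for illustration: the summary does not state how the authors define maintainers, autonomous contributors, or dependent contributors, and the `UserActivity` fields are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

ROLES = ("maintainer", "autonomous", "dependent")


@dataclass
class UserActivity:
    # Hypothetical per-user counts within one repository.
    login: str
    merged_others_prs: int  # pull requests by other users that this user merged
    direct_pushes: int      # commits pushed without going through review


def classify(user):
    """Assumed rule, not the paper's definition:
    merges others' work -> maintainer; pushes directly -> autonomous;
    otherwise the user depends on someone else to accept changes."""
    if user.merged_others_prs > 0:
        return "maintainer"
    if user.direct_pushes > 0:
        return "autonomous"
    return "dependent"


def proportions(users):
    """Share of users in each role; all zeros for an empty list."""
    counts = Counter(classify(u) for u in users)
    total = sum(counts.values())
    if total == 0:
        return {role: 0.0 for role in ROLES}
    return {role: counts[role] / total for role in ROLES}


users = [
    UserActivity("alice", merged_others_prs=12, direct_pushes=40),
    UserActivity("bob", merged_others_prs=0, direct_pushes=7),
    UserActivity("carol", merged_others_prs=0, direct_pushes=0),
    UserActivity("dave", merged_others_prs=0, direct_pushes=0),
]
shares = proportions(users)
```

Under this assumed rule, a repo where most users land in the "dependent" bucket would be one with low contributor autonomy.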

Discussion and Key Takeaways

  • By analyzing 7,328 GitHub repos, an equal number of blockchain and non-blockchain repos, the authors found that blockchain repos have a higher degree of activity in terms of commits, issues, and pull requests.
  • According to the authors, some reasons for this high activity include greater funding, a rapidly developing domain, and the need to uphold blockchain's fault-tolerant reputation through rigorous code reviews.
  • In comparing users, the authors find that organizations own 58.6% of blockchain repos, while only 24.9% of non-blockchain repos belong to organizations.
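For a sense of scale, the percentages above can be turned into approximate repo counts. These counts are derived here from the rounded percentages in the summary, so they are estimates and may differ slightly from any exact figures in the paper.

```python
BLOCKCHAIN_REPOS = 3_664
NON_BLOCKCHAIN_REPOS = 3_664  # equal-sized comparison set

# Total number of repos analyzed across both groups.
total_repos = BLOCKCHAIN_REPOS + NON_BLOCKCHAIN_REPOS


def approx_count(share_percent, population):
    """Approximate number of repos implied by a rounded percentage."""
    return round(share_percent / 100 * population)


org_owned_blockchain = approx_count(58.6, BLOCKCHAIN_REPOS)
org_owned_non_blockchain = approx_count(24.9, NON_BLOCKCHAIN_REPOS)
individual_owned_blockchain = BLOCKCHAIN_REPOS - org_owned_blockchain

print(total_repos)                  # 7328
print(org_owned_blockchain)         # 2147
print(org_owned_non_blockchain)     # 912
print(individual_owned_blockchain)  # 1517
```

So roughly 2,147 of the blockchain repos are organization-owned, compared with about 912 in the non-blockchain sample.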

Implications and Follow-ups

  • Blockchain repo owners/vendors (organizations and individuals) need to improve the autonomy of contributors.
  • The authors claim this study can help with tracking trends in the blockchain ecosystem and is useful when deciding on a career change or an ecosystem investment.
  • The research can seed further research on tools for quality assurance and improved security of blockchain repos.

Applicability

  • Furthering blockchain research on issues relating to scalability, throughput and vulnerabilities.
  • Building better quality assurance tools. Because of the impact bug fixes and code updates can have on a network, there is an incentive to make them as correct as possible before they are approved or merged. Quality assurance tools can help shorten the time this takes.
  • This research is useful for new projects that are choosing a platform to build on. Identifying platforms with a high level of activity gives some assurance that help will be available when a team gets stuck. A high level of autonomy means the platform welcomes new ideas.
  • Advances the case for more autonomy in blockchain repos.

@Fizzymidas Thank you for the summary!

I wonder if the authors found it concerning that most repos are owned or maintained at the organizational level?

It seems that centralization would be useful for carrying out responsible disclosure practices, which are common in modern software development: when an independent security researcher uncovers a bug, they report it to the centralized maintainer, who then privately develops a patch to fix the flaw and releases an update.

Had the organization not played the role outlined above, it would have been difficult to report and fix a bug. If this were done in a more decentralized manner, the information would be at risk of being exploited.

This is my initial interpretation of the phenomenon. What are your thoughts?


Nice read @Fizzymidas!

This is one of the most important parts of this research paper. But one question also needs to be answered to appreciate this idea better.

Will autonomy be advantageous or disadvantageous in the long run?

Considering your observation and @Twan’s example above, this leaves us at a crossroads.


Thank you, @Fizzymidas, for this insightful summary. I see that this paper demonstrates the evolution of Blockchain technology from the perspective of OSS development on GitHub.

I think the author conducted an extensive survey on Blockchain taxonomy, consensus algorithms, and technical challenges, and even investigated the challenges and potential applications of blockchain-related IoT applications, but did not analyze the development activity of the Blockchain projects.

I am very heartened that researchers were able to present a tool such as Vandal to automatically detect security vulnerabilities in Blockchain smart contracts; that is really a fantastic development. It is also interesting to know that Blockchain repo owners on GitHub can create an environment with fewer restrictions so that users can enjoy more autonomy while creating features.

Many Blockchain projects are open-sourced to promote rapid growth and adoption. As such, we find an increasing number of Blockchain-based software repositories.

I think there is a need to look into additional reasons for low independence in Blockchain repos and to develop rules and tools that further improve user independence.

My take for the moment.
