TLDR
- The authors analyze 3,664 blockchain repositories to glean data-backed insights for ecosystem development.
- One key insight is that although the blockchain repositories are open source, there is limited autonomy because organizations control more than 50% of repositories.
- Findings from the research may seed further research on tools for scaling the blockchain ecosystem, improving autonomy, and advancing quality assurance.
Core Research Question
What data-backed lessons can the ecosystem glean from an empirical study of blockchain repositories on GitHub?
Citation
Das, Ajoy, et al. “An Empirical Study of Blockchain Repositories in GitHub.” The International Conference on Evaluation and Assessment in Software Engineering 2022, 13 June 2022, arXiv:2205.08087.
Background
- Empirical Analysis: A method of research built on the study and interpretation of real-world data.
- GitHub: A platform used for collaboration by developers and anyone writing code or content requiring version control.
- Repository (“repo”): The central place on GitHub where a project’s files, from code to images, are hosted and maintained in an organized way. Sometimes described as a project directory or storage.
- Application: A computer program (mobile or web) that carries out specific tasks. Examples: MetaMask, Coinbase, Binance.
- Software: A set of instructions that tells computer hardware what to do.
- SDK: A Software Development Kit (SDK) is a set of tools and programs that developers use to create applications for a specific hardware platform or operating system.
- Commit: A recorded change made to a file in a branch on GitHub.
- Issues: A way to propose and track changes on GitHub.
- Open Source Software (OSS): Software released under a copyright license that allows anyone to use it as they wish, including modifying or even monetizing it. OSS usually emerges from collaboration.
- Pull Requests: A way to propose changes that could be merged with a branch or the main branch in a repository on GitHub.
Summary
- In 2014, Marc Andreessen made a prediction about institutional involvement in blockchain: more funding would flow to non-financial use cases. By 2016, funding for blockchain had overtaken funding for bitcoin, which helped expand the technology’s use cases beyond cryptocurrencies.
- The majority of blockchain-based solutions are open source, with their repositories (“repos” hereafter) hosted on GitHub.
- Prior research on blockchain repos was industry-specific. The authors claim this is the first research to conduct an empirical study of the general state and interactions of GitHub blockchain repos.
Method
- Preliminary Steps
- The authors use a set of 86 keywords to search GitHub’s database of over 200 million repos (as of July 2021), which returns 802,000 repos potentially related to blockchain.
- Using metrics such as size, popularity, activity, data availability, and content, the authors filtered the 802,000 repos down to 5,200.
- Discarding non-source-code and irrelevant repositories left the 3,664 repos on which this study is based.
- Phase 1
- The authors label the 3,664 blockchain repos under three broad categories: Tools, Applications (crypto), and Applications (others), the last covering non-crypto applications.
- An automated approach labeled only 25% of the dataset, so the authors fell back on manual categorization with the help of three human coders.
- Phase 2
- The authors investigate these categories using GitHub repo-based metrics to observe the status quo and interactions.
- The replication package is available on GitHub at disa-lab/BlockchainEmpiricalEASE2022.
- For a comparative analysis of blockchain vs non-blockchain repos, the authors randomly picked an equal number (3,664) of non-blockchain repos.
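The funnel described in the Method section (keyword search, metric-based filtering, then an equally sized random comparison sample) can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the `Repo` fields, the thresholds `min_stars` and `min_commits`, and all function names are assumptions made for the example. The study's actual criteria live in the replication package.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    name: str
    description: str
    stars: int             # popularity proxy (assumed field)
    commits: int           # activity proxy (assumed field)
    has_source_code: bool  # False for docs-only or link-list repos


def matches_keywords(repo, keywords):
    """Keep repos whose name or description mentions any search keyword."""
    text = f"{repo.name} {repo.description}".lower()
    return any(k.lower() in text for k in keywords)


def passes_filters(repo, min_stars=10, min_commits=50):
    """Hypothetical popularity/activity thresholds plus a source-code check."""
    return (
        repo.has_source_code
        and repo.stars >= min_stars
        and repo.commits >= min_commits
    )


def build_study_set(candidates, keywords):
    """Apply the keyword search and then the filters, preserving input order."""
    return [
        r for r in candidates
        if matches_keywords(r, keywords) and passes_filters(r)
    ]


def sample_comparison_set(non_blockchain, size, seed=0):
    """Draw `size` non-blockchain repos at random, without replacement."""
    return random.Random(seed).sample(non_blockchain, size)
```

Fixing the seed keeps the comparison sample reproducible, which matters when the blockchain and non-blockchain sets are compared on the same metrics later.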
Results
- The authors found that the Ethereum and Bitcoin platforms account for the largest number of projects.
- Blockchain repos that fall under the “tools” category had the highest number of activities.
- In terms of contributions, organizations contribute more to blockchain repos than individuals do.
- Blockchain repos in the tools category show a higher degree of collaboration than repos in other categories.
- On autonomy, 40% of non-blockchain repos have autonomous users, higher than the 33% of blockchain repos.
- Most blockchain repo contributors are not autonomous, meaning there are restrictions when attempting to make changes or updates.
- Using metrics such as stars and forks, the authors found that Ethereum (31.7%) and Bitcoin (15.9%) are the most popular blockchain platforms with multiple projects.
- Using metrics such as commits, issues, and pull requests, the authors found that blockchain repos in the tools category show more activity than in other categories.
- A look at the ownership data on GitHub showed that 58.6% of the blockchain repos the authors studied are owned by organizations, while the rest belong to individuals.
- Using interaction types such as code contribution, maintenance, process, review and discussion, the authors found that blockchain repos in the tools category had a higher degree of collaboration.
- Among the five interaction types mentioned in the study, the authors observe that commit contributions are higher than issues or pull requests in each of the three blockchain repo categories.
- To understand the degree of autonomy, the authors divide users into (1) Maintainers, (2) Autonomous contributors, and (3) Dependent contributors.
- The authors then found that the application (crypto) category has a higher proportion of maintainers than the other categories. In every category, autonomous contributors make up the smallest proportion.
- This means that blockchain repos have less autonomy because of the restrictions associated with approving bug fixes and network updates.
- According to the authors, more blockchain repos than non-blockchain repos get archived. Notably, between 2017 and 2018 a higher number of blockchain projects were both created and archived. The bitcoin peak period of January 2018 might be responsible for this.
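The three-way split of users into maintainers, autonomous contributors, and dependent contributors can be illustrated with a small Python sketch. This summary does not spell out the paper's exact rules, so the rule below is a hypothetical stand-in: a user who has merged someone else's pull request counts as a maintainer, a user who has committed directly counts as autonomous, and everyone else counts as dependent. The action labels and function names are likewise assumptions.

```python
from collections import Counter, defaultdict


def classify_contributors(events):
    """Assign each user one role from (user, action) events.

    Hypothetical action labels: "merged_others_pr", "direct_commit",
    "pr_merged_by_other". The precedence below is an assumption.
    """
    actions = defaultdict(set)
    for user, action in events:
        actions[user].add(action)

    roles = {}
    for user, seen in actions.items():
        if "merged_others_pr" in seen:
            roles[user] = "maintainer"
        elif "direct_commit" in seen:
            roles[user] = "autonomous"
        else:
            roles[user] = "dependent"
    return roles


def role_shares(roles):
    """Return each role's share of all contributors, as a fraction."""
    counts = Counter(roles.values())
    total = sum(counts.values())
    return {role: count / total for role, count in counts.items()}
```

Under this rule, a repo where most contributions land only through pull requests merged by someone else would show a large dependent share, which matches the low-autonomy pattern the authors report.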
Discussion and Key Takeaways
- By analyzing 7,328 GitHub repos, an equal number of blockchain and non-blockchain repos, the authors found that blockchain repos have a higher degree of activity in terms of commits, issues, and pull requests.
- According to the authors, reasons for this high activity include larger funding, a rapidly developing domain, and the need to uphold blockchain’s fault-tolerant reputation through rigorous code reviews.
- In comparing owners, the authors find that organizations own 58.6% of blockchain repos, while only 24.9% of non-blockchain repos belong to organizations.
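The ownership comparison reduces to a simple proportion over repo metadata. The sketch below assumes each repo record is a dictionary shaped like a GitHub REST API repository object, whose `owner.type` field is either `"User"` or `"Organization"`. The function name and any sample records passed to it are illustrative, not taken from the study's dataset.

```python
def organization_share(repos):
    """Fraction of repos whose owner is an organization (0.0 if empty)."""
    if not repos:
        return 0.0
    org_owned = sum(1 for r in repos if r["owner"]["type"] == "Organization")
    return org_owned / len(repos)
```

Running this once over the blockchain set and once over the non-blockchain set would yield the two shares being compared.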
Implications and Follow-ups
- Blockchain repo owners/vendors (organizations and individuals) need to improve the autonomy of contributors.
- The authors claim this study can help track trends in the blockchain ecosystem and inform decisions about a career change or ecosystem investment.
- The research can seed further research on tools for quality assurance and improved security of blockchain repos.
Applicability
- Furthering blockchain research on issues relating to scalability, throughput and vulnerabilities.
- Building better quality assurance tools. Because of the impact bug fixes and code updates can have on a network, there is an incentive to get them right before they are approved or merged. Quality assurance tools can help shorten the time this takes.
- This research is useful for new projects that are choosing a platform to build on. Identifying platforms with a high level of activity gives some assurance that help will be available when developers get stuck. A high level of autonomy suggests the platform welcomes new ideas.
- Advances the case for more autonomy in blockchain repos.