TLDR
- Once deployed, smart contracts are immutable, so defects in the deployed code cannot be fixed. As a consequence, software engineering anti-patterns, such as code cloning, pose a threat to code quality and security if they go unnoticed before deployment.
- This research reports on cloning practices on the Ethereum blockchain platform by analyzing 33,073 smart contracts amounting to over 4 million lines of code (MLOC).
- Prior studies reported an unusually high 79.2% of code clones in Ethereum smart contracts. The current work replicates these measurements at a finer granularity, at the level of functions instead of entire smart contracts, which allows for better precision. We report a 30.13% overall clone ratio, out of which 27.03% are exact duplicates.
- We conclude that while the current clone ratio poses elevated threats to the Ethereum platform, refactoring these clones could be relatively simple because (i) they are mostly exact copies of each other, and (ii) they tend to form hotspots in the Solidity source code.
- Our study also reveals that the ratio of clones keeps increasing in the Ethereum code base. Thus, we urge the community to take action by building better tooling that provides native support for commonly cloned functions such as transfer and transferFrom.
Citation
F. Khan, I. David, D. Varro, and S. McIntosh, “Code Cloning in Smart Contracts on the Ethereum Platform: An Extended Replication Study,” IEEE Transactions on Software Engineering. IEEE, pp. 1–13, 2022. DOI: 10.1109/TSE.2022.3207428.
Core research questions
How frequently are verified contracts cloned?
What are the characteristics of clusters of similar verified contracts?
How frequently are code blocks of verified contracts identical to those from OpenZeppelin?
Background
- Smart contracts: Smart contracts are programs deployed on a blockchain that can be reliably executed by a network of anonymous distributed nodes without the need for a centralized trusted authority. This study assesses the cloning practices on the popular Ethereum platform.
- Verified smart contracts: To prove that a smart contract does what it is designed to do, the Etherscan service analyzes each block on the Ethereum platform and provides insights on each deployed contract. A smart contract is labeled as verified if its source code recompiled by Etherscan matches the bytecode deployed to Ethereum. This study assesses verified smart contracts.
- Code clones: A bad practice that can deteriorate many functional and extra-functional properties (e.g., security, reliability, and performance) of a software system is the abundance of duplicated source code, also known as code cloning.
- Type-1, 2, 3 clones: Type-1 clone fragments are exactly identical except for variations in whitespaces, layout, and comments. Type-2 clone fragments include Type-1 clones, but allow for differences in identifiers, literals, and data types. Type-3 clone fragments include Type-2 clones, but allow code fragments to differ in complete lines of code, thereby capturing clones with entire lines added or removed. The number of lines to be tolerated is defined by the dissimilarity threshold, in ratio with the overall code block. This study considers all three types.
- Clone granularity: Clone granularity can be either free or fixed. Free granularity clone detection considers the source code as a whole and does not make use of syntactic boundaries, such as functions, blocks, or statements. Fixed granularity, however, incorporates such syntactic units. As such, fixed granularity provides a more precise estimate of clone ratio, and is more useful than free granularity in the eventual refactoring of the duplicated code. This study assesses cloning practices on the Ethereum platform using a fixed granularity at the function level.
- NiCad: NiCad is a clone detection tool that allows for fixed granularity clone detection. (https://github.com/eff-kay/nicad6)
- Replication studies: In many domains, empirical results are considered credible only after their independent replication. Computer science is on the path of developing such good practices. This study is a conceptual replication study, i.e., it sets out to answer the same research questions but uses different methods.
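To make the clone-type definitions above concrete, the following Python sketch classifies a pair of Solidity functions as Type-1, Type-2, or Type-3. It is a simplified illustration and not NiCad's algorithm: the function names (`normalize_layout`, `blind_rename`, `dissimilarity`, `classify_pair`), the tiny keyword list, the line-based dissimilarity measure, and the 0.3 threshold are all assumptions made for this example.

```python
import difflib
import re

# Assumed, deliberately incomplete keyword list. Everything else (identifiers,
# data types) is blinded, since Type-2 clones may differ in those.
KEYWORDS = {"function", "public", "returns", "return", "require", "emit", "view"}


def normalize_layout(code):
    """Drop comments and layout differences; return the non-empty lines."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
    code = re.sub(r"//[^\n]*", "", code)               # line comments
    lines = (" ".join(line.split()) for line in code.splitlines())
    return [line for line in lines if line]


def blind_rename(lines):
    """Replace identifiers and data types with ID and literals with LIT."""
    def blind(match):
        token = match.group(0)
        if token[0] == '"' or token[0].isdigit():
            return "LIT"
        return token if token in KEYWORDS else "ID"
    return [re.sub(r'"[^"\n]*"|\b\w+\b', blind, line) for line in lines]


def dissimilarity(lines_a, lines_b):
    """Fraction of lines (relative to the longer fragment) without a match."""
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / max(len(lines_a), len(lines_b))


def classify_pair(code_a, code_b, threshold=0.3):
    """Return 'Type-1', 'Type-2', 'Type-3', or None for a pair of fragments."""
    lines_a, lines_b = normalize_layout(code_a), normalize_layout(code_b)
    if lines_a == lines_b:
        return "Type-1"  # identical up to whitespace, layout, and comments
    blind_a, blind_b = blind_rename(lines_a), blind_rename(lines_b)
    if blind_a == blind_b:
        return "Type-2"  # identical up to identifiers, literals, and types
    if dissimilarity(blind_a, blind_b) <= threshold:
        return "Type-3"  # whole lines added or removed, within the threshold
    return None


ORIGINAL = """
function transfer(address to, uint256 value) public returns (bool) {
    require(balances[msg.sender] >= value);
    balances[msg.sender] -= value;
    balances[to] += value;
    return true;
}
"""
REINDENTED = """
// moves tokens to another account
function transfer(address to, uint256 value) public returns (bool) {
  require(balances[msg.sender] >= value);
  balances[msg.sender] -= value;

  balances[to] += value;
  return true;
}
"""
RENAMED = (ORIGINAL.replace("value", "amount")
           .replace(" to,", " recipient,").replace("[to]", "[recipient]"))
EXTENDED = RENAMED.replace(
    "    return true;",
    "    emit Transfer(msg.sender, recipient, amount);\n    return true;")
UNRELATED = """
function owner() public view returns (address) {
    return _owner;
}
"""

for label, other in [("reindented", REINDENTED), ("renamed", RENAMED),
                     ("extended", EXTENDED), ("unrelated", UNRELATED)]:
    print(label, classify_pair(ORIGINAL, other))
```

The checks are ordered from strictest to most permissive, which mirrors the way each clone type includes the previous one. A real detector would parse and pretty-print the code with a grammar instead of relying on regular expressions.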
Summary
- Due to their immutable nature, repair in deployed smart contract code is not possible. As a consequence, bad software engineering practices—such as code cloning—pose more severe threats in blockchains than in traditional software settings.
- Clone detection tools are rarely used in the development of smart contracts. This is partly attributed to the fact that the majority of clone detection tools are designed for traditional programming languages, and only limited support exists for the novel class of programming languages targeting decentralized execution platforms, such as blockchains. As a consequence, the vast body of knowledge on clone detection in traditional programming languages, such as C, C++, and Java, cannot be exploited in programming languages used for developing smart contracts, such as Solidity for Ethereum.
- Prior work by Kondo et al. reported an unusually high 79.2% proportion of code clones on the Ethereum platform. Our work is an extended conceptual replication of their study, that is, we (i) pose the same research questions; but (ii) use different methods to answer them; and by that, (iii) refine and extend the findings of the original study.
- Extensions:
  - We analyze code cloning practices at the level of function blocks, as opposed to the contract-level analysis of the original study.
  - We detect near-miss (Type-3) clones, i.e., clones with modifications such as changed, added, or removed statements.
- To the best of our knowledge, this paper is the first to explore cloning in Solidity smart contracts at this finer granularity and with an awareness of these types of clones.
- To achieve this finer granularity of cloning analysis, we opt for the NiCad clone detection tool and extend it to support Solidity, the programming language of the Ethereum platform. NiCad has been frequently used for clone detection tasks in conventional software systems (e.g., "Assessing the Refactorability of Software Clones", "Clone Detection in Test Code: An Empirical Evaluation", and "An Analysis of Complex Industrial Test Code Using Clone Analysis"). It has been thoroughly analyzed and benchmarked in previous studies to identify optimal configuration settings for detecting clones.
Method
- Tooling: To conduct our experiment, we extended NiCad with a grammar to enable the parsing of Solidity source code. Our grammar (https://github.com/eff-kay/nicad6) is inspired by the grammar for Solidity available in ANTLR (https://github.com/antlr/grammars-v4/tree/master/solidity).
- Corpus: We use the corpus of the study conducted by Kondo et al. ("Code cloning in smart contracts: a case study on verified contracts from the Ethereum blockchain platform", Empirical Software Engineering), which contains 33,073 verified smart contracts written in Solidity.
- Clone detection: The clone detection of NiCad consists of (i) parsing and extraction of potential clones, (ii) pretty-printing and normalizing, and (iii) clone clustering.
- Metadata collection: For the 33,073 verified smart contracts in the corpus, we collected additional metadata from the Etherscan (https://etherscan.io/) analytics platform: creation dates and author information. Both are extracted from the transaction logs of the contracts.
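The three detection stages listed above (extraction, normalization, clustering) can be sketched at function granularity as follows. This is a minimal illustration under stated assumptions, not the study's NiCad setup: it extracts functions with a regular expression and brace matching instead of a Solidity grammar, ignores braces inside strings or comments, and clusters exact duplicates only. The helper names and the two toy contracts are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def extract_functions(source):
    """Stage (i): collect every `function ... { ... }` block by brace matching."""
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        open_brace = source.find("{", start)
        semicolon = source.find(";", start)
        if open_brace == -1 or -1 < semicolon < open_brace:
            continue  # body-less declaration, e.g. in an interface
        depth = 0
        for i in range(open_brace, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    functions.append(source[start:i + 1])
                    break
    return functions


def normalize(function_text):
    """Stage (ii): strip comments and collapse all whitespace."""
    text = re.sub(r"/\*.*?\*/", "", function_text, flags=re.S)
    text = re.sub(r"//[^\n]*", "", text)
    return " ".join(text.split())


def cluster_exact_clones(contracts):
    """Stage (iii): group functions whose normalized text is identical."""
    groups = defaultdict(list)
    total = 0
    for name, source in contracts.items():
        for function_text in extract_functions(source):
            key = hashlib.sha256(normalize(function_text).encode()).hexdigest()
            groups[key].append(name)
            total += 1
    clusters = [members for members in groups.values() if len(members) > 1]
    return clusters, total


TOKEN_A = """
contract TokenA {
    mapping(address => uint256) balances;
    function transfer(address to, uint256 value) public returns (bool) {
        require(balances[msg.sender] >= value);
        balances[msg.sender] -= value;
        balances[to] += value;
        return true;
    }
    function balanceOf(address who) public view returns (uint256) {
        return balances[who];
    }
}
"""
TOKEN_B = """
contract TokenB {
    mapping(address => uint256) balances;
    function transfer(address to, uint256 value) public returns (bool) {
        require(balances[msg.sender] >= value);
          balances[msg.sender] -= value;
        balances[to] += value;
        return true; // done
    }
    function mint(uint256 value) public {
        balances[msg.sender] += value;
    }
}
"""

clusters, total = cluster_exact_clones({"TokenA": TOKEN_A, "TokenB": TOKEN_B})
clone_ratio = sum(len(members) for members in clusters) / total
print(clusters, total, clone_ratio)
```

In this toy input, two of the four functions are duplicates of each other, so the function-level clone ratio is 0.5, even though neither contract as a whole is a copy of the other. This is the kind of distinction that a function-level analysis can capture and a contract-level analysis cannot.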
Results
- Clone ratio
  - 30.13% of the sampled corpus consists of clones. Specifically, 27.03% are Type-1 clones, i.e., exact duplicates.
  - A small proportion of clone clusters (i.e., groups of clones with similar properties) encompass a large proportion of clones. 20% of all clusters encompass 71.9% of all clones, and half of the clones can be found in just 2.07% of clusters.
- Clone evolution
  - Contracts in a clone cluster tend to be created by many authors.
  - The number of clones among newly created contracts continues to increase over time. Type-1 clones increase at a higher pace than other clones.
- Cloning from OpenZeppelin
  - Of all verified contracts, 21.79% have functions identical to those of OpenZeppelin. The three most cloned functions are transferFrom, decreaseApproval, and transfer. These three functions account for 73% of all clones from OpenZeppelin.
  - 17 of the 20 most frequently cloned contracts are Token-related, i.e., they provide functionality for the management and provision of contacts, such as buy, sell, withdraw, refund, etc.
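The concentration figures reported above (the share of clones held by the largest clusters, and the share of clusters needed to cover half of the clones) can be derived from a list of cluster sizes. The sketch below shows one way to compute them. The function names are hypothetical and the cluster sizes are made-up toy numbers, not data from the study.

```python
import math


def share_in_top_clusters(cluster_sizes, top_fraction):
    """Share of all clones contained in the largest `top_fraction` of clusters."""
    sizes = sorted(cluster_sizes, reverse=True)
    k = math.ceil(top_fraction * len(sizes))
    return sum(sizes[:k]) / sum(sizes)


def fraction_of_clusters_covering(cluster_sizes, clone_share):
    """Smallest fraction of clusters (largest first) holding `clone_share` of clones."""
    sizes = sorted(cluster_sizes, reverse=True)
    target = clone_share * sum(sizes)
    covered = 0
    for k, size in enumerate(sizes, start=1):
        covered += size
        if covered >= target:
            return k / len(sizes)
    return 1.0


# Toy cluster sizes (number of cloned functions per cluster), for illustration only.
toy_sizes = [50, 20, 8, 6, 4, 4, 2, 2, 2, 2]

print(share_in_top_clusters(toy_sizes, 0.20))          # share held by the top 20% of clusters
print(fraction_of_clusters_covering(toy_sizes, 0.50))  # clusters needed for half of the clones
```

A strongly skewed distribution like this one is what makes a small set of clusters an attractive first target for refactoring.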
Discussion and Key Takeaways
- The 30.13% clone ratio is on par with those reported by studies on conventional software systems (Java, C++, etc.).
- However, the immutability of deployed source code amplifies the threat of exploiting vulnerabilities that spread across the code base by cloning. This mechanism has been demonstrated, e.g., in the Parity Wallet Hack, in which a malicious agent drained 153,037 ETH (over 428 million USD at the time of writing the paper) from three high-profile contracts.
- These code cloning problems could be effectively addressed by refactoring.
- Most of the clones in smart contracts are of Type-1, that is, the majority of the functions are being copied without any modifications. Type-1 clones are easier to refactor using existing clone refactoring tools.
- In addition, cloned functions tend to form hotspots in the source code: half of the clones can be found in just about 2% of clusters. This allows for easier localization of potential targets of refactoring.
- Out of the functionality that is subject to frequent cloning, token management contracts, including authorization, pose the most pressing issue. A detailed look at the cloned functions reveals that basic transaction functions such as transfer and createTokens are among the most frequently cloned.
- Providing a library of secure transfer primitives could simplify the development of such functionality.
- From a language design point of view, declarative and verifiable language constructs have been identified as potential enablers to a more secure design of smart contracts.
- The benefits of such techniques have been demonstrated in blockchain languages, such as Pact and Liquidity.
- The high entropy in authorship suggests that cloning is a widespread phenomenon on Ethereum. Such community-wide bad practices are often addressed by guidelines published by community leaders, such as the Python Enhancement Proposal (PEP) 8 style guidelines for Python.
- However, such general rules cannot be enforced automatically. A better solution could be to establish community-specific DevOps processes with quality gates, enforced by code quality tools that evaluate contracts before they are deployed.
- Furthermore, we foresee the emergence of quality control as a service, provided by platform agents in exchange for compensation that is proportional to their computation investment.
- The high volume of cloning from OpenZeppelin suggests that mechanisms for reusing functionality from libraries such as OpenZeppelin could reduce the number of clones, and improve the maintainability of the overall code base. This, in turn, could improve the extra-functional properties of Ethereum, such as security, reliability, and integrity.
- Methodological takeaway: fixed granularity provides better clone estimates.
- Clone detectors of free granularity produce a higher number of false positives, e.g., they flag code fragments that were cloned on purpose, such as getter/setter methods in Java code.
- We conjecture that the viewpoint provided by fixed granularity at the function level also enhances the applicability of the results in refactoring processes aiming to eliminate duplicated code.
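The notion of authorship entropy used in the discussion above can be illustrated with a short calculation. The sketch assumes Shannon entropy over the authors of the contracts in one clone cluster, normalized by the maximum entropy for that number of distinct authors. Whether the study uses exactly this normalization is an assumption here, and the author labels are made up.

```python
import math
from collections import Counter


def author_entropy(authors):
    """Shannon entropy (in bits) of the author distribution within one cluster."""
    total = len(authors)
    if total == 0:
        return 0.0
    counts = Counter(authors)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_author_entropy(authors):
    """Entropy scaled to [0, 1]; 1 means clones are spread evenly over authors."""
    distinct = len(set(authors))
    if distinct <= 1:
        return 0.0
    return author_entropy(authors) / math.log2(distinct)


# Hypothetical clusters: each entry is the author of one cloned contract.
single_author = ["alice", "alice", "alice", "alice"]
many_authors = ["alice", "bob", "carol", "dave"]
skewed = ["alice", "alice", "alice", "bob"]

for label, cluster in [("single", single_author), ("many", many_authors),
                       ("skewed", skewed)]:
    print(label, round(normalized_author_entropy(cluster), 3))
```

A value near 0 indicates that one author produced most of the clones in a cluster, while a value near 1 indicates that the same code was copied by many different authors, which is the pattern the discussion refers to as widespread cloning.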
Implications and Follow-ups
- As the ratio of clones reportedly keeps increasing quarter by quarter, we urge the community to take action now and establish practices that help combat code cloning in the code base of Solidity smart contracts.
- Refactorings related to inheritance—such as class and method extraction, method pull up and push down—could be of particular utility. While inheritance is a supported language feature in Solidity, it is apparently underutilized, as evidenced by the high proportion of clones despite the immutability of the deployed code. This might indicate a need for better tool assistance in recognizing abstraction/inheritance opportunities.
- Opportunities in adapting traditional software engineering lifecycle models to the particularities of smart contract development should be considered as well.
Applicability
- The conclusions of this work are directly applicable to Solidity and the Ethereum platform.
- The general takeaways about the dangers of code cloning in immutable code, however, are applicable to the broader range of blockchain platforms and languages.
- The methods presented in the study are generally valid in any blockchain setting, and along with the replication package (https://github.com/software-rebels/ethereum-cloning-tse-replication-package), they provide a good starting point for replication studies.