Research Summary: Code Cloning in Smart Contracts on the Ethereum Platform: An Extended Replication Study

idavid · December 1, 2022, 2:38pm

TLDR

Once deployed, smart contracts are immutable, so defects in the deployed code cannot be fixed. As a consequence, software engineering anti-patterns, such as code cloning, pose a threat to code quality and security if unnoticed before deployment.

This research reports on the cloning practices of the Ethereum blockchain platform by analyzing 33,073 smart contracts amounting to over 4MLOC.

Prior studies reported an unusually high 79.2% of code clones in Ethereum smart contracts. The current work replicates these measurements at a finer level of granularity, that of functions instead of entire smart contracts, which allows for better precision. We report a 30.13% overall clone ratio, out of which 27.03% are exact duplicates.

We conclude that while the current clone ratio poses elevated threats to the Ethereum platform, refactoring these clones could be relatively simple because (i) they are mostly exact copies of each other, and (ii) they tend to form hotspots in the Solidity source code.

Our study also reveals that the ratio of clones keeps increasing in the Ethereum code base. Thus, we urge the community to take action, by building better tooling to provide native support for commonly cloned functions such as transfer and transferFrom.

Citation

F. Khan, I. David, D. Varro, and S. McIntosh, “Code Cloning in Smart Contracts on the Ethereum Platform: An Extended Replication Study,” IEEE Transactions on Software Engineering. IEEE, pp. 1–13, 2022. DOI: 10.1109/TSE.2022.3207428.

Core research questions

How frequently are verified contracts cloned?

What are the characteristics of clusters of similar verified contracts?

How frequently are code blocks of verified contracts identical to those from OpenZeppelin?

Background

Smart contracts: Smart contracts are programs deployed on a blockchain that can be reliably executed by a network of anonymous distributed nodes without the need for a centralized trusted authority. This study assesses the cloning practices on the popular Ethereum platform.
Verified smart contracts: To prove that a smart contract does what it is designed to do, the Etherscan service analyzes each block on the Ethereum platform and provides insights on each deployed contract. A smart contract is labeled as verified if its source code recompiled by Etherscan matches the bytecode deployed to Ethereum. This study assesses verified smart contracts.
Code clones: A bad practice that can deteriorate many functional and extra-functional properties (e.g., security, reliability, and performance) of a software system is the abundance of duplicated source code, also known as code cloning.
Type-1, 2, 3 clones: Type-1 clone fragments are exactly identical except for variations in whitespaces, layout, and comments. Type-2 clone fragments include Type-1 clones, but allow for differences in identifiers, literals, and data types. Type-3 clone fragments include Type-2 clones, but allow code fragments to differ in complete lines of code, thereby capturing clones with entire lines added or removed. The number of lines to be tolerated is defined by the dissimilarity threshold, in ratio with the overall code block. This study considers all three types.
Clone granularity: Clone granularity can be either free or fixed. Free granularity clone detection considers the source code as a whole and does not make use of syntactic boundaries, such as functions, blocks, or statements. Fixed granularity, however, incorporates such syntactic units. As such, fixed granularity provides a more precise estimate of clone ratio, and is more useful than free granularity in the eventual refactoring of the duplicated code. This study assesses cloning practices on the Ethereum platform using a fixed granularity at the function level.
NiCad: NiCad is a clone detection tool that allows for fixed granularity clone detection. (https://github.com/eff-kay/nicad6)
Replication studies: In many domains, empirical results are considered credible only after their independent replication. Computer science is on the path of developing such good practices. This study is a conceptual replication study, i.e., sets out to answer the same research questions but uses different methods.
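
To make the Type-1/2/3 definitions above more concrete, here is a minimal Python sketch of how two function fragments could be compared. It is not NiCad's implementation. The tokenizer, the `KEPT_WORDS` list, the sample functions, and the 0.3 dissimilarity threshold are all assumptions chosen for illustration, and the comment stripping is naive (it does not protect `//` inside string literals).

```python
import re
from difflib import SequenceMatcher

# Words kept verbatim during Type-2 blinding (hypothetical list); every other
# identifier-like token, including type names such as uint256, becomes "ID".
KEPT_WORDS = {"function", "returns", "return", "if", "else", "for", "while",
              "public", "private", "internal", "external", "view", "pure"}

TOKEN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\"[^\"]*\"|'[^']*'|\S")


def pretty_lines(code, blind=False):
    """Strip comments, re-tokenize, and emit one canonical line per statement."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", " ", code)                    # line comments
    lines, current = [], []
    for tok in TOKEN.findall(code):
        if blind:
            if tok[0].isalpha() or tok[0] in "_$":
                tok = tok if tok in KEPT_WORDS else "ID"
            elif tok[0].isdigit() or tok[0] in "\"'":
                tok = "LIT"
        current.append(tok)
        if tok in (";", "{", "}"):  # canonical line break after each of these
            lines.append(" ".join(current))
            current = []
    if current:
        lines.append(" ".join(current))
    return lines


def dissimilarity(a_lines, b_lines):
    """Fraction of lines without a counterpart, relative to the longer fragment."""
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / max(len(a_lines), len(b_lines), 1)


def classify(a, b, threshold=0.3):
    """Return "Type-1", "Type-2", "Type-3", or None for a pair of fragments."""
    if pretty_lines(a) == pretty_lines(b):
        return "Type-1"
    blind_a, blind_b = pretty_lines(a, blind=True), pretty_lines(b, blind=True)
    if blind_a == blind_b:
        return "Type-2"
    if dissimilarity(blind_a, blind_b) <= threshold:
        return "Type-3"
    return None


# Hypothetical sample fragments, written only for this illustration.
ORIGINAL = """function transfer(address to, uint256 value) public returns (bool) {
    require(balances[msg.sender] >= value);
    balances[msg.sender] -= value;
    balances[to] += value;
    return true;
}"""
REFORMATTED = """function transfer(address to, uint256 value)
    public returns (bool)
{
    // check balance
    require(balances[msg.sender] >= value);
    balances[msg.sender] -= value;   /* debit */
    balances[to] += value;
    return true;
}"""
RENAMED = """function send(address dst, uint amount) public returns (bool) {
    require(bal[msg.sender] >= amount);
    bal[msg.sender] -= amount;
    bal[dst] += amount;
    return true;
}"""
EXTENDED = ORIGINAL.replace("return true;",
                            "emit Transfer(msg.sender, to, value);\n    return true;")
UNRELATED = "function owner() public view returns (address) { return _owner; }"
```

With these samples, `classify(ORIGINAL, REFORMATTED)` yields `"Type-1"`, `classify(ORIGINAL, RENAMED)` yields `"Type-2"`, `classify(ORIGINAL, EXTENDED)` yields `"Type-3"` (one added line out of seven), and `classify(ORIGINAL, UNRELATED)` yields `None`.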

Summary

Due to their immutable nature, repair in deployed smart contract code is not possible. As a consequence, bad software engineering practices—such as code cloning—pose more severe threats in blockchains than in traditional software settings.
Clone detection tools are rarely used in the development of smart contracts. This is partly attributed to the fact that the majority of clone detection tools are designed for traditional programming languages, and only limited support exists for the novel class of programming languages targeting decentralized execution platforms, such as blockchains. As a consequence, the vast body of knowledge on clone detection in traditional programming languages, such as C, C++, and Java, cannot be exploited in programming languages used for developing smart contracts, such as Solidity for Ethereum.
Prior work by Kondo et al. reported an unusually high 79.2% proportion of code clones on the Ethereum platform. Our work is an extended conceptual replication of their study, that is, we (i) pose the same research questions; but (ii) use different methods to answer them; and by that, (iii) refine and extend the findings of the original study.
Extensions:
- We analyze code cloning practices at the level of function blocks, as opposed to the contract-level analysis of the original study.
- We detect near-miss (Type-3) clones, i.e., clones with modifications such as changed, added, or removed statements.
- To the best of our knowledge, this paper is the first to explore cloning in Solidity smart contracts at this finer granularity and with an awareness of these types of clones.
To achieve this finer granularity of cloning analysis, we opt for the NiCad clone detection tool and extend it to support Solidity, the programming language of the Ethereum platform. NiCad has been frequently used for clone detection tasks in conventional software systems (see "Assessing the Refactorability of Software Clones", "Clone Detection in Test Code: An Empirical Evaluation", and "An Analysis of Complex Industrial Test Code Using Clone Analysis", all on IEEE Xplore). It has been thoroughly analyzed and benchmarked in previous studies to identify optimal configuration settings for detecting clones.

Method

Tooling: To conduct our experiment, we extended NiCad with a grammar to enable the parsing of Solidity source code. Our grammar (https://github.com/eff-kay/nicad6) is inspired by the grammar for Solidity available in ANTLR (https://github.com/antlr/grammars-v4/tree/master/solidity).
Corpus: We use the corpus of the study conducted by Kondo et al. ("Code cloning in smart contracts: a case study on verified contracts from the Ethereum blockchain platform", Empirical Software Engineering), which contains 33,073 verified smart contracts written in Solidity.
Clone detection: The clone detection of NiCad consists of (i) parsing and extraction of potential clones, (ii) pretty-printing and normalizing, and (iii) clone clustering.
Metadata collection: For the 33,073 verified smart contracts in the corpus, we have collected additional metainformation from the Etherscan (https://etherscan.io/) analytics platform: creation dates and author information. Both pieces of information are extracted from the transaction log of contracts.
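
The three clone detection steps listed above (extraction, normalization, clustering) can be sketched in a few lines of Python. This is a simplified stand-in and not NiCad itself: it uses a regular expression and brace matching instead of a real Solidity grammar, it only clusters exact (Type-1) matches, and it would be confused by braces inside string literals. All function names and the two-contract corpus are hypothetical.

```python
import re
from collections import defaultdict


def strip_comments(source):
    """Remove block and line comments (naive: ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", source)


def extract_functions(source):
    """Step (i): return every function that has a body, found by brace matching."""
    source = strip_comments(source)
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        open_idx = source.find("{", start)
        semi_idx = source.find(";", start)
        if open_idx == -1 or (semi_idx != -1 and semi_idx < open_idx):
            continue  # declaration without a body, e.g. inside an interface
        depth = 0
        for i in range(open_idx, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:
                    functions.append(source[start:i + 1])
                    break
    return functions


def normalize(fragment):
    """Step (ii): canonical single-line form that ignores layout."""
    return " ".join(re.findall(r"\w+|\S", fragment))


def cluster_exact_clones(corpus):
    """Step (iii), simplified: group functions whose normalized text is identical.

    Returns the clusters (normalized text -> contract names) and the share of
    all extracted functions that belong to some cluster.
    """
    groups = defaultdict(list)
    total = 0
    for contract_name, source in corpus.items():
        for fragment in extract_functions(source):
            groups[normalize(fragment)].append(contract_name)
            total += 1
    clusters = {text: names for text, names in groups.items() if len(names) > 1}
    cloned = sum(len(names) for names in clusters.values())
    ratio = cloned / total if total else 0.0
    return clusters, ratio


# Hypothetical two-contract corpus: transfer() is copied, name() is not.
CORPUS = {
    "TokenA": """contract TokenA {
        mapping(address => uint) balances;
        function transfer(address to, uint value) public returns (bool) {
            balances[msg.sender] -= value;
            balances[to] += value;
            return true;
        }
        function name() public pure returns (string memory) { return "A"; }
    }""",
    "TokenB": """contract TokenB {
        mapping(address => uint) balances;
        // copied from TokenA
        function transfer(address to, uint value) public returns (bool)
        {
            balances[msg.sender] -= value;
            balances[to] += value;
            return true;
        }
        function name() public pure returns (string memory) { return "B"; }
    }""",
}

clusters, ratio = cluster_exact_clones(CORPUS)
```

In this toy corpus, two of the four extracted functions form a single cluster, so the function-level clone ratio is 0.5. Counting at the function level is what distinguishes this from contract-level analysis, where the two contracts as a whole would be compared instead.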

Results

Clone ratio
- 30.13% of the sampled corpus are clones. Specifically, 27.03% are Type-1 clones, i.e., exact duplicates.
- A small proportion of clone clusters (i.e., a group of clones with similar properties) encompass a large proportion of clones. 20% of all clusters encompass 71.9% of all clones; and half of the clones can be found in just 2.07% of clusters.
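
The concentration figures above (share of clones in the largest clusters, and share of clusters needed to cover half the clones) follow from the list of cluster sizes alone. A minimal sketch of that arithmetic, using made-up cluster sizes rather than the study's data:

```python
import math


def share_in_top_clusters(cluster_sizes, top_fraction):
    """Share of all clones that fall into the largest `top_fraction` of clusters."""
    if not cluster_sizes:
        raise ValueError("cluster_sizes must not be empty")
    sizes = sorted(cluster_sizes, reverse=True)
    k = math.ceil(len(sizes) * top_fraction)
    return sum(sizes[:k]) / sum(sizes)


def clusters_needed_for_share(cluster_sizes, clone_share):
    """Smallest fraction of clusters (largest first) covering `clone_share` of clones."""
    if not cluster_sizes:
        raise ValueError("cluster_sizes must not be empty")
    sizes = sorted(cluster_sizes, reverse=True)
    target = clone_share * sum(sizes)
    covered = 0
    for count, size in enumerate(sizes, start=1):
        covered += size
        if covered >= target:
            return count / len(sizes)
    return 1.0


# Hypothetical cluster sizes (number of clones per cluster), 100 clones in total.
EXAMPLE_SIZES = [50, 20, 10, 5, 5, 2, 2, 2, 2, 2]

top_20_share = share_in_top_clusters(EXAMPLE_SIZES, 0.20)          # 70 of 100 clones
half_coverage = clusters_needed_for_share(EXAMPLE_SIZES, 0.50)     # 1 of 10 clusters
```

For the example sizes, the top 20% of clusters hold 70% of the clones, and a single cluster (10% of clusters) already covers half of them. The study's own figures are 71.9% and 2.07%, respectively.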
Clone evolution
- Contracts in a clone cluster tend to be created by many authors.
- The number of clones among newly created contracts continues to increase over time. Type-1 clones increase at a higher pace than other clones.
Cloning from OpenZeppelin
- Of all verified contracts, 21.79% have functions identical to those of OpenZeppelin. The three most cloned functions are transferFrom public returns (bool), decreaseApproval public returns (bool), and transfer. These three functions account for 73% of all clones from OpenZeppelin.
- 17 of the 20 most frequently cloned contracts are Token-related, i.e., they provide functionality for the management and provision of contacts, such as buy, sell, withdraw, refund, etc.
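
One way to approximate the OpenZeppelin comparison above is to fingerprint each normalized function and look it up in a set of reference fingerprints. The sketch below assumes functions have already been extracted as strings. The reference function is a hypothetical stand-in, not actual OpenZeppelin source, and hashing only finds exact (Type-1) matches.

```python
import hashlib
import re
from collections import Counter


def fingerprint(function_source):
    """SHA-256 of the comment-free, layout-insensitive token sequence."""
    without_comments = re.sub(r"/\*.*?\*/|//[^\n]*", " ", function_source,
                              flags=re.DOTALL)
    tokens = re.findall(r"\w+|\S", without_comments)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def library_overlap(library_functions, contracts):
    """Return (share of contracts with >= 1 identical function, hits per function).

    library_functions: dict mapping a reference function name to its source.
    contracts: dict mapping a contract name to a list of function sources.
    """
    reference = {fingerprint(src): name for name, src in library_functions.items()}
    hits = Counter()
    contracts_with_match = 0
    for functions in contracts.values():
        matched = [reference[fp] for fp in map(fingerprint, functions)
                   if fp in reference]
        hits.update(matched)
        contracts_with_match += bool(matched)
    share = contracts_with_match / len(contracts) if contracts else 0.0
    return share, hits


# Hypothetical reference library and contracts, for illustration only.
LIBRARY = {
    "transfer": "function transfer(address to, uint256 value) public returns (bool) "
                "{ balances[msg.sender] -= value; balances[to] += value; return true; }",
}
CONTRACTS = {
    "CopiedToken": ["""function transfer(address to, uint256 value) public returns (bool) {
        balances[msg.sender] -= value;  // debit sender
        balances[to] += value;
        return true;
    }"""],
    "CustomToken": ["function transfer(address to, uint256 value) public returns (bool) "
                    "{ return false; }"],
}

share, hits = library_overlap(LIBRARY, CONTRACTS)
```

Here `share` is 0.5 because only `CopiedToken` contains a function identical to the reference after normalization, and `hits` counts one match for `transfer`.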

Discussion and Key Takeaways

The 30.13% clone ratio is on par with the ones reported by studies on conventional software systems (Java, C++, etc).
- However, the immutability of deployed source code amplifies the threat of exploiting vulnerabilities that spread across the code base by cloning. This mechanism has been demonstrated, e.g., in the Parity Wallet Hack, in which a malicious agent drained 153,037 ETH (over 428 million USD at the time of writing the paper) from three high-profile contracts.

These code cloning problems could be effectively addressed by refactoring.
- Most of the clones in smart contracts are of Type-1, that is, the majority of the functions are being copied without any modifications. Type-1 clones are easier to refactor using existing clone refactoring tools.
- In addition, cloned functions tend to form hotspots in the source code: half of the clones can be found in just about 2% of clusters. This allows for easier localization of potential targets of refactoring.

Out of the functionality that is subject to frequent cloning, token management contracts, including authorization, pose the most pressing issue. A detailed look at the cloned functions reveals that basic transaction functions such as transfer and createTokens are among the most frequently cloned.
- Providing a library of secure transfer primitives could simplify the development of such functionality.
- From a language design point of view, declarative and verifiable language constructs have been identified as potential enablers to a more secure design of smart contracts.
- The benefits of such techniques have been demonstrated in blockchain languages, such as Pact and Liquidity.
- The high entropy in authorship suggests that cloning is a widespread phenomenon on Ethereum. Such communitywide bad practices are often addressed by guidelines published by community leaders, such as the Python Enhancement Proposal (PEP) 8 style guidelines for Python.
- However, such general rules cannot be enforced in a computer-automated fashion, and a better solution could be establishing community-specific DevOps processes that include the usage of quality gates enforced by code quality tools that evaluate contracts that are ready to be deployed.
- Furthermore, we foresee the emergence of quality control as a service, provided by platform agents in exchange for compensation that is proportional to their computation investment.
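
As a rough illustration of the quality gate idea mentioned above, a deployment pipeline could run a clone detector first and then refuse to proceed when the clone ratio is too high. The sketch below covers only the gate itself. The function name and the 10% default limit are hypothetical, and the counts are assumed to come from whatever clone detection step precedes it.

```python
def clone_quality_gate(total_functions, cloned_functions, max_clone_ratio=0.10):
    """Return a process exit code: 0 if the gate passes, 1 if it fails."""
    if total_functions <= 0:
        raise ValueError("total_functions must be positive")
    if not 0 <= cloned_functions <= total_functions:
        raise ValueError("cloned_functions must be between 0 and total_functions")
    ratio = cloned_functions / total_functions
    passed = ratio <= max_clone_ratio
    status = "PASS" if passed else "FAIL"
    print(f"[{status}] clone ratio {ratio:.2%} (limit {max_clone_ratio:.2%})")
    return 0 if passed else 1
```

In a CI job, the returned value could be passed to `sys.exit(...)` so that a failing gate stops the deployment step. What limit is appropriate is a project decision. The study itself does not prescribe one.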
The high volume of cloning from OpenZeppelin suggests that mechanisms for reusing functionality from libraries such as OpenZeppelin could reduce the number of clones, and improve the maintainability of the overall code base. This, in turn, could improve the extra-functional properties of Ethereum, such as security, reliability, and integrity.
Methodological takeaway: fixed granularity provides better clone estimates.
- Clone detectors of free granularity produce a higher number of false positives, e.g., identify code fragments that have been cloned with a purpose, such as getter/setter methods in Java code.
- We conjecture that the viewpoint provided by fixed granularity at the function level also enhances the applicability of the results in refactoring processes aiming to eliminate duplicated code.

Implications and Follow-ups

As the ratio of clones reportedly keeps increasing quarter by quarter, we urge the community to take action now, and establish practices that help combat code cloning practices in the code base of Solidity smart contracts.
Refactorings related to inheritance—such as class and method extraction, method pull up and push down—could be of particular utility. While inheritance is a supported language feature in Solidity, it is apparently underutilized, as evidenced by the high proportion of clones despite the immutability of the deployed code. This might indicate a need for better tool assistance in recognizing abstraction/inheritance opportunities.
Opportunities in adapting traditional software engineering lifecycle models to the particularities of smart contract development should be considered as well.

Applicability

The conclusions of this work are directly applicable to Solidity and the Ethereum platform.
The general takeaways about the dangers of code cloning in immutable code, however, are applicable to the broader range of blockchain platforms and languages.
The methods presented in the study are generally valid in any blockchain setting, and along with the replication package (https://github.com/software-rebels/ethereum-cloning-tse-replication-package), they provide a good starting point for replication studies.

cipherix · December 7, 2022, 7:57pm

This analysis is fascinating, David – thank you for sharing the key insights with us!

One of the things I’ve noticed when analyzing ERC20 contracts is that there has been a noticeable increase in the standardization of functions and overall structure. For example, the BAT ERC20 token implementation from 2017 is very different from the SHIB ERC20, the latter following a structure that is more frequently used these days.

A question that came to mind when reading the summary: could standardization account for code cloning? Another interesting example is Open Zeppelin’s Defender interface, which implements Admin functionality. This particular product now accounts for the majority of application Admins. Put differently, could this be a good sign of developers converging on standardized implementations for common functionality instead of reinventing the wheel?

Ulysses · December 8, 2022, 9:31pm

@idavid, this is a clear disadvantage of open-sourcing and the immutability of smart contracts.

I understand that immutability is one of the factors encouraging this anomaly. So how about proxy contracts, which are known to be upgradeable? Will cloning still be a problem in this kind of contract too?

By the way, thanks for making this concise.

idavid · December 19, 2022, 12:02pm

Hi @Ulysses, thanks for raising an interesting point. Immutability is not the source of the problem per se, it only exacerbates the problem of code cloning frequently observed in traditional software systems as well (with replaceable/upgradeable deployed code).

Indeed, multiple upgrade mechanisms exist for Ethereum now, which allow for replacing faulty code as one would do in a traditional software system. However, we found that the clone proportion in Solidity code is on par with the clone proportion in traditional software systems. Thus, making smart contracts upgradeable will probably not impact the clone proportion in the code base directly, but it will likely mitigate certain risks.

I hope this answers your interesting question.

Ulysses · December 19, 2022, 12:27pm

@idavid yes it does.

I’m currently working on a paper about the upgradeability of smart contracts. The resource you shared on upgrade mechanisms will also go a long way. Thank you!

idavid · December 19, 2022, 12:52pm

Hi @cipherix, thank you for the interesting insight and question.

I agree that standardization efforts and community practices certainly are steps in the right direction considering the overall quality and health of the active code base. However, at the end of the day, developers are still independent professionals who are free to write their own source code as they prefer. Coding standards and guidelines might be hard to enforce, unless done in an automated fashion, for example, during the CI/CD process.

I agree with you, such efforts are very encouraging. Converging to an informal standard might also be an artifact of the community becoming more mature and developers spending more time designing their code. Perhaps more information and better documentation are available nowadays; these also help in making better design decisions.

idavid · December 19, 2022, 12:59pm

@Ulysses, good luck with your paper! Please, shoot me an email once there’s a preprint available, I’m very much interested in the topic.

Ulysses · December 19, 2022, 3:56pm

@idavid, it’s actually a summary contribution for SCRF; I’m not the original author. Here is the original work.

Yeoriton56 · December 23, 2022, 6:30pm

Hi @idavid. According to your definition of code cloning, it’s a bad practice in smart contracts. However, according to a recent paper I read, code cloning techniques can be used to detect smart contract vulnerabilities, such as Reentrancy, Denial of Service, Gas Exception, and the like.

I was a bit confused about how code cloning can deteriorate functional properties of a smart contract and at the same time serve as a means of detecting smart contract vulnerabilities… Could you please share your thoughts on this?

idavid · December 24, 2022, 3:42pm

Hi there @Yeoriton56, thanks for raising an interesting point.

We do not assert that code cloning is a bad practice in smart contracts. Our investigation is based on the assumption that code cloning, in general, has adverse effects on source code.

It’s not code cloning that can be used to detect vulnerabilities, but rather, code clone detection. The cited paper says: “Ethereum smart contracts can benefit from code clone detection techniques, particularly in vulnerability signature generation and identification of its variations in different versions of Solidity programming language. Therefore, following this literature review, we developed a framework that use code clone detection technique for identifying vulnerabilities and their variations in smart contracts.”

The authors argue that code cloning leads to elevated threats of vulnerabilities. Therefore, they suggest finding those clones (detection) and understanding the vulnerabilities they contribute to. It’s not that one should clone source code to analyze vulnerabilities.

I’d like to note that there indeed exist benefits of code cloning, such as rapid bug workarounds (Kapser and Godfrey, 2008), supporting code development in siloed organizations (Cordy, 2003), and a more concise code structure (think of getters-setters in Java code). Beneficial code cloning is usually intentional, mostly due to language restrictions (Kim et al., 2004), and it is an artifact of careful design considerations. However, apart from these special cases, code cloning, in general, is unintentional and complicates maintenance, testability, evolution, refactoring, etc. (Chatterji et al., 2013; Rattan et al., 2013; Koschke, 2008). For example, when a security vulnerability becomes public, its clones can be easily exploited. This is especially concerning in cases when deployed code is immutable and its fix is not trivial, which is the default mechanism on blockchain platforms.

I hope this answers your question.

Yeoriton56 · December 24, 2022, 8:07pm

Yes it did answer my question. Thanks for the detailed explanation. I now have a better understanding of code cloning.

Humphery · December 29, 2022, 10:32pm

@cipherix Yes, the standardization of functions and overall structure in ERC20 contracts could account for code cloning. As you mentioned, the BAT ERC20 token implementation from 2017 is very different than the SHIB ERC20, which follows a more frequently-used structure. This could be an example of developers converging on a standardized implementation for common functionality rather than reinventing the wheel. Similarly, the adoption of Open Zeppelin’s Defender interface as a way to implement Admin functionality could also be seen as a sign of developers converging on a standardized implementation for this particular functionality.

Standardization can make it easier for developers to reuse code and build on top of existing work, rather than starting from scratch. This can help to save time and resources and can also help to ensure that different implementations of similar functionality are compatible and work well together. However, it is important to be aware of the potential risks associated with code cloning and to thoroughly review and understand the code that is being used, whether it is original code or code that has been cloned from another source.

I hope this helps
