Research Summary: Machine Learning Guided Cross-Contract Fuzzing


#auditing-security #tooling-lang #scaling

By Tony Siu


  • xFuzz is a machine-learning guided smart contract fuzzing framework. It identifies three common vulnerability types: reentrancy, dangerous delegatecall, and tx-origin misuse. It also addresses cross-contract vulnerabilities, i.e., bugs that arise from interactions among more than two smart contracts at a time.
  • The authors compare xFuzz to three other state-of-the-art tools on 7,391 contracts. xFuzz detected 18 exploitable cross-contract vulnerabilities. Of the 18, 15 were vulnerabilities that no other state-of-the-art tool had detected.
  • The framework was also effective in detecting non-cross-contract vulnerabilities, requiring less than 20% of the time taken by existing fuzzing tools.

Core Research Question

What is a possible framework to address the issues of cross-contract vulnerabilities in a search space consisting of more than two intertwined smart contract dependencies?


Xue, Yinxing, et al. "Machine Learning Guided Cross-Contract Fuzzing." arXiv.org, 2022. Accessed 20 Jan 2022.


Smart contract transactions increasingly involve cross-contract calls. However, existing tools for guarding smart contracts overlook cross-contract vulnerabilities – i.e., exploitable bugs that manifest in the presence of more than two interacting contracts. Existing methods are limited to analyzing a maximum of two contracts at the same time because the search space of multiple interacting contracts is so vast.

To address this problem, Machine Learning Guided Cross-Contract Fuzzing presents xFuzz, a new machine learning-guided smart contract fuzzing framework. Compared with existing static tools that use manually-defined rules, xFuzz’s machine learning model was significantly more robust. In a comparison with three state-of-the-art tools on 7,391 contracts, xFuzz detected 15 vulnerabilities (out of a total of 18) for the first time. Furthermore, xFuzz detected twice as many non-cross-contract vulnerabilities as other tools in less than 20% of the total time.

  • Precision: also known as positive predictive value, a concept employed in machine learning that measures the fraction of retrieved items that are relevant. In this work, precision is formally defined as True Positives / (True Positives + False Positives).

  • Recall: also known as sensitivity, the fraction of relevant items that are successfully retrieved. Recall is formally defined in the paper as True Positives / (True Positives + False Negatives).
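
These two metrics trade off against each other, which matters for the model selection discussed later. A minimal sketch in Python (the counts are made up for illustration, not figures from the paper):

```python
def precision(tp, fp):
    # Fraction of flagged functions that are truly vulnerable.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of truly vulnerable functions that were flagged.
    return tp / (tp + fn)

# Hypothetical counts for illustration only.
tp, fp, fn = 48, 24, 3
print(f"precision = {precision(tp, fp):.2f}")  # 0.67
print(f"recall    = {recall(tp, fn):.2f}")     # 0.94
```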

  • Decision Trees (DTs): Decision trees are typically used for classification and regression. The goal is to model data with simple decision rules learned from predictor variables, following a tree-like decision structure.

  • Smart Contracts: Smart contracts are programs stored on a blockchain that run when predetermined conditions are met.

  • Fuzzing tools: Fuzzing tools are tools that are used for software testing to find program implementation bugs.

  • Static analyzers: Static analyzers are tools that automate the analysis of source code without executing the application.

  • Dynamic analyzers: Dynamic analyzers are tools that test and evaluate the source code of an application during runtime.

  • Abstract syntax trees (ASTs): Abstract syntax trees (or syntax trees) are representational structures used to depict the syntactic structure of source code or text files of a particular language.

  • Context-free grammars (CFGs): Context-free grammars are used to describe languages that are defined by a set of recursive rules for generating strings.

  • Benign Contracts: Benign contracts are smart contracts that are unlikely to be deemed cross-contract vulnerable in the context of TXORIGIN, DELEGATECALL, and Reentrancy vulnerabilities.

  • Condition: A condition in the context of this paper is a conditional construct in a smart contract's source code. Its predicates can involve “function calls”, “values”, “variables”, “arrays”, etc.

  • Condition Complexity: Condition complexity describes the syntactic structure complexity of a piece of source code.

  • Condition Distance: Condition distance is the number of function calls from the entry point of the smart contract to the location of the condition statement.

  • Control Dependency: In order to run on EVMs, smart contracts are compiled into opcodes. An opcode opj is said to be control-dependent on opi if there exists an execution from opi to opj such that opj post-dominates every opk on the path from opi to opj (excluding opi) but does not post-dominate opi. An opcode opj is said to post-dominate an opcode opi if every trace starting from opi must pass through opj.

  • Data Dependency: An opcode opj is said to be data-dependent on opi if there exists a trace that executes opi and subsequently opj such that W(opi) ∩ R(opj) ≠ ∅, where R(opj) is the set of locations read by opj and W(opi) is the set of locations written by opi.
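
The read/write-set definition above can be checked mechanically. A toy sketch (the opcode names and storage locations are illustrative, not from the paper):

```python
# Toy model: each opcode instance records the storage locations it
# writes (W) and reads (R). op_j is data-dependent on an earlier op_i
# in the trace iff W(op_i) intersects R(op_j).
def data_dependent(w_i: set, r_j: set) -> bool:
    return bool(w_i & r_j)

# Hypothetical trace: an SSTORE writes slot 0, a later SLOAD reads slot 0,
# so the SLOAD is data-dependent on the SSTORE.
w_sstore = {0}       # W(op_i)
r_sload = {0, 1}     # R(op_j)
print(data_dependent(w_sstore, r_sload))  # True
```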

  • Reentrancy Vulnerability: A danger when calling external smart contracts is that they can “re-enter” and call back into the calling smart contract, take over the control flow, and make changes to the data of the calling smart contract function. A real-world example is the 2016 DAO hack, in which attackers withdrew the equivalent of 150 million USD worth of ether more than they were eligible for.

  • Dangerous Delegatecall Vulnerability: Delegatecall is a calling mechanism in which a caller contract invokes a target contract’s function so that the function runs in the caller’s own context, exposing the caller contract’s state to the callee’s code. A trace suffers from a dangerous delegatecall vulnerability if it executes an opcode opc ∈ C that depends on a DELEGATECALL opcode.

  • Tx-origin Misuse Vulnerability: Tx.origin is a Solidity global variable that holds the address of the account that originally sent the call or transaction, no matter how many contract calls sit in between on the call stack. Relying on it for authorization enables phishing-like attacks, as it leaves the contract with a constant global point of attack.

  • Cross-contract Vulnerability: A group of contracts suffers from a cross-contract vulnerability if there exists a vulnerable trace whose execution spans condition statements from more than two contracts.

  • Function Priority Score - SF = fs × (SC + 1) × (Sp + 1): The function priority score evaluates the complexity of a function; functions with lower scores receive higher priority, as they are more likely to contain vulnerabilities. The suspicious factor fs is 0.5 for suspicious functions and 1 for benign ones; the caller dimensionality SC is the number of callers of the function; and the parameter dimensionality Sp is summed according to whether each parameter is complex or not. Complex parameters are scored with two dimensionalities, while plain arrays, bytes, or address parameters are scored with one dimensionality.

  • Caller Priority Score - Spath = (CondDis + 1) × (Comp + 1): The caller priority score measures the cost of traversing a program path, based on both condition complexity and condition distance. To capture how difficult condition statements are to bypass, condition complexity is measured by the number of if, for, and while statements and require and assert assertions, while condition distance counts the number of statements between the entry of the smart contract and its first condition. Static features are collected from every called function, and the priority score determines which function should be tested first.
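
The two formulas above can be sketched directly; the factor values follow the bullet descriptions, but the helper names and example inputs are my own:

```python
def function_priority(suspicious: bool, num_callers: int, param_dims: int) -> float:
    # SF = fs * (SC + 1) * (Sp + 1); a lower score means higher priority.
    fs = 0.5 if suspicious else 1.0  # suspicious factor
    return fs * (num_callers + 1) * (param_dims + 1)

def path_priority(cond_distance: int, cond_complexity: int) -> int:
    # Spath = (CondDis + 1) * (Comp + 1)
    return (cond_distance + 1) * (cond_complexity + 1)

# A suspicious function with 2 callers and one complex parameter (2 dims):
print(function_priority(True, 2, 2))  # 4.5
# A path whose first condition is 3 statements in, with 2 condition statements:
print(path_priority(3, 2))            # 12
```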


  • There are 3 main challenges to utilizing machine learning to guide cross-contract fuzzing for vulnerability detection:

  • “How do you train the machine learning model and achieve satisfactory precision and recall?”

  • “How do you combine the trained model with the fuzzer to reduce search space towards efficient fuzzing?”

  • “How do you empower the guided fuzzer with the support of effective cross-contract vulnerability detection?”

  • In the machine learning model training phase, data was collected, features were engineered, and models were evaluated. The static analyzers SLITHER, SECURIFY, and SOLHINT detected vulnerabilities in the smart contract data set, and their reports were used to label each function as “vulnerable” or not. A function was deemed “vulnerable” if at least two of the three static analyzers reported the same vulnerability (majority voting); otherwise, it was considered a benign function.

  • For feature engineering, input contracts are compiled into bytecode and then vectorized with Word2Vec. These vectors are combined with static features such as can_send_eth, has_call, and callee_external, which are extracted from ASTs and CFGs, to address challenge 1.
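
This bytecode-embedding-plus-static-features step can be approximated as follows. This is a hedged sketch: the opcode sequences and the placeholder embedding table (standing in for a trained Word2Vec model) are my own, while the 20-dimensional vectors and 7 static features follow the summary:

```python
import numpy as np

# Placeholder embedding table standing in for a trained Word2Vec model;
# the paper trains real 20-dimensional embeddings over opcode sequences.
rng = np.random.default_rng(1)
OPCODES = ["PUSH1", "SSTORE", "CALL", "SLOAD", "RETURN", "DELEGATECALL", "STOP"]
EMB = {op: rng.standard_normal(20) for op in OPCODES}

def contract_vector(opcodes, static_features):
    # Average the opcode embeddings into one 20-dim vector, then append
    # the 7 static features (e.g. can_send_eth, has_call, callee_external
    # as 0/1 flags) to form the final feature vector.
    emb = np.mean([EMB[op] for op in opcodes], axis=0)
    return np.concatenate([emb, static_features])

vec = contract_vector(["PUSH1", "SSTORE", "CALL", "SLOAD", "RETURN"],
                      np.array([1, 1, 0, 0, 1, 0, 1]))
print(vec.shape)  # (27,)
```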

  • For model selection, XGBoost, EasyEnsembleClassifier, Decision Trees, and other machine learning models such as Logistic Regression, Bayes models, Support Vector Machines, and long short-term memory neural networks were evaluated to determine which best fit an imbalanced data set of 7,391 smart contracts. The authors found that tree-based models had better precision and recall, while non-tree-based models classified minor classes very poorly, showing bias towards major classes.

  • In the guided testing phase, contracts are fed into the pre-trained models, and the resulting predictions are combined with static-analyzer results to decide which functions to fuzz first. Path priority scores are then computed for these functions; using the scores, the exploitability search space is reduced and fuzzing is “guided”.


  • The authors collected 100,139 smart contracts to preprocess. These contracts were labeled using the voting results of SOLHINT v2.3.1, SLITHER v0.6.9, and SECURIFY v1.0. This approach flagged 788 reentrancy, 40 delegatecall, and 334 tx-origin vulnerabilities.

  • Then, SLITHER extracted each smart contract’s runtime bytecode for vectorization with Word2Vec into a 20-dimensional vector. Subsequently, 7 additional static features extracted from CFGs were appended to the 20-dimensional vectors.

  • XGBoost, EasyEnsembleClassifier, and Decision Trees were selected because of their better precision and recall compared to most of the other machine learning techniques.

  • EasyEnsembleClassifier outperforms XGBoost in recall: XGBoost achieves a precision rate of 66% and a recall rate of 48%, while EasyEnsembleClassifier achieves a precision rate of 26% and a recall rate of 95%. EasyEnsembleClassifier was selected for further path prioritization for fuzzing because the authors’ goal was to detect as many “real” vulnerabilities as possible while filtering out benign contracts.
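
The selection rule behind this choice can be made explicit. A sketch using the reported figures (the selection function and its rationale comment are my paraphrase of the authors' goal, not code from the paper):

```python
# Reported precision/recall figures from the summary.
models = {
    "XGBoost":                {"precision": 0.66, "recall": 0.48},
    "EasyEnsembleClassifier": {"precision": 0.26, "recall": 0.95},
}

def pick_for_prefiltering(models):
    # For pre-filtering before fuzzing, favour recall: a missed real
    # vulnerability is gone for good, whereas a false positive merely
    # costs fuzzing time and can still be rejected later.
    return max(models, key=lambda name: models[name]["recall"])

print(pick_for_prefiltering(models))  # EasyEnsembleClassifier
```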

  • The authors’ evaluation investigates four research questions (RQs): 1. “How effective is xFuzz in detecting cross-contract vulnerability?”; 2. “To what extent do the machine learning models and path prioritization contribute to reducing search space?”; 3. “What are the overheads of xFuzz, compared to the vanilla sFuzz?”; and 4. “Can xFuzz discover real-world unknown cross-contract vulnerabilities and what are the reasons for false negatives?”

  • Datasets of smart contracts come from previously published works, reputable smart contract vulnerability websites, and smart contracts from Etherscan. Duplicate contracts, contracts without external calls, and contracts that were not developed using Solidity 0.4.24 or 0.4.25 were removed. Dataset1 has contracts from popular websites or previous works, and Dataset2 contains 7,391 contracts downloaded from Etherscan.


  • xFuzz detected 18 exploitable cross-contract vulnerabilities using a validation data set of 7,391 Smart Contracts.

  • 15 of these vulnerabilities were ones that 3 other state-of-the-art cross-contract vulnerability detection tools failed to detect.

  • xFuzz was also efficient at detecting non-cross-contract vulnerabilities, requiring less than 20% of the time taken by existing cross-contract fuzzing tools.

  • RQ1 results: Vulnerability Detection Effectiveness. ContractFuzzer failed to find vulnerabilities; sFuzz missed 3 vulnerabilities while outputting 9 incorrect reports; and xFuzz missed 2 vulnerabilities while outputting 6 incorrect reports. Missed or incorrectly detected vulnerabilities were due to conditions blocking path traversal.

  • “P%” represents precision rate and “R%” represents recall rate. “C.V.” and “C.F.” stand for Clairvoyance and ContractFuzzer, respectively.

  • ContractFuzzer achieves the best precision rate (100%) but the worst recall rate (1.7%) for reentrancy. sFuzz and Clairvoyance identified 33.5% and 40.4% of vulnerabilities, respectively. xFuzz has a slightly lower precision rate than ContractFuzzer but the best recall rate, at 86.1%. xFuzz detected a total of 209 vulnerabilities.

  • Answer to RQ1: xFuzz has 96.1% precision and 86.1% recall. Of the 4 methods, xFuzz has the best recall and has successfully found 209 real-world, non-cross-contract vulnerabilities as well as 18 real-world cross-contract vulnerabilities.

  • RQ2 results: The Effectiveness of Guided Testing

    • Figures 8 and 9 summarize the results. The tx-origin vulnerability is not included because sFuzz does not support it.

  • Most vulnerabilities are found in the first 30s as the curves climb/drop sharply at the beginning then saturate/flatten afterward.
  • The conclusion is that machine learning models reduce fuzzing time on likely benign contracts from 180 seconds to just 30 seconds.

  • “Top 10” in Table 5 refers to vulnerabilities detected within the first 10 explored paths. xFuzz found 152 of 172 reentrancy vulnerabilities and 32 of 33 delegatecall vulnerabilities in the first 10 explored paths. This clearly shows the effectiveness of path prioritization in reducing the time cost of detecting vulnerabilities.

  • Answer to RQ2: machine learning models enabled the authors to significantly reduce fuzzing times on contracts that were likely to be benign, and vulnerabilities were detected efficiently through the authors’ path prioritization techniques.

  • RQ3 results: Detection Efficiency

  • Experiments are repeated 5 times to mitigate randomness during fuzzing. The average results are reported in Table 6.

  • “MPT” means model prediction time, “ST” means search time, “DT” means detection time, and “N.A.” means the tool is not designed for fuzzing this entry or is not supported.

  • xFuzz is substantially faster than sFuzz, saving 80% of the search time.

  • xFuzz is slightly slower than sFuzz in terms of effective fuzzing time. For example, an additional 32.5 minutes is used for fuzzing cross-contract vulnerabilities due to the increased number of cross-contract paths to trace by the machine learning model.

  • Clairvoyance is the fastest tool, as it is a static detector without runtime contract support.

  • Answer to RQ3: The guided fuzzer, xFuzz, saves over 80% of the search time while reporting more vulnerabilities than sFuzz, in 20% less overall time.

  • Answer to RQ4: xFuzz was capable of locating vulnerabilities in real-world contracts by utilizing model predictions and path prioritization. There were false negatives because of complex path conditions; these may potentially be addressed by integrating hybrid fuzzing into xFuzz.

Discussion and Key Takeaways

  • Search Space Reduction via ML: the work proposed a machine-learning approach to reduce the search space of program paths to trace, achieving a model recall of 95% on a training data set of 100,000 smart contracts.
  • Performance Analysis: compared with 3 other state-of-the-art cross-contract vulnerability detection tools, xFuzz outperforms them by 42.8% in model recall with a satisfactory precision of 96.1%.
  • Expert Recognized Vulnerabilities: xFuzz found 18 cross-contract vulnerabilities that are all verified as vulnerabilities by security experts from the author’s industry partners.

Implications and Follow-ups

  • For program analysis, the authors drew valuable domain-specific knowledge from existing works. SLITHER, OYENTE, and Atzei et al. informed the authors’ outlook on transparent smart contract detection… Chen et al. and Durieux et al. helped the authors to find limitations of existing tools by evaluating state-of-the-art of cross-contract vulnerability detectors.

  • For cross-contract vulnerability, He et al. report that existing tools fail to detect vulnerabilities that are only executed at deeper states. Tracking vulnerabilities by taint analysis after constructing an ICFG (combining CFGs with call graphs) was proposed to address this issue.

  • Jiang et al. and Wustholz et al. proposed learning fuzzing strategies from inputs generated by a symbolic expert for early smart contracts. These methods inspired the authors to use a guide to reduce the cross-contract vulnerability search space.

  • To better understand malicious cross-contract attacks, Zhuang et al. proposed building graph networks. These works inspired the authors to introduce machine learning as a method for detecting vulnerabilities. The machine-learning-guided methodology for fuzzing cross-contract vulnerabilities is the key novelty that separates xFuzz from Jiang et al. and Wustholz et al.

  • Model selection criteria were inspired by the work of Liu et al. which helped the authors select the best models for recall performance and precision on highly imbalanced datasets.

  • Handling of complex path conditions can be further improved, as 3 cross-contract and 24 non-cross-contract vulnerabilities were missed because of prolonged fuzzing time or blocked conditions. xFuzz failed to penetrate conditions involving “function calls”, “values”, “variables”, and “arrays” where potential vulnerabilities may be detected; this may potentially be addressed using a theorem prover such as Z3. Integrating symbolic execution in a lightweight hybrid fuzzing approach may further improve xFuzz.


  • The authors of xFuzz aim to address the issue of cross-contract transaction vulnerabilities, particularly to enhance the performance of existing tools that identify common sets of vulnerabilities.
  • Typical tools may only identify vulnerabilities between 2 contracts at a time, while contemporary vulnerabilities scale to calls among many interacting contracts.
  • xFuzz efficiently addresses the search-space issue posed by this scaling complexity, applying a robust, automated, machine-learning-guided smart contract fuzzing framework to detect smart contract transaction vulnerabilities.

@Tony_Siu I have a cheesy question for you – why did you choose this piece to summarize? Does it overlap with any of your research interests? I know that you’re working on a startup, does this have anything to do with it?


As cheesy as it is, the reason I chose to summarize this topic is that the core technicalities don’t really lie solely within the paradigm of blockchain technology and the perspectives within the blockchain ecosystem. In fact, one of my incentives for writing research summaries at SCRF is to discover for myself, and be able to categorize, a “mapping” of where and how blockchain fits into the larger domain of Computer Science. This piece just so happens to sit at the intersection of many disciplines: machine learning, surface-level programming linguistics/compiler theory, statistics, a little bit of network security, smart contracts, and ultimately the design of a programming language! Each of these topics is ripe for dissertation-level research on its own! So for me personally, summarizing this research paper gives me a better perspective of how the core “module”, the smart contract, works within the broader concept of blockchain networks.


Hi @Tony_Siu. Good to see you on the forum again.

I’m curious about how “Context-Free Grammars” and “Abstract Syntax Trees” relate to the way that any programming language—from machine code up to a language of high-level abstraction—is structured. Do you have any thoughts on that?


It just so happens that I finished a course called “Language Structures”, which is basically programming linguistics, last semester. This course is a prerequisite to compiler theory. It covers how high-level programming language design is structured and differentiated. First, source text is tokenized according to a “context-free grammar”, typically written in Backus-Naur Form, and then parsed into “abstract syntax trees” (parse trees). This is so that programming languages can first be disambiguated and organized for a compiler to convert those structures into machine code.

“In the mid-1950s, Noam Chomsky, a noted linguist (among other things), described four classes of generative devices or grammars that define four classes of language (Chomsky, 1956, 1959). Two of these grammar classes, named context-free and regular, turned out to be useful for describing the syntax of programming languages.” - Concepts of Programming Languages, Eleventh Edition, page 137.

John Backus, who was working on the development of ALGOL 58 at the time, built upon Chomsky’s “context-free grammar” to form a new formal notation for describing programming syntax: the Backus-Naur Form (BNF). To put it briefly, a “context-free grammar” such as BNF is a set of rules, a metalanguage, that can be used to generate different combinations of syntax. BNF deals with abstractions called nonterminal symbols and terminal symbols that are recursively parsed, lexeme (or token) by lexeme.

The sub-group of “abstract syntax trees” called parse trees is just another representation of this syntax-generation ruling. As “context-free grammars” are recursively generated and parsed, it comes naturally that the generation of syntax can be mapped hierarchically into tree-like structures. Parsing “context-free grammars” into parse trees allows syntactic ambiguity to be illustrated and resolved.

Below is a simple use case:
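
Assuming an example line like `A = A * B` (my stand-in; the questions below concern an assignment and a multiplication), Python’s own `ast` module shows how such a statement parses into a tree:

```python
import ast

# Parse a single assignment statement into an abstract syntax tree.
tree = ast.parse("A = A * B")
stmt = tree.body[0]

# The tree resolves the ambiguity: the multiplication (BinOp with a
# Mult operator) is evaluated first, and its result is then bound
# to the name A by the enclosing Assign node.
print(type(stmt).__name__)           # Assign
print(type(stmt.value).__name__)     # BinOp
print(type(stmt.value.op).__name__)  # Mult
```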

As computers can’t really determine for themselves how to read this line of simple operations (“do I assign B to A first?”, “do I do the multiplication of B and A first?”, “do I read from left to right or right to left?”), these programming structures, CFGs and ASTs, allow computers to understand what humans intuitively grasp and take for granted. These formal linguistic methodologies are essential for a computer to understand and generate computation to execute. Without these “first principles” provided for computers, there is no discussion about tooling or programming languages. In relation to the paper, “Machine Learning Guided Cross-Contract Fuzzing”, I think it’s important to first understand the “first principles” the authors used to do program analysis of the language tooling behind their analysis of smart contracts. Maybe diving deep into Jiang et al. and Wustholz et al. would help.


@Tony_Siu Why is EasyEnsembleClassifier selected over XGBoost despite XGBoost seemingly having a better benchmark precision and recall harmonized score?