Research Summary: MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings

Summarized by Nhat-Minh Nguyen (@nmnguyen)

TLDR

  • Smart contracts are increasingly used with blockchain systems for high-value applications, so it is highly desirable to ensure the quality of smart contract source code before it is deployed. The DAO hack, for example, exploited a fallback function that exposed the code to reentrancy and stole 3.6 million Ether.
  • In this paper, we propose a new tool with a new method for representing smart contracts as specialized graphs and learning their patterns automatically via graph neural networks on a large scale, in order to detect vulnerabilities at both the line level and the contract level.
  • We have deployed the MANDO-GURU web app, which visualizes the specialized graphs interactively and highlights vulnerabilities, helping users double-check their smart contracts more easily.

Citation

Hoang H. Nguyen, Nhat-Minh Nguyen, Hong-Phuc Doan, Zahra Ahmadi, Thanh-Nam Doan, and Lingxiao Jiang. 2022. MANDO-GURU: Vulnerability Detection for Smart Contract Source Code by Heterogeneous Graph Embeddings. In Proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’22), Singapore, 14 - 18 November, 2022.

Preprint: https://hoanghnguyen.com/assets/pdf/nguyen2022fse.pdf

Core Research Question

  • How do our models perform compared with several state-of-the-art baselines on contract-level vulnerability classification?
  • How do our models perform on line-level vulnerability detection?

Background

  • Smart contracts are self-executing programs in which the terms of an agreement between parties (for example, a buyer and a seller) are written directly into lines of code. The code and the agreement it contains exist across a distributed, decentralized blockchain network, which verifies and executes them automatically.
  • A control-flow graph (CFG) is a representation of all paths that might be traversed through a program during its execution.
  • A call graph (CG) represents the calling relationships between subroutines in a computer program.
  • A graph neural network (GNN) is a class of artificial neural networks for processing data that can be represented as graphs.

Summary

  • Smart contracts are increasingly used with blockchain systems for high-value applications, so it is highly desirable to ensure the quality of smart contract source code before deployment.
  • More and more individual developers or industry practitioners can develop Decentralized Applications (DApps). However, previous research has shown that many real-world smart contracts deployed on blockchains have serious vulnerabilities, for example, the DAO attack and the Parity attack.
    • The DAO attack exploited a recursive call vulnerability to transfer one-third of the DAO funds (worth about USD 50 million) to a malicious account.
    • The Parity attack exploited a vulnerability in the library contract to steal over 150,000 ETH (worth about USD 30 million) and move it to a malicious account.
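
To make the recursive-call (reentrancy) pattern behind the DAO attack concrete, here is a minimal Python simulation. It is a toy model only, not Solidity and not the DAO code: the `VulnerableBank` and `Attacker` classes, the `receive` callback that stands in for a fallback function, and all amounts are hypothetical.

```python
class VulnerableBank:
    """Toy ledger that pays out BEFORE it updates the balance (the bug)."""

    def __init__(self):
        self.balances = {}
        self.vault = 0  # total funds held by the bank

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.vault += amount

    def withdraw(self, account):
        amount = self.balances.get(account, 0)
        if amount == 0 or self.vault < amount:
            return
        self.vault -= amount
        # External call: control passes to the receiver, which may re-enter.
        account.receive(self, amount)
        # The state update happens too late to stop the re-entrant calls.
        self.balances[account] = 0


class Attacker:
    """Receiver whose callback calls withdraw() again, like a malicious fallback."""

    def __init__(self):
        self.stolen = 0

    def receive(self, bank, amount):
        self.stolen += amount
        bank.withdraw(self)  # re-enter while the recorded balance is unchanged


def run_attack():
    bank = VulnerableBank()
    bank.deposit("honest_user", 90)  # hypothetical honest deposit
    attacker = Attacker()
    bank.deposit(attacker, 10)       # hypothetical attacker deposit
    bank.withdraw(attacker)
    return attacker.stolen, bank.vault


print(run_attack())
```

With a 10-unit attacker deposit and a 90-unit honest deposit, `run_attack()` returns `(100, 0)`: the attacker leaves with the whole vault. Setting the balance to zero before the external call would stop the loop after the first payout.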

Method

MANDO-GURU contains three main components: Backend, RESTful APIs, and Frontend.

The Backend plays a vital role and has several core sub-components: heterogeneous representation of the graphs generated from input smart contracts, heterogeneous graph fusion, custom multi-metapath extraction, a heterogeneous graph neural network, and vulnerability detection at the coarse-grained and fine-grained levels.

  • Heterogeneous representation: We use the Slither tool to generate basic CFGs and CGs from the input smart contract and then convert them into heterogeneous forms.
  • Heterogeneous contract graph: This is a fusion of the heterogeneous CFGs and CGs, which enriches the information available for learning. The heterogeneous CG edges of the smart contract act as bridges that link the separate heterogeneous CFGs of the smart contract functions into one global fused graph.
  • Multi-metapath extraction: Pre-defining all possible metapaths of any length over all possible node types and edge types is a challenge, because it would lead to an exponential explosion of metapaths. In addition, the order of the node types can change dynamically depending on the structure of the input contracts. To address the problem of exploding and changing metapaths, our method focuses on length-2 metapaths, using reflective connections between adjacent nodes to extract multiple metapaths from the contracts’ structures.
  • Heterogeneous graph neural network: We separate detection into two phases, Coarse-Grained Detection and Fine-Grained Detection.
    • Coarse-Grained Detection: This phase classifies whether a smart contract contains a vulnerability. We compute an embedding to represent each input smart contract and train an MLP to predict whether the contract is clean or vulnerable. This classification reduces the search space by filtering out clean contracts and reduces noisy data before the second phase, fine-grained vulnerability detection at the line level.
    • Fine-Grained Detection: We apply node classification to the node embeddings of the heterogeneous contract graph to identify the nodes that may contain vulnerabilities. These nodes correspond to statements or lines of code, which allows us to locate vulnerabilities at the fine-grained line level in the smart contract source code.
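
The fusion and metapath steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the MANDO-GURU implementation: the node IDs, node types, and edge types below are hypothetical, and "reflective" length-2 metapaths are read here as type-level patterns of the form A → B → A over adjacent nodes.

```python
from collections import Counter

# Hypothetical typed nodes for one contract with two functions.
# In practice these would be derived from Slither's CFG and CG output.
node_types = {
    "withdraw.entry": "FUNCTION_NAME",
    "withdraw.if": "IF",
    "withdraw.call": "EXPRESSION",
    "send.entry": "FUNCTION_NAME",
    "send.ret": "RETURN",
}

# Typed edges as (source, edge_type, target); edge type names are assumptions.
cfg_edges = [
    ("withdraw.entry", "next", "withdraw.if"),
    ("withdraw.if", "true", "withdraw.call"),
    ("send.entry", "next", "send.ret"),
]
cg_edges = [("withdraw.entry", "internal_call", "send.entry")]


def fuse(cfg, cg):
    """Fuse per-function CFGs into one graph by adding call-graph edges."""
    return list(cfg) + list(cg)


def count_components(nodes, edges):
    """Count connected components, ignoring edge direction."""
    adj = {n: set() for n in nodes}
    for src, _, dst in edges:
        adj[src].add(dst)
        adj[dst].add(src)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(adj[cur] - seen)
    return components


def length2_metapaths(types, edges):
    """Count type-level patterns (A, edge, B, edge, A) over adjacent nodes."""
    counts = Counter()
    for src, etype, dst in edges:
        a, b = types[src], types[dst]
        counts[(a, etype, b, etype, a)] += 1
    return counts


fused = fuse(cfg_edges, cg_edges)
print(count_components(node_types, cfg_edges))  # CFGs alone
print(count_components(node_types, fused))      # after fusion
for metapath, n in length2_metapaths(node_types, fused).items():
    print(metapath, n)
```

In this toy contract the two CFGs form two disconnected components, and the single call-graph edge joins them into one. Because only adjacent node pairs are considered, the number of metapaths grows with the number of distinct (node type, edge type, node type) combinations actually present in the contract, not with all possible type sequences.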

Results

The first table shows the buggy F1 score of our best model, which improves on both the original heterogeneous GNN (metapath2vec) and the best of three original homogeneous GNNs (GCN, LINE, node2vec).

| Model | Access Control | Arithmetic | Denial of Service | Front Running | Reentrancy | Time Manipulation | Unchecked Low Level Calls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Heterogeneous GNN | 62.90% | 56.46% | 55.17% | 63.40% | 61.79% | 66.29% | 55.22% |
| Homogeneous GNNs | 62.63% | 58.59% | 60.12% | 64.77% | 66.23% | 66.65% | 61.69% |
| MANDO-GURU | 71.19% | 66.85% | 89.15% | 89.86% | 76.09% | 87.71% | 72.08% |

The second table compares our best model with the best of six conventional tools (Securify, Mythril, Slither, Manticore, Smartcheck, Oyente), the original heterogeneous GNN (metapath2vec), and the best of three original homogeneous GNNs (GCN, LINE, node2vec).

| Model | Access Control | Arithmetic | Denial of Service | Front Running | Reentrancy | Time Manipulation | Unchecked Low Level Calls |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Conventional Detection Tools | 34.0% | 73.0% | 52.0% | 63.0% | 23.0% | 44.0% | 14.0% |
| Heterogeneous GNN | 35.46% | 68.70% | 60.64% | 80.65% | 71.66% | 67.51% | 26.06% |
| Homogeneous GNNs | 53.59% | 68.61% | 64.06% | 83.06% | 74.78% | 70.76% | 38.13% |
| MANDO-GURU | 80.93% | 84.35% | 82.12% | 90.51% | 86.40% | 90.29% | 84.81% |
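
To read the second table at a glance, the margin of MANDO-GURU over the best baseline in each category can be computed directly from the reported numbers. The snippet below is just arithmetic on the values in the table; the function and variable names are ours, and the margins are in percentage points.

```python
CATEGORIES = [
    "Access Control", "Arithmetic", "Denial of Service", "Front Running",
    "Reentrancy", "Time Manipulation", "Unchecked Low Level Calls",
]

# Scores (in percent) copied from the second table.
SCORES = {
    "Conventional Detection Tools": [34.0, 73.0, 52.0, 63.0, 23.0, 44.0, 14.0],
    "Heterogeneous GNN": [35.46, 68.70, 60.64, 80.65, 71.66, 67.51, 26.06],
    "Homogeneous GNNs": [53.59, 68.61, 64.06, 83.06, 74.78, 70.76, 38.13],
    "MANDO-GURU": [80.93, 84.35, 82.12, 90.51, 86.40, 90.29, 84.81],
}


def gains_over_best_baseline(scores, target="MANDO-GURU"):
    """Per category: target score minus the best score among all other rows."""
    baselines = [row for name, row in scores.items() if name != target]
    gains = {}
    for i, category in enumerate(CATEGORIES):
        best = max(row[i] for row in baselines)
        gains[category] = round(scores[target][i] - best, 2)
    return gains


gains = gains_over_best_baseline(SCORES)
mean_gain = round(sum(gains.values()) / len(gains), 2)
for category, gain in gains.items():
    print(f"{category}: +{gain:.2f} points")
print(f"Mean gain: +{mean_gain:.2f} points")
```

On these numbers the smallest margin is in Front Running (+7.45 points), the largest is in Unchecked Low Level Calls (+46.68 points), and the mean margin across the seven categories is about +20.29 points.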

Discussion and Key Takeaways

Applicability

  • Our method is a valuable complement to other vulnerability detection techniques and contributes to smart contract security.
  • Furthermore, we can adapt our method to cases where only compiled smart contract bytecode is available, without source code, which would broaden its applicability.
  • Our approach also fits other programming languages as long as they can be represented in a graph form.
13 Likes

Hi,

Thank you for sharing! Very interesting work. Could you comment on the training set, in particular how you obtained the labeled dataset and how large it is?

4 Likes

Hi @mainarke,

Thank you for your question.

Since MANDO-GURU is a tool paper, we lacked the space to present the dataset and experimental results.
You can find the details of the training set and labeled data in our research paper “MANDO: Multi-Level Heterogeneous Graph Embeddings for Fine-Grained Detection of Smart Contract Vulnerabilities,” which was presented at the DSAA 2022 conference and will be officially published in the following weeks. The MANDO paper explains in detail the core technologies and experiments behind the MANDO-GURU tool. You can find the preprint of the MANDO paper at the following arXiv link:
[2208.13252] MANDO: Multi-Level Heterogeneous Graph Embeddings for Fine-Grained Detection of Smart Contract Vulnerabilities.

We are also summarizing the MANDO paper and will post it in the SCRF forum in the following weeks.

3 Likes

Thanks @hoanghnguyen for this interesting research paper. It is very clear that the authors propose a new deep learning-based tool, MANDO-GURU, that aims to accurately detect vulnerabilities in smart contracts at both the coarse-grained contract level and the fine-grained line level. It is quite interesting that one primary contribution of MANDO-GURU is to capture and retain more of the structure and semantics of source code through heterogeneous representations. Experiments on datasets curated from the real world show that MANDO-GURU can detect seven types of smart contract vulnerabilities more accurately on average than several baseline methods, which makes it a promising complement to other vulnerability detection techniques. But do you think there are any challenges associated with MANDO-GURU?

4 Likes

These two amazing researchers, with this level of hypothesis, need funding for a lab to conduct actual lab work. You two may not know one another, but these two articles are parallel efforts: both describe research sophisticated enough to train an AI, and here are my thoughts on how it could work.

AI can be very simple or very complex. It ranges from rule-based systems, which are designed to make decisions based on rules and inputs, up to more adaptive systems:
• Neural Networks;
• Natural Language Processing;
• Knowledge graphs;
• Expert Systems;
• Search;
• Mini-max algorithm;
• Logic

Similarly, Natural Language Processing is an important area in smart contracts:
• Shallow semantic parsing;
• Named entity recognition;
• Coreference resolution, and others.

Shallow semantic parsing, for instance, would demonstrate how AI can work with a text agreement and, in turn, how that agreement would be executed, as a logical evolution.

This is in contrast to our current smart contracts, whose decisions are based on the inputs and rules of possibly deceitful operators.

With your research, smart contract decisions are still based on inputs and rules, but only then will smart contracts become more adaptive and include logic, neural graphs, and neural networks.

The AI can develop and deploy smart contracts based on an analysis that predicts whether or not a contract should be deployed.

AI along with smart contracts can be used in two manners:
1. In the negotiation of the terms of the agreement on behalf of the party; and/or
2. Controlling the self-executing nature of smart contracts.

To conclude my point, the effectiveness of your research stems from the elimination of human intervention in the verification of contracts.
This makes the negotiation process simpler and faster.

But would it become easier to form complex agreements? The introduction of AI into the arena of smart contracts will launch us into a new era, and this era will bear witness to the prosperity and the perishing of many businesses and legal professionals.

4 Likes

Hi @Henry, thank you for your comment. Regarding your question about our challenges, I would say the main one is the lack of quality datasets with bug labels at the line level of source code. Since our approach is deep graph learning, more labeled data are always needed for the models to learn better features. However, in the future, once we have sufficient labeled data, the proposed model can work well and independently of experts’ knowledge.

5 Likes

Thank you, your summary is wonderful, and so is your application MANDO-GURU. This application can enable developers to debug and resolve common bugs in their source code.
I watched the YouTube explanation and demonstration of the application.
MANDO-GURU detects the following 7 main bugs or errors in source code:

  1. access control
  2. arithmetic
  3. denial of service
  4. front running
  5. reentrancy
  6. time manipulation
  7. unchecked low-level calls

I tested the application using the source code (prep.js) of a basic mathematical operation.

It passed the Access_control and Arithmetic bug checks, but the others failed.

Observations:

  1. When you choose to attach a file, the application allows different file extensions, including PDFs and images.
  2. After the application analyzed the prep.js file, it returned a function and a code snippet that are not contained in the original file.

I don’t know if my questions are silly, but can this application be used to perform a test on source code that is not a smart contract?

Is it possible to restrict the type of file that can be uploaded for the test in the application?

What happens when source code has bugs like gas overflow during iteration (DoS), integer overflow/underflow (DoS), and storing private data?

7 Likes

@hoanghnguyen I am impressed by your work and your reserving research; Fine work

Contract vulnerabilities can be found using the source code of smart contracts. A vulnerability is a flaw in the planning or execution of a software or system that enables an attacker to access information, data, or resources, frequently with negative consequences.

Because formal semantics are rigid, smart contracts have this issue as well. Regular programming languages have developed formal semantics over a considerable period of time, and security analysis also uses formal semantics of programming languages. But typically, especially for security analysis, the semantics of smart contracts are less developed. The formal semantics of smart contracts are examined in this paper through the security analysis of a smart contract and the novel technique of heterogeneous graph embedding.

We use Google’s word2vec model, which builds a semantic vector from text and applies it to the source code of smart contracts, as a heterogeneous graph embedding technique. We choose three security flaws in the smart contract source code and demonstrate that our suggested heterogeneous graph embedding method outperforms other embedding methods.

8 Likes

Thanks @hoanghnguyen for this research summary; it was nice reading it. I think this paper concentrates on Ethereum smart contracts as a sample of software code represented by heterogeneous contract graphs, which are built upon both control-flow graphs and call graphs and contain different types of nodes and links. I also observed that the authors use the embeddings to train networks to recognize graphs or nodes that may contain vulnerabilities, and thus identify the vulnerable functions or lines of code, and that they applied the approach to Ethereum smart contracts written in the Solidity programming language. I think this approach enables novel multi-level graph embeddings for fine-grained detection of smart contract vulnerabilities, and thus the authors named it MANDO. For the purpose of understanding this paper, it is of great importance to know that MANDO is novel in its graph neural network structure, which fuses topological GNN and node-level attentions with heterogeneous GNN to generate both node-level and graph-level embeddings that can capture the structural information of graphs more accurately. I think it is unclear which node feature generation method is the best among the heterogeneous and homogeneous GNNs and the node-type one-hot vectors. However, integrating these types of GNNs inside MANDO outperforms all the baselines. Hence, we believe that the architecture of MANDO for combining different GNNs is suitable for classifying vulnerable smart contracts.

9 Likes

Hi @hoanghnguyen, Thank you for this fascinating research paper. I was drawn to your work by the idea of MANDO-GURU, an unsupervised graph embedding-based approach for vulnerability detection, and I appreciate the way you implemented the MANDO-GURU web app for visualizing specialized interactive graphs and highlighting vulnerabilities that make it simple for users to verify their smart contracts.

It is worth noting that at the core of MANDO-GURU is the use of unsupervised graph embedding schemes to discover vulnerabilities in smart contracts. This procedure begins with the construction of a heterogeneous graph, which is made up of nodes for each component or component type in the smart contract code and edges for the interactions between these components.

The next step is to perform an unsupervised graph embedding, which generates a low-dimensional representation of each node in the graph. This representation preserves both local and global structure, allowing for a comparison of how similar different components are through their similarities in shape.

Following that, a vulnerability score can be assigned to each component based on how vulnerable it is according to what has been identified during the embedding process. If a component is found to be more vulnerable than others, it can be flagged as needing further inspection or addressing. Using this method enables us to both identify and prioritize vulnerabilities in smart contract source code quickly and accurately, without any human interaction or manual review required.

Enhancing Privacy and Security of Smart Contracts Using Heterogeneous Graph Embeddings

Using heterogeneous graph embeddings to enhance the privacy and security of smart contracts can provide a good deal of security and protection. This is because the embeddings are able to capture the relationships between different elements, such as variables and functions, in the source code and can detect any type of vulnerabilities based on the learned patterns.

This kind of approach is not only beneficial for detecting vulnerabilities in smart contract source code, but it also assists in improving their privacy and security overall. By analyzing the graph structure, our model can identify weak points that may be vulnerable to attack, as well as help developers create more secure and private contracts.

Additionally, it is possible to spot nefarious connections between elements by using heterogeneous graph embeddings. The model, for instance, would be capable of identifying any attempts by an attacker to send unauthorized transactions or gain access to confidential data by utilizing known vulnerabilities.

Furthermore, MANDO-GURU’s heterogeneous graph embedding model can provide detailed insights into the code structure, allowing developers to easily detect potential security issues before deploying a contract on a blockchain network.

In conclusion, MANDO-GURU is still in its early stages and has shown promising results. I’m aware the team is currently working on expanding the tool’s capabilities and refining its methodology. But the potential for automated vulnerability detection is huge, and it’s something that we’re keeping an eye on.

3 Likes

Additionally, the system is designed to protect users’ privacy, which means that sensitive data such as source code can be processed without revealing any confidential information. This makes MANDO-GURU a promising tool for improving smart contract security and privacy.

3 Likes

@Raphking Sorry for my late response. I had an extended business trip in recent weeks.

About your questions 1, 2, 3, 4:
The system does not allow PDF or image files. It seems you had selected a file in the dropdown menu. In that case, the system gives priority to the smart contract source file selected in the dropdown list and ignores the file you uploaded. That is why you saw results even with the unsupported file types.

About your last question:
The current version only supports seven bug types, as we mentioned. However, our approach allows us to extend the tool to new bug types if we can find well-labeled data to train the heterogeneous graph neural network model.

3 Likes

@hoanghnguyen welcome back

Thank you for your explanation. I went back to the platform to test it accordingly. Below are my findings:

  1. I uploaded an image file without selecting any option from the dropdown menu and submitted it for vulnerability detection, but the progress bar kept showing “scanning smart contract” endlessly. The same thing happens if I upload a JavaScript source file.
    Furthermore, my suggestion/question is whether MANDO-GURU can be made so that, when choosing a file to upload, it only allows files that it can process.
    For example, when applying for a job, the platform tells you that only a PDF file is allowed, and when you choose the upload option, all file formats other than supported ones like PDF are greyed out.

In conclusion, it would be fantastic if this platform could be used for vulnerability detection in source code other than smart contracts that can be handled with graph techniques. What do you think about this?

1 Like

Hi @Raphking, we have checked the system and fixed the issue. Would you please double-check it?

Thank you for your question about extending the approach to other source code beyond smart contracts. That is also in our plan. We are working on some transfer learning models to expand the application of our work. So, maybe in the following year, we could present a more general approach applicable to multiple programming languages.

1 Like

@hoanghnguyen Thank you for your response. I just tested the platform, and yes, the issue has been fixed: it pops up a dialog box with the message “failed” when you try to analyze an unsupported file format.

Furthermore, I’m looking forward to the extension of the software, and I hope you will let us know when it is ready.

2 Likes

Great research @hoanghnguyen !
It is important to remember that the main goal of MANDO-GURU is to find smart contract vulnerabilities by employing unsupervised graph embedding techniques. Building a heterogeneous graph, which has nodes for each component or component type in the smart contract code and edges for their relationships, is the first step in this process.

I also noticed that the authors applied the technique to Ethereum smart contracts written in Solidity in order to train networks to recognize graphs or nodes that could contain vulnerabilities and to identify the vulnerable functions or lines of code. I believe the authors’ method, which they called MANDO, offers novel multi-level graph embeddings for fine-grained detection of smart contract vulnerabilities.

Knowing that MANDO is novel in its graph neural network structure, which fuses topological GNN and node-level attentions with heterogeneous GNN to generate both node-level and graph-level embeddings that capture the structural information of graphs more accurately, is crucial for understanding this paper. Which node feature generation method is best among the node-type one-hot vectors and the heterogeneous and homogeneous GNNs is unclear in my opinion. Nevertheless, MANDO outperforms all the baselines when these GNNs are integrated into it. Therefore, MANDO’s architecture for combining different GNNs appears suitable for classifying vulnerable smart contracts.

4 Likes