Research Summary: Tutela: An Open-Source Tool for Assessing User-Privacy on Ethereum and Tornado Cash

TLDR

  • The authors introduced Tutela, an open-source tool for assessing the level of user privacy on Ethereum and Tornado Cash.
  • They provided a brief and quantitative analysis of the application range and efficacy of the Tutela heuristics.
  • The research also revealed that using Tutela would assist blockchain users in better securing themselves within the current ecosystem.

Core Research Question

How can Tutela support Tornado Cash and Ethereum users in calculating user security scores, securing their transaction histories, and detecting security threats?

Citation

Mike Wu, Will McTighe, Kaili Wang, Istvan A. Seres, Nick Bax, Manuel Puebla, Mariano Mendez, Federico Carrone, Tomas De Mattey, Herman O. Demaestri, Mariano Nicolini, & Pedro Fontana. (2022). Tutela: An open-source tool for assessing user-privacy on Ethereum and Tornado Cash. arXiv:2201.06811 [cs.CR]. Available at: https://arxiv.org/pdf/2201.06811.pdf

Background

  • Tornado Cash (TC): A decentralized, non-custodial privacy solution built on Ethereum; TORN is its token. Users can break the link to their transaction history using a so-called mixer contract. Tornado Cash is the most widely used non-custodial mixer on Ethereum.
  • Tutela: An Ethereum wallet anonymity detection tool that tells people whether their blockchain transactions have revealed anything about their identity.
  • Ethereum Name Service (ENS): A distributed, open, and extensible naming system based on the Ethereum blockchain.
  • Address reuse: The use of the same address for multiple transactions. It is an unintended practice that harms the privacy and security of the participants in the transactions as well as future holders of their value. It also only functions by accident, not by design, so it cannot be depended on to work reliably.
  • Principal Component Analysis (PCA): A technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.
  • Gaussian kernel: The Gaussian kernel transforms the dot product in the infinite-dimensional space into the Gaussian function of the distance between points in the data space: If two points in the data space are nearby then the angle between the vectors that represent them in the kernel space will be small.
  • Decentralized applications (DApps): Blockchain program(s) designed for the end user primarily on the Ethereum blockchain, or any other network capable of launching Turing-complete programs.
  • Externally owned accounts (EOAs): Accounts that are controlled by a private key and have no code associated with them. If you hold the private key associated with an EOA, you can send Ether and messages from it.
  • Market Capitalization (Market Cap): The most recent market value of a company’s outstanding shares. The Market Cap is equal to the current share price multiplied by the number of shares outstanding.
  • Smart Contracts (SMC): A program constituting ‘if’ and ‘then’ commands executed on a blockchain, for example, the transaction framework with which NFTs are bought and sold.
  • Kaggle: An online community platform for data scientists and machine learning enthusiasts. Kaggle allows users to collaborate with other users, find and publish datasets, use GPU-integrated notebooks, and compete with other data scientists to solve data science challenges.
  • Deposit Address Reuse (DAR): When you send tokens from an Ethereum wallet to your account at a centralized exchange, the exchange creates a unique deposit address for each customer. If you reuse the same deposit address by sending tokens to it from multiple Ethereum wallets, those wallets can be linked. Even if you send tokens from multiple wallets to multiple deposit addresses, all of these addresses can be linked.
  • Centralized exchange (CEX): Exchange that uses a third party to facilitate the transactions between the sellers and buyers.
  • Diff2Vec: A machine learning algorithm for embedding graph nodes. Applying it to Ethereum transactions allows the clustering of potentially related addresses.
  • Principal component analysis (PCA): The process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

Summary

  • The authors introduce Tutela, an application built on expert heuristics to report the true anonymity of an Ethereum address.
  • The pseudonymity that underpins blockchains fosters a feeling of privacy, which frequently leads to abuses such as money laundering or undue voting power.
  • Addresses with the same owner can be clustered together using graph analysis techniques.
  • While Tornado Cash is in use, other tools are generally limited to academic contexts and have not yet been implemented or shown in public services.
  • The researchers created a web application that uses many cutting-edge algorithms to determine the anonymity of Ethereum addresses.
  • They also suggest a set of additional heuristics aimed against Tornado Cash, emphasizing that even when utilizing a mixer, negligent user activity might betray identity.

Method

  • Tutela is a web application with three functions: it tells users which of their Ethereum addresses are associated, shows how those addresses are linked, and analyzes the anonymity sets of Tornado Cash pools.
    • Ethereum Address Clustering
    • Ethereum Address Reveals
    • Tornado Cash Anonymity Set Auditor
  • They draw on two data sets as sources for the heuristics.
    • BigQuery is used to retrieve Ethereum transactions from the crypto_ethereum dataset.
    • A public Kaggle list of known addresses is used to identify exchange addresses for heuristics and to put known limitations on the inferred identification of clustered addresses.
  • Furthermore, they filtered the transaction data from crypto_ethereum down to transactions involving Tornado Cash pools.
    • When a relayer withdraws from a Tornado Cash pool, they decode the transaction input data using the contract ABI to get the beneficiary address.
    • They find 97,365 deposit and 83,782 withdrawal transactions across all pools.
  • They discuss two Ethereum-wide techniques used to cluster together addresses that may belong to the same entity using the above big dataset.
    • DAR
      • DAR connects EOAs with the help of a CEX.
      • The DAR method employs heuristics to identify deposit addresses. It has two hyperparameters: the maximum amount difference and the maximum time gap between two transactions.
    • NODE
      • As a supplement to DAR, they examine a second Ethereum-wide heuristic (NODE) that projects each address to a point in a low-dimensional vector space based on the addresses it transacts with.
      • Addresses belonging to the same entity should be close in Euclidean distance in this vector space.
      • They concentrate on Diff2Vec, which has been used for blockchain transactions.
      • Diff2Vec’s concept is to summarize a node by its vicinity using a diffusion-like random process.
  • They highlighted five Tornado Cash heuristics for detecting compromised deposits. These include:
    • Address Match
    • Unique Gas Price
    • Linked ETH Addresses
    • Multiple Denomination
    • TORN Mining
  • They presented a quick quantitative analysis of the Tutela heuristics scale and efficacy.
    • Ethereum Heuristics
      • They discovered 26M EOA addresses using DAR, resulting in 2.5M clusters of Ethereum addresses.
      • They discovered 131M clusters of Ethereum addresses using NODE, with each cluster having exactly 9 members by design (10 including itself).
      • Lastly, they tested DAR and NODE in combination.
    • Tornado Cash Heuristics
      • They discovered that 42.8k of the 97.3k Tornado Cash deposits are likely compromised.
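
The DAR step in the Method list can be illustrated with a minimal sketch. It assumes a simplified transaction record and measures the time gap in blocks; the names `Tx`, `find_deposit_addresses`, `max_amount_diff`, and `max_block_gap` are hypothetical and not taken from the paper, whose actual implementation and thresholds may differ. An address is flagged as a deposit address when it forwards funds to a known exchange address shortly after receiving nearly the same amount.

```python
from collections import namedtuple

# Hypothetical, simplified transaction record (not the BigQuery schema).
Tx = namedtuple("Tx", ["sender", "receiver", "amount", "block"])

def find_deposit_addresses(txs, exchanges, max_amount_diff, max_block_gap):
    """Return {deposit_address: set of EOAs that funded it}."""
    # Forwarding transactions: anything sent to a known exchange address.
    forwards = [t for t in txs if t.receiver in exchanges]
    deposits = {}
    for fwd in forwards:
        candidate = fwd.sender
        if candidate in exchanges:
            continue
        for t in txs:
            # Only consider transfers into the candidate from non-exchange senders.
            if t.receiver != candidate or t.sender in exchanges:
                continue
            gap = fwd.block - t.block
            diff = t.amount - fwd.amount  # fees make the forward slightly smaller
            if 0 <= gap <= max_block_gap and 0 <= diff <= max_amount_diff:
                deposits.setdefault(candidate, set()).add(t.sender)
    return deposits

txs = [
    Tx("eoa_a", "dep_1", 1.00, 100),
    Tx("dep_1", "cex", 0.99, 103),
    Tx("eoa_b", "dep_1", 2.00, 200),
    Tx("dep_1", "cex", 1.99, 204),
    Tx("eoa_c", "eoa_d", 5.00, 300),  # ordinary transfer, no exchange involved
]
deposits = find_deposit_addresses(txs, {"cex"}, max_amount_diff=0.05, max_block_gap=10)
print(deposits)
```

In this toy data both `eoa_a` and `eoa_b` fund the same deposit address, which is exactly the reuse pattern that links them. Loosening either hyperparameter finds more deposit addresses but also raises the risk of false positives.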

Results

  • The figure below represents an example of the Tornado Cash 1 ETH pool.

  • Addresses A through F deposit to and withdraw from the pool.

  • It quickly becomes impossible to associate withdraw and deposit transactions given a growing mixer.

  • Graph of transactions between EOA, deposit, and CEX addresses.

  • A cluster is defined as a weakly linked component of an undirected subgraph that solely contains EOA and deposit nodes.

  • The gray circles show EOA addresses in two clusters.
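
The cluster definition above (weakly connected components over EOA and deposit nodes only) can be sketched with a small union-find. This is an illustrative sketch, not the authors' code; `cluster_addresses` and the toy edge list are hypothetical.

```python
def cluster_addresses(edges, exchanges):
    """Group EOA and deposit addresses into connected components.

    Edges touching an exchange are dropped, so two users are not merged
    merely because they use the same exchange.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        if a in exchanges or b in exchanges:
            continue
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in components.values())

edges = [
    ("eoa_a", "dep_1"), ("eoa_b", "dep_1"), ("dep_1", "cex"),
    ("eoa_c", "dep_2"), ("dep_2", "cex"),
]
clusters = cluster_addresses(edges, {"cex"})
print(clusters)
```

Edge direction is ignored here, which is what "weakly" connected means for a directed transaction graph.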

  • The four steps of the Diff2Vec algorithm.
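
Once addresses have been embedded (by Diff2Vec or any other method), a NODE-style cluster of an address and its nearest neighbours in Euclidean distance can be sketched as follows. The embedding step itself is not shown; `nearest_neighbours` and the toy two-dimensional vectors are hypothetical, and the summary's clusters use 9 neighbours rather than the 2 used here.

```python
import math

def nearest_neighbours(embeddings, query, k):
    """Return the k addresses closest to `query` in Euclidean distance."""
    query_vec = embeddings[query]
    distances = [
        (math.dist(query_vec, vec), addr)
        for addr, vec in embeddings.items()
        if addr != query
    ]
    distances.sort()
    return [addr for _, addr in distances[:k]]

# Toy 2-D embeddings; real embeddings would have more dimensions.
embeddings = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (0.0, 0.2),
    "d": (3.0, 3.0),
    "e": (-4.0, 1.0),
}
neighbours = nearest_neighbours(embeddings, "a", k=2)
print(neighbours)
```

Because every address gets exactly k neighbours regardless of how far away they are, clusters built this way have a fixed size by design.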

  • Tornado Cash Heuristics

    • Address Match

      • A single address depositing to and withdrawing from a TC pool is represented by the triangle.

    • Unique Gas Price

      • Two addresses are depositing and withdrawing with the same gas price of 27.4 gwei.

    • Linked ETH Addresses

      • The green arrows represent interactions between two addresses A and D outside TC.

      • Addresses A and D deposit and withdraw from the same Tornado Cash pool, respectively.

    • Multiple Denomination

      • Addresses A and D deposit and withdraw the same number of times from the same three Tornado Cash pools, respectively.

    • TORN Mining

      • Address D was given 10 TORN upon withdrawing from the 1 ETH pool in return for anonymity points, linking address D to a deposit 100 blocks prior.

      • Searching records, only address A deposited in the 1 ETH pool 100 blocks prior, compromising address D.

  • Note that the numbers presented here are for explanatory purposes.
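
Three of the Tornado Cash heuristics illustrated above can be sketched in a few lines. This is a minimal sketch under stated assumptions: `PoolTx` is a hypothetical simplified record, "unique" is taken to mean that exactly one deposit and one withdrawal in the pool use a given gas price, and the anonymity point figures (`claimed_ap`, `ap_per_block`) are made-up numbers for illustration, not real Tornado Cash rates.

```python
from collections import Counter, namedtuple

# Hypothetical, simplified record of a pool deposit or withdrawal.
PoolTx = namedtuple("PoolTx", ["tx_id", "address", "gas_price_gwei", "block"])

def address_match(deposits, withdrawals):
    """Link a withdrawal to every deposit made by the same address."""
    by_address = {}
    for d in deposits:
        by_address.setdefault(d.address, []).append(d.tx_id)
    return [(dep_id, w.tx_id)
            for w in withdrawals
            for dep_id in by_address.get(w.address, [])]

def unique_gas_price(deposits, withdrawals):
    """Link a deposit and a withdrawal that alone share a gas price."""
    dep_counts = Counter(d.gas_price_gwei for d in deposits)
    wd_counts = Counter(w.gas_price_gwei for w in withdrawals)
    dep_by_price = {d.gas_price_gwei: d.tx_id for d in deposits}
    return [(dep_by_price[w.gas_price_gwei], w.tx_id)
            for w in withdrawals
            if dep_counts[w.gas_price_gwei] == 1 and wd_counts[w.gas_price_gwei] == 1]

def torn_mining(deposits, withdrawal, claimed_ap, ap_per_block):
    """Link a withdrawal to the only deposit made the implied number of blocks earlier."""
    blocks_in_pool = claimed_ap // ap_per_block
    target_block = withdrawal.block - blocks_in_pool
    matches = [d.tx_id for d in deposits if d.block == target_block]
    return [(matches[0], withdrawal.tx_id)] if len(matches) == 1 else []

deposits = [
    PoolTx("d1", "A", 27.4, 1000),
    PoolTx("d2", "B", 30.0, 1010),
    PoolTx("d3", "C", 30.0, 1020),
    PoolTx("d4", "E", 45.0, 1030),
]
withdrawals = [
    PoolTx("w1", "D", 27.4, 1100),
    PoolTx("w2", "E", 30.0, 1200),
    PoolTx("w3", "F", 50.0, 1300),
]
same_address = address_match(deposits, withdrawals)
same_gas = unique_gas_price(deposits, withdrawals)
mined = torn_mining(deposits, withdrawals[0], claimed_ap=2000, ap_per_block=20)
print(same_address, same_gas, mined)
```

In the toy data, address E is caught by reusing its own address, while A and D are tied together both by the shared 27.4 gwei gas price and by the 100-block gap implied by the claimed anonymity points.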

  • Visualization of 10,000 random nodes from the DAR graph

  • A 10k-node subgraph was sampled from the 26M-node DAR graph. It contains EOAs (green), deposit addresses (blue), and exchanges (red).

  • They see an intriguing pattern with numerous small clusters distributed evenly, balanced by many large clusters around the perimeter.

  • While there is room for improvement, they are cautiously optimistic that DAR can recover a nontrivial share of clusters without any knowledge of ENS identities. This speaks to the generality of DAR clusters.

  • Using DAR, they discover a recall of 39.4%.

  • A visual representation of 100,000 random embeddings of Ethereum addresses from the NODE collection, with embeddings projected down to two dimensions using PCA (trained on a subset of 1M address embeddings).

  • The color shows the density, estimated using a Gaussian kernel, where a lighter (yellow) color represents higher density.
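
The density coloring described above can be approximated with a plain Gaussian kernel density estimate over the projected 2-D points. This is a minimal sketch with a fixed, hand-picked bandwidth; `gaussian_kde_2d` and the toy points are hypothetical, and the figure in the paper may have used a library routine with automatic bandwidth selection.

```python
import math

def gaussian_kde_2d(points, bandwidth):
    """Return a function estimating density at (x, y) from 2-D sample points."""
    n = len(points)
    # Normalizing constant of an isotropic 2-D Gaussian, averaged over n points.
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            sq_dist = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density

# Four points packed together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
density = gaussian_kde_2d(points, bandwidth=0.5)
dense = density(0.05, 0.05)
sparse = density(5.0, 5.0)
print(dense > sparse)
```

Points in crowded regions of the embedding space receive a higher estimate, which is what the lighter color in the plot encodes.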

  • NODE has a recall of 37.8%, which is about 2 percentage points lower than DAR.

  • When DAR and NODE are employed together, the recall rises to 44.8%, about 5 percentage points above DAR alone and 7 above NODE alone, demonstrating that DAR and NODE locate different “types” of clusters.

  • Recall of held-out address clusters, derived from ENS, using DAR, Diff2Vec (NODE), and a combination of both (BOTH). A greater recall indicates a more effective heuristic.
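
The recall comparison can be made concrete with a small sketch. It assumes recall is measured over pairs of addresses known (for example via ENS) to share an owner, and counts a pair as recovered if a heuristic places both addresses together; the exact evaluation protocol in the paper may differ, and all names and data below are hypothetical.

```python
def recall(ground_truth_pairs, predicted_pairs):
    """Fraction of known same-owner pairs that a heuristic also links."""
    truth = {frozenset(p) for p in ground_truth_pairs}
    predicted = {frozenset(p) for p in predicted_pairs}
    return len(truth & predicted) / len(truth)

# Held-out pairs known to share an owner (toy data).
truth = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2"), ("e1", "e2")]
# Pairs linked by each heuristic; order within a pair does not matter.
dar_pairs = [("a1", "a2"), ("b2", "b1"), ("x", "y")]
node_pairs = [("b1", "b2"), ("c1", "c2")]

recall_dar = recall(truth, dar_pairs)
recall_node = recall(truth, node_pairs)
recall_both = recall(truth, dar_pairs + node_pairs)
print(recall_dar, recall_node, recall_both)
```

The combined recall exceeds either heuristic alone only because the two find partly different pairs, which mirrors the observation that DAR and NODE locate different kinds of clusters.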

  • Plot of the percentage of compromised versus uncompromised (pink) deposits by pool.

  • They discover that some pools may be severely compromised (such as the cDAI and WBTC pools), while others are less affected (e.g. USDC).

Discussion and Key Takeaways

  • Limitations: The heuristics will likely produce false positives in practice, choosing proper hyperparameters is a challenge, and the heuristics are computationally expensive.
  • Extensions: The use of off-chain data may result in more effective de-anonymization attacks.
  • Broader Impact: Privacy solutions, in conjunction with regulation, will need to account for factors such as money laundering and illegal activity.

Implications and Follow-ups

  • The account model encourages account reuse, implying that most users only have a few accounts and that accounts held by the same user may effectively be grouped.
  • Tornado Cash heuristics are significantly simpler and more predictable than Ethereum heuristics.
  • While Ethereum-wide heuristics (DAR and NODE) find clusters of compromised addresses, Tornado Cash heuristics find clusters of compromised transactions. A subset of the Tornado Cash heuristics could also be applied to Ethereum.
  • Tutela assesses anonymity using only on-chain data.

Applicability

  • These findings may aid Tornado Cash developers and users in measuring and comprehending the level of user privacy provided.
  • Tornado Cash heuristics are significantly simpler than Ethereum heuristics, but because only a tiny portion of Ethereum addresses are Tornado Cash users, they are limited in their application to the bulk of prospective Tutela users.
  • Tutela will assist law-abiding blockchain users in better protecting themselves in the present environment until privacy solutions paired with legislation are available.

Thank you @stayhungry07212 for your time and effort on this paper. I think it is becoming obvious that the pseudonymity underpinning blockchains like Bitcoin and Ethereum breeds a sense of privacy. And I am sure that this can enable money laundering through a large number of addresses, as well as unfair voting power distributed among multiple addresses owned by the same user.

I think it is important to identify addresses that are linked to the same entity, and this is predominantly done through heuristics. However, this approach has its shortcomings, which may be why the authors acknowledged that heuristics are not perfect measures. I am of the view that this tool is a good development. For instance, in February 2021, Tornado Cash introduced anonymity mining. It was an incentive scheme to encourage more deposits in Tornado Cash pools, thereby increasing their anonymity sets. Participants were rewarded a fixed amount of anonymity points (AP) based on how long they left their assets in a pool. Is it not a good thing? After withdrawing assets, users can claim anonymity points. The amount withdrawn is recorded in the transaction. If a user uses a single address to claim all of their anonymity points, one can calculate the exact number of Ethereum blocks that their assets were in the pool, because the AP yields were public and fixed.

This work was financed by the Tornado Cash community bounty to develop anonymity tools to protect user privacy, which is a good thing. I think they did a good job. I additionally observed that Tutela uses only on-chain data to assess anonymity. However, I think it would be a good thing if extensions could be made to include off-chain data, such as from decentralized applications (e.g. DeFi, NFT, games, etc.), layer two data, external blockchains, and more. The inclusion of off-chain data could lead to more powerful deanonymization attacks. There is a need for greater privacy on the blockchain to accelerate adoption.
