Research Summary and AMA with Giulia Fanti (Carnegie Mellon) on Analyzing Blockchain Incentive Mechanisms with Deep Reinforcement Learning

Dr. Giulia Fanti, an assistant professor of Electrical and Computer Engineering at Carnegie Mellon University, sat down with Chainlink Labs to discuss her co-authored paper “SquirRL: Automating Attack Discovery on Blockchain Incentive Mechanisms with Deep Reinforcement Learning”.

She has also offered to answer questions about the paper here on the Smart Contract Research Forum for a limited time, so please don’t hesitate to ask!


This recording is the first episode of a new Chainlink Research Report series which features short presentations of exceptional working papers by blockchain scholars around the world.

In this episode, Dr. Giulia Fanti, an assistant professor of Electrical and Computer Engineering at Carnegie Mellon, discusses her paper “SquirRL: Automating Attack Analysis on Blockchain Incentive Mechanisms with Deep Reinforcement Learning.” This paper uses deep reinforcement learning to detect security flaws related to blockchain incentive mechanisms.

Full Video:


  • Incentive mechanisms play an essential role in permissionless blockchains.
  • Designing incentive-compatible mechanisms, in which expression of true preferences are utility maximizing, is challenging.
  • Little is currently known about properties of incentive mechanisms currently operating on large-scale blockchains, making it difficult to test their behavior.
  • Deep reinforcement learning can identify new attack strategies and replicate known strategies such as selfish mining, helping to identify and improve upon weaknesses that were previously unclear.

Dr. Giulia Fanti:

Dr. Giulia Fanti is an assistant professor of Electrical and Computer Engineering at Carnegie Mellon University. She received her Ph.D. in EECS from U.C. Berkeley and her B.S. in ECE from Olin College of Engineering. She is also an academic partner with the Chainlink Labs research team and was previously awarded a Chainlink research grant to further her work.

Her research interests and publications focus on the algorithmic foundations of blockchains, distributed systems, privacy-preserving technologies, and machine learning. Giulia is also a fellow for the World Economic Forum’s Global Future Council on Cybersecurity, and has received a best paper award at ACM Sigmetrics and an NSF Graduate Research Fellowship.

Some of Giulia’s work:

Relevant links for Giulia:


I think this is a really cool implementation of RL, because RL is usually unstable and hard to train, but in this case, it is looking for strategies of exploiting profits, so stability would not be an issue.

It also makes me really curious about why the Nash equilibrium for 3 players and above is for them to be honest.

Could it be relying on the assumption that every player has about the same computing power? Because if so, does that apply to real life?


Thanks for the question–I want to stress that this is just a conjecture, so we don’t have any theoretical proof that honest behavior is a Nash Equilibrium for 3+ parties–it may not be the case! However, we did run some preliminary experiments a while back with varying number and fraction of hash power, and saw the parties appear to converge to honest mining in those settings too. So there is at least some empirical evidence to suggest that if the conjecture is true, it is not only true when all parties have the same hash power. It would be very interesting to see if this can be proved formally, even for e.g. only 3 parties.


Dr. Fanti, welcome to the forum! It’s a real treat having you here in the Forum. (and a big thank you to @jasonanastas for hosting this thread). You mentioned that this kind of selfish mining is very hard to detect; how do most platforms deal with this type of behavior?


This is indeed the textbook intersection between Deep Reinforcement learning based Data science and Blockchain via analysis of Blockchain incentive mechanisms!

I would imagine being able to give constructive feedback to blockchain protocol providers be a priceless insight to further develop end user adoption to their Blockchain protocol. To imagine analysis of Block chain mechanisms vulnerability is already underway while the adoption of the Blockchain towards Web 3.0 is yet to be widespread!

Indeed, I think the biggest hurdle for Blockchain companies is a well defined incentive mechanism to have end users go on chain. Competing against well established cloud services and its centralized ease of use, cross platform convenience and conventional paradigm of Web 2.0 is a difficult task to overcome. To this day, most industries are still trying to adapt to the Web 2.0 landscape!

Other than traditional Paas platform A/B testing, Design of experiments, or adversarial modeling of contributing agents to improve platform usage, analysis of the Blockchain network architecture itself would definitely bring much to new domains of datascience.

Out of the blue, I wonder how feasible is it to undergo such DRL analysis of block chain incentive mechanism? I would imagine setting up a Blockchain Paas to market others to go on chain be a difficult task in and of itself. How would the AIops data pipeline Blockchain architecture look like before even running such analysis?


Thank you for the question! Selfish mining actually is detectable in the real world by looking at the rates at which miners collect rewards, proportional to their hash power. So assuming we have a reasonable estimate for miners’ hash power, we should be able to get a pretty decent estimate as to whether a particular miner is engaging in selfish mining or not. Despite this, we have not seen substantial evidence of selfish mining in practice, to the best of my knowledge. (If you or anyone else knows of studies that have observed selfish mining in the real world, please share a link!) So in practice, many platforms don’t need to deal with this problem. One question that we wanted to understand in this work is why existing platforms don’t observe selfish mining; our results suggest that it may be because in the real world, there are multiple, competing parties, for whom it’s actually not profitable to deviate from the honest strategy (unlike in the two-party setting that is typically analyzed in academic literature).


Hi Tony, in general, DRL can be sensitive to hyperparameters and unstable, as other commenters have pointed out, which makes it potentially tricky to use in practice. It can also be computationally intensive. Our experiments were run on a departmental cluster, and each experiment could take as long as a few days to complete. We believe this is ok from a latency perspective (you only need to analyze your incentive mechanism once), but the associated computational costs could be prohibitive. I expect these costs will come down in the coming years, but these factors could certainly be a barrier to deployment.


Thank you so much for taking the time to share your experiment results, Dr. Fanti.

If you don’t mind me asking, what do you think would happen if there are 3 players, but the hash powers are extremely tailed?

I was thinking that suppose it was (almost 0, almost 50%, almost 50%), then that would be a setting close to 2 player game.

Do you think selfish mining could happen in this setting? When we gradually move the distribution this way (+, -, -), do you think we can assume continuity on player behavior change, and observe a “tipping point” in which selfish mining behavior vanishes?

I’m also curious about what impact do you anticipate different rewarding policies (e.g. policy gradient) would have on the outcome? How do the RLs play differently when they can think “long term”?


I know because of time limitations you had to concentrate on case studies 3 and 4 in this video. Case study 4, which combined a selfish mining attack with a voting attack, piqued my curiosity. The results slide showed that SquirRL (DRL framework you created to analyze incentive mechanisms in an automated way), started to show increased voting reward fraction vs. honest behavior somewhere between 0.198 and 0.231 attacker’s voting/mining power. And it looks like the voting reward fraction increased significantly from there as the attacker’s voting/mining power approached 0.330.

Expanding on the idea of combining attack scenarios, to include Exploring the Attack Surface of Blockchain: A Systematic Overview Figure 25, and possible combinations; when n = 22 and r = 2, C(n,r) = 231. And of course that’s the simplest version, where r = 2; and n = known attack types, but doesn’t account for yet unknown attack types. Additionally, since I’m citing a single source for context purposes, it seems almost certain that n > 22. Bottom line, combining attacks leads to a lot of possibilities.

Assuming a potential adversary would concentrate on trying to use a combination of attacks that would result in the highest rewards ($, voting power, etc.), is there an existing predictive model that estimates which combinations of these attacks are the most likely to occur? In other words, a way to prioritize the order in which SquirRL should simulate attack combinations in order to inform the community what to defend against? Or, are adversaries moving so quickly, that the community is left to simply catch-up and defend against already known attacks, much less possible combinations of attacks?

I’m not sure how many attack combinations have been simulated with SquirRL to date, but it would be interesting to see if somewhere between 0.198 and 0.231, other attack combinations saw a significant rise in increased voting reward fraction or not. And coming at it from that perspective, that > 0.198 attacker’s voting/mining power, could be an indicator/threshold of a potential/impending attack. Because I’m not versed on the research in this space, is there an already existing indicator/threshold used and/or known < 0.330?

Thank you for having the ingenious insight to use DRL instead of MDP to analyze this space and for taking the time to answer questions here!


Dr. Fanti, thanks very much for a fascinating presentation. I have a larger “overarching” question about the choice of consensus mechanisms.

In your video exchange with Jason Anastasopoulos you note that permissionless blockchains need an incentive mechanism to make the system work, but if the mechanism is poorly designed it can bring the whole system crashing down.

Many people have expressed concern that Proof of Work (PoW) greatly contributes to climate apocalypse, while Proof of Stake (PoS) contributes to the equally serious “inequality apocalypse” by insuring that the rich (those with the highest number of tokens) continue to get richer as the poor get poorer.

If there were ever “poorly designed consensus mechanisms” that will guarantee that “the whole system comes crashing down,” we seem to have hit them on our first two tries. Why do you think a more fair incentive scheme is NOT being implemented in the “more fair” world of DeFi?

Finally, what other options are open to humanity, technically speaking? Delegated Proof of Stake (DPoS), for example, seems to have intrinsic “fairness” advantages over PoS. Do you agree with that statement? Are there any other incentive mechanisms that researchers believe could result in a more stable economy over the long term?


We haven’t run that exact experiment, but in the 2-party setting (1 strategic, 1 honest), there is a known threshold of hash power below which selfish mining is no longer profitable, which can be computed (roughly 0.25, but depends on some parameter settings) (Sapirshtein, Sompolinsky, and Zohar, For this reason, my guess is that in the setting you are describing, it would be similar to 2 strategic parties and 1 honest party with just a little bit of hash power. And that actually is an experiment we ran (, Fig. 7). Here we see that as the two strategic parties get closer and closer to 50-50 hash power, their advantage from selfish mining vanishes. And actually, they seem to not be stealing from each other, but from the 3rd honest party, while converging to an honest strategy only when the hash power is actually 50-50. So I believe SM continues to be profitable for 2 strategic players, but our experiments suggest this may not be the case for 3 players.


Thanks for the question–the short answer is no, I don’t know of any systematic way of ordering or choosing what combinations of attacks to try first. One possibility is to give the RL agent the option of using any known attack, and it will choose which ones to exploit. However, this can be difficult to encode in a compact action space if there are many different action spaces for different attacks, which may affect learning stability. It’s the kind of thing that should be possible, but may be tricky/sensitive to hyperparameters in practice.

One broader point to consider is that many of the attacks in the figure you mentioned cannot be easily translated into a direct numeric reward that is comparable to the reward from other attacks (e.g., selfish mining and DDoS attacks have very different reward structures). In those cases, it doesn’t really make sense to use RL (or MDPs for that matter) to learn strategies combining the attacks. So there’s some domain knowledge needed to identify attacks with the same reward type and model the reward structure, so that an agent can meaningfully compare the reward(s) from different attacks.

Thanks for the question. DeFi applications are just that–applications. As you pointed out, they run on some underlying blockchain with its own (existing) consensus mechanism. This is mainly done for usability reasons, both for end users and for developers–if each DeFi app had its own blockchain with its own consensus and incentive mechanism, there would be much higher barriers to entry, and the security of each individual blockchain would be reduced due to less hash power/stake/resources backing it. So I think DeFi alone doesn’t really change anything with regards to this problem.

Actually, the same techniques we used in SquirRL could probably be used to learn strategies for trading in DeFi apps to maximize profit. DeFi is of course not fair at all, and there has been a lot of work showing how strategic users can profit off other naive users (Chainlink researchers are doing a lot of work to try to address this problem). So in that sense, many DeFi apps may be unfair both at the application layer and at the consensus layer.


Thank you for contributing, Giulia!

I noticed in the Q/A section of the webinar that you referenced how the flaws in current incentivize-based mechanisms could lead to collapse of the system attempting to be designed. I am curious as to whether or not there are any current examples of permissionless blockchains on a mass scale that you feel are showing signs of crashing because they lack the capability to fairly incentivize all miners.

1 Like

Dr. Fanti, thanks for your response. My impression (as a non-specialist reader) is that the comparative “fairness” of decentralized finance was one of its main selling points to the public at large. Therefore your statement that “DeFi is of course not fair at all” makes the general public’s perception of it as “just another scam” seem credible.

Speaking as a naive babe-in-the-woods, I thought that DeFi’s mathematicians were designing a transparent, close-to-incorruptible alternative to the world that existed circa ten years ago. What happened to that beautiful dream? Do the “researchers” you mention have any hope of resurrecting it, or have we all been out-finessed by the rich?

1 Like

Thanks for your response.

Aha! DRL could theoretically be used here, aka no need for actual human ordering of attack combinations, DRL should eventually lead to the best combo of attacks after several iterations. Understanding that this could theoretically be applied to your work, but may be tricky to apply in practice, is it something that you’ve considered and/or do you know of other researcher/s who might be attempting something like this?

Another aha! I was trying to combine apples and oranges, thanks for setting me straight. Since you mention above, that using a DRL agent to choose which known attacks to combine may be too tricky at this point, do you have plans to model other attack combinations that are apple + apple or orange + orange with SquirRL, by choosing attack combinations outright, as you did with the selfish mining + voting case study? Are there specific attack combinations you’d like to simulate in the future?


Thank you so much for this fantastic presentation! In response to the notion that it was not a Nash Equilibrium that was observed, conversely could it be inferred that too many cooperating selfish miners creates a Braess’ Paradox situation? In this case, a group of selfish miners colluding against each other and against the group effectively seem to balance the system by effectively negating the nefarious activity? I understand this may be extremely difficult to attempt to “prove”, but the collusion of 3 agents getting fewer benefits than 1-2 agents seems to have “some” effect. While it may seem counter-intuitive, is it possible that nefarious attackers forcing a system into honesty could be a way to overshoot the Nash Equilibrium into Braess’ Paradox only for that state to look like a Nash Equilibrium when the nefarious actors cancel each other out?

That is a hard question to ask in text format without getting it convoluted, so please let me know if I need to explain myself better!

1 Like

@Larry_Bates What might a Nash equilibrium look like emerging out of the current environment? What conditions might lead to Nash equilibrium based on where we are now?


Thank you for this question! Considering that the Nash Equilibrium necessitates participating parties understanding the strategies of other participants; I would imagine in the current landscape that a Nash Equilibrium could only emerge at the organizational level, as the user base is not informed enough to understand the strategies of all participants within the ecosystem.

One of the biggest hurdles to achieving a Nash Equilibrium currently is the notion of extractable value, in that extractable value does not always constitute a player incurring a loss but in many cases it does. Any time participants in a system incur a loss while someone else extracts value without the loss being a calculated step in a larger play; this becomes a hurdle to achieving a Nash Equilibrium and is still a zero-sum game or potentially even a negative-sum game in some cases.

All that to say, a Nash Equilibrium wouldn’t be absent of extractable value, but EV would not come at the loss of another player and value could only be extracted when the ecosystem produced excess or was in a situation where a product or service was valuable enough to swing the sum into a positive-sum game.

In short, the service or widget being produced within a system has to have supply and demand both effectively outpacing natural entropy in order for the system to be efficient enough to not become a negative-sum game. In other words, if the cost of running the ecosystem outpaces the profitability of the product or service being produced within the ecosystem, it becomes impossible to have a positive-sum game and in that context a Nash-Equilibrium could never actually be achieved anyway.

So to bring this full circle, a Nash Equilibrium would not be necessarily free of malicious actors; conversely, the participants in the game would be so aware of the malicious actors’ intentions that they would be able to anticipate them and thus prevent the malicious actors from unexpectedly extracting value in a situation where another player incurs a loss. One of the aspects of reaching that equilibrium is the players being able to anticipate the actions of the other players so much that the system ultimately finds balance because there are no surprises due to the players being informed.