What Constitutes a Good Test Suite?

When auditing a smart contract project, the security reviewer should take a close look at the existing tests, assess the quality of the entire test suite, and make recommendations about what should be improved.

There are many ways in which developers choose to test their code. In some cases it’s easy to give recommendations for improvement; in others, it is not clear what can be improved or by how much. Here are just a few examples of what I’ve seen in the wild:

  1. No tests whatsoever. In this case it’s easy to recommend adding automated tests that cover all the implemented functionality, e.g. tests that can be run via truffle test.

  2. Tests which cover all lines of code but contain almost no assertions checking the effects of the executed code (see the first sketch after this list). Recommending more assertions is the general advice here, but it’s hard to quantify the amount and quality of assertions, because tools like solidity-coverage do not provide any metrics for assertions. One interesting research topic would be to come up with metrics for the quality of assertions in test suites.

  3. Tests with 100% line coverage, but <100% branch coverage. This generally means that some code paths are not tested, and those might be exactly the edge cases that lead to an unexpected loss of funds. For such test suites, the general recommendation is to add tests that exercise the untested code paths. However, it’s important to note that even 100% branch coverage can leave some paths unexecuted, because there is a significant difference between branch coverage and path coverage (see explanation here, and the second sketch after this list). Therefore, it’s up to the software devs/testers to think about all plausible paths and test them accordingly. One interesting research topic would be to come up with quality metrics for path coverage; a percentage is not always feasible, because many programs have an infinite number of paths.

  4. Only unit tests for all functions, but no integration tests. For such projects it’s important to recommend adding integration and end-to-end tests which combine sequences of smart contract function calls, possibly from different smart contracts. Such tests would ideally cover happy and unhappy paths in all use-cases of the project.

  5. Tests with poorly structured code, no comments, and no proper description of what the test is doing, e.g. it("should always pass"). In such cases the auditors should recommend adding comments and representative test descriptions, so that when a test fails the cause can be understood easily.
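To make example 2 concrete, here is a minimal sketch in TypeScript. The tiny in-memory ledger stands in for a deployed contract and is an assumption for illustration only:

```typescript
// A toy ledger standing in for the contract under test.
const balances = new Map<string, number>([["sender", 1000]]);

function transfer(from: string, to: string, amount: number): void {
  balances.set(from, (balances.get(from) ?? 0) - amount);
  balances.set(to, (balances.get(to) ?? 0) + amount);
}

// Weak test: executes transfer() (full line coverage) but verifies nothing,
// so even an implementation that credits the wrong account would pass.
transfer("sender", "recipient", 100);

// Stronger test: the same call, with assertions on every observable effect.
transfer("sender", "recipient", 100);
console.assert(balances.get("recipient") === 200, "recipient credited");
console.assert(balances.get("sender") === 800, "sender debited");
```

And for example 3, a sketch of why 100% branch coverage is weaker than path coverage (the function and values are made up for illustration):

```typescript
// Two independent branches give four execution paths.
function settle(amount: number, applyFee: boolean, applyBonus: boolean): number {
  let result = amount;
  if (applyFee) result -= 10;   // branch 1
  if (applyBonus) result *= 2;  // branch 2
  return result;
}

// Across these two tests each branch is both taken and skipped once,
// so branch coverage reports 100%...
console.assert(settle(100, true, false) === 90);
console.assert(settle(100, false, true) === 200);

// ...yet the (true, true) path, where the fee interacts with the bonus
// ((100 - 10) * 2 = 180), is never exercised, nor is (false, false).
```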

This list of examples and recommendations is not exhaustive. Therefore, it would be interesting to hear more examples from other devs/auditors and discuss other possible recommendations that can be given to developers to improve their test suites.

Regarding examples, recommendations, and what developers consider good testing practices, I recommend reading a recent paper from the ICSE’19 Software Engineering in Practice track: Practitioners’ Views on Good Software Testing Practices, by Kochhar et al. Note that the paper is not specific to test suites targeting smart contracts; rather, it is an assessment of test suites in general.

If you allow me, I would also like to bring another perspective to your post. There is no doubt about the importance of testing, especially in the crypto space. However, even if we do establish what a good test suite is, developers will most likely not always follow such practices. Among other reasons, to mitigate the risk of losing a market opportunity when facing time pressure, developers cut back on development quality and testing to save time, putting their effort into building a fully functional product by a business-agreed deadline.

This makes sense from a market-opportunity standpoint, as long as developers safeguard the code against unforeseen security flaws (e.g., by allowing a smart contract to be paused) or logical oversights (e.g., by allowing upgrade logic). Hence, since tests can never prove the absence of bugs, I would argue that auditors should prioritize making sure such safeguards are in place (which ones to use is likely to be project-specific). Lacking those safeguards is unacceptable for a crypto project; on the other hand, while lacking a comprehensive test suite is not ideal, it could be acceptable to some extent as long as the deployed code has safeguards in place.

What do you think?

Thanks for the elaborate reply, Leo, and for sharing that paper. I will check it out.

Totally agree that safeguards such as pausing a contract act as a safety net. However, they can also be seen as a centralization-of-power issue, where the owner is allowed to perform this privileged action, which is not always accepted by a community that wants as much decentralization as possible.

Moreover, some attacks can cause large damage (e.g., draining funds) in a single transaction, which gives the owners no chance to pause the contract. Therefore, I think such safeguards are not a great alternative to testing.

Makes sense?

Yeah, it does make sense. I agree that some safeguards rely on some sort of centralization of power and, as you pointed out, that is not widely accepted. In practice, though, I see many projects relying on some level of centralization. On many occasions it is very hard to get a fully decentralized system right from the start; this is generally a long engineering process that takes time and effort. As a compromise, due to budget and time pressure, projects tend to accept (at least temporarily) some level of centralization.

I do want to bring something to your attention, though. The attack you mentioned can, in theory, be handled, but it requires some level of monitoring. For instance, if you continuously monitor the transactions in the mempool to identify those that could potentially harm your smart contract (e.g., as happens with flash loans), then whenever a malicious transaction is seen you could immediately send another transaction with a higher gas price, effectively front-running the malicious one. Among other things, your transaction could blacklist the sender of the malicious transaction (assuming your contract supports this operation) or pause the contract altogether.
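Here is a minimal sketch of that idea in TypeScript with ethers.js (v5). The endpoint, key, contract address, pause() function, and the isSuspicious() heuristic are all placeholders made up for illustration, not a production design:

```typescript
import { ethers } from "ethers";

const provider = new ethers.providers.WebSocketProvider("wss://node.example/ws"); // placeholder node
const guardian = new ethers.Wallet(process.env.GUARDIAN_KEY!, provider);          // the pauser account

// Hypothetical ABI fragment: the monitored contract is assumed to expose pause().
const contract = new ethers.Contract(
  "0xYourContractAddress",
  ["function pause() external"],
  guardian
);

// Placeholder heuristic; reliably defining "harmful" is the hard part.
function isSuspicious(tx: ethers.providers.TransactionResponse): boolean {
  return tx.to === contract.address; // e.g., refine by decoding tx.data for dangerous selectors
}

// Watch pending transactions and try to front-run a suspicious one
// by submitting pause() with a higher gas price.
provider.on("pending", async (txHash: string) => {
  const tx = await provider.getTransaction(txHash);
  if (tx && tx.gasPrice && isSuspicious(tx)) {
    await contract.pause({ gasPrice: tx.gasPrice.mul(2) });
  }
});
```

Note that winning the gas auction is not guaranteed, so this is a mitigation rather than a proof of safety.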

What do you think?

Thanks for pointing that out Leo, it’s indeed an interesting solution. However, there are several questions that come to mind about your proposal:

  1. Wouldn’t this solution be more expensive, and require a greater level of expertise to develop and operate, than simply writing a test suite?

  2. Do you know of anyone offering this as a service? I just did a quick search for “front-running as a service” and found a Jun 2020 Medium post by Ed Felten of Offchain Labs: “Front-Running as a Service”.

  3. Defining “transactions that could potentially harm your smart contract” is not always easy. It may be the case that a legitimate whale transaction gets front-run by the monitoring system. How would you avoid such situations?

I’m also thinking this discussion is deviating slightly from the actual title of “What Constitutes a Good Test Suite?”. It might make sense to have a list of “Smart Contract Defences” that includes the things proposed so far, along with code analyzers, bounty programs, etc. Or would your other post, A Survey on Ethereum Systems Security: Vulnerabilities, Attacks, and Defenses, be the right place for such a list/discussion about defences?

Hi @banescusebi. Thanks for your reply. As I indicated, the proposed solution is rather theoretical and, to be feasible in terms of cost, it would have to be provided as an off-the-shelf solution (which, to the best of my knowledge, no company currently offers). False positives could occur; a mitigation strategy is to put such suspicious transactions on hold until someone (a central figure or not) flags them as safe.

I agree with you that we have broadened the discussion since your initial post. I will write a post later summarizing some of the topics we discussed here, and I would love to have your input.

Again, thanks for your great comments and I believe this conversation has been very fruitful and will likely provide research insights to those who read it :)

Great topic for discussion. Are there also some quantifiable metrics we can identify, so that we can start to build consensus on what “good” outcomes are?

Hi @Vishesh, thank you for joining this discussion. Some quantifiable metrics are instruction coverage and branch coverage; however, other metrics are exactly what I am interested in uncovering in this discussion. Do you have suggestions for complementary metrics?

@lnrdpss suggested looking into an ICSE paper from 2019 in which researchers surveyed 261 practitioners from 27 countries to investigate best practices for testing software projects. The paper does not propose new metrics; instead, it aims to validate the 29 hypotheses shown in the image below. I think some interesting metrics could be derived from these hypotheses, e.g. the average size of test cases in LoC, or the number of test cases that depend on other test cases. Thoughts?

@Vishesh @banescusebi Thanks for your contributions and great discussion.

A metric that could be interesting is the mutation score of a test suite. In summary, given a smart contract, one generates mutations of the original code and checks whether the test suite is able to find the errors introduced by the mutated (faulty) code.

A mutant is said to be killed if the test suite fails given the introduced change in the code (the mutation). The score is then given by:

Mutation score = killed mutants / total generated mutants

A perfect score (1) means that every mutation leads to an error being identified by the test suite; for example, if 80 out of 100 generated mutants are killed, the score is 0.8. However, the score is highly dependent on the number of generated mutants, which grows quickly as the code base increases. Hence, scalability is a major problem here.
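To make the mechanics concrete, here is a minimal sketch in TypeScript. The single mutation operator, the source file path, and running the suite via truffle test are all simplifying assumptions for illustration, not a real tool:

```typescript
import { execSync } from "child_process";
import { readFileSync, writeFileSync } from "fs";

const SOURCE = "contracts/Token.sol"; // hypothetical contract under test

// One toy mutation operator: flip each '+' to '-', one occurrence at a time.
// (Naive: it would also mutate '+' inside comments or strings.)
function generateMutants(original: string): string[] {
  const mutants: string[] = [];
  for (let i = 0; i < original.length; i++) {
    if (original[i] === "+") {
      mutants.push(original.slice(0, i) + "-" + original.slice(i + 1));
    }
  }
  return mutants;
}

// Run the project's test suite; a non-zero exit code means a failing suite.
function testSuitePasses(): boolean {
  try {
    execSync("truffle test", { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}

const original = readFileSync(SOURCE, "utf8");
const mutants = generateMutants(original);
let killed = 0;

for (const mutant of mutants) {
  writeFileSync(SOURCE, mutant);     // install the mutant
  if (!testSuitePasses()) killed++;  // a failing suite kills the mutant
}
writeFileSync(SOURCE, original);     // always restore the original source

console.log(`Mutation score: ${killed}/${mutants.length}`);
```

Real tools use many mutation operators and avoid re-running the full suite for every mutant, which is exactly where the scalability problem shows up.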

Some researchers have been studying lightweight mechanisms to estimate the mutation score, for instance by using code quality metrics. Check “Lightweight Assessment of Test-Case Effectiveness Using Source-Code-Quality Indicators” by Grano et al., published in IEEE Transactions on Software Engineering in 2019.

This could be an interesting line of research. Along the same direction as Grano’s work, one would need to define a set of code quality metrics specifically targeted at smart contracts, and use those as the grounds for a prediction model that estimates the mutation score.

What do you guys think?

Test suite quality is an interesting problem! @lnrdpss thanks for raising mutation testing. I have written a mutation testing tool for Solidity smart contracts, which you can find on GitHub: JoranHonig/vertigo (Mutation Testing for Ethereum Smart Contracts).

The following paper introduces Vertigo:

Practical Mutation Testing for Smart Contracts - J.J. Honig, M.H. Everts, M. Huisman

You’re right about the scalability of mutation testing; it is one of its biggest problems. The other is the required manual evaluation of “equivalent mutants”: mutants that do not semantically affect the code.

I’ve implemented several optimisation techniques in Vertigo so that it can scale to larger projects like OpenZeppelin’s smart contract library.

The most important ones are:

  • Incremental mutation testing
  • Parallel mutant evaluation
  • Mutant sampling (see the sketch after this list)
  • Computation of equivalent mutants with trivial compiler equivalence
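As a rough illustration of mutant sampling (my own sketch of the general idea, not Vertigo’s actual implementation): evaluate only a random subset of the mutants and use the sample’s kill rate as an estimate of the full mutation score:

```typescript
// Estimate the mutation score from a random sample of mutants.
// evaluate() is assumed to compile the mutant and run the test suite,
// returning true if the mutant was killed.
function estimateMutationScore(
  mutants: string[],
  sampleSize: number,
  evaluate: (mutant: string) => boolean
): number {
  // Fisher-Yates shuffle, then keep the first sampleSize mutants.
  const pool = [...mutants];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  const sample = pool.slice(0, Math.min(sampleSize, pool.length));
  if (sample.length === 0) return 0;
  const killed = sample.filter(evaluate).length;
  return killed / sample.length;
}
```

The trade-off is variance: the smaller the sample, the cheaper the run, but the noisier the estimate.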

In its next release, solidity-coverage will report more detailed code coverage, enabling another range of optimisations.

Static assessment of the mutation score

Providing metrics in shorter timeframes has its merits, especially for adoption by the development community.

I assume that developers might benefit from using both static techniques and mutation testing: an efficient approximation during development, and the more accurate but computationally expensive mutation testing in continuous integration.

@walker Thanks for the clarifications :)

Is there a publicly available PDF version of Vertigo’s paper, say on arXiv, that we could refer to?

@lnrdpss you can find a PDF version here

@walker Thanks for sharing.

Generally, I think using mutation testing to assess the effectiveness of a test suite is an interesting approach, and it has the potential to be a useful metric for evaluating the quality of a test suite. Mutation testing involves introducing intentional errors (mutations) into the code and seeing whether the test suite is able to detect them. The mutation score is then calculated as the proportion of mutants that are detected by the test suite.

There are several challenges to using mutation testing in practice, including the scalability issues you mentioned. Generating a large number of mutants can be time-consuming, and it may not be practical to test all of them in a large codebase. Additionally, mutation testing can produce a kind of false positive: an equivalent mutant may survive the test suite and be reported as a testing gap even though it does not represent a real problem in the code.

Despite these challenges, mutation testing can still be a useful tool for identifying weaknesses in a test suite and for helping to ensure that the code is thoroughly tested. It may be particularly useful in the context of smart contracts, where the consequences of errors can be especially severe.

Overall, I think that using code quality metrics as a way to estimate the mutation score of a test suite is an interesting idea, and it could be a valuable contribution to the field of software testing.

It sounds like you have done a lot of work on developing and optimizing mutation testing for smart contracts. Mutation testing is a powerful tool for evaluating the effectiveness of a test suite, but as you mentioned, it can be challenging to scale to larger projects and to eliminate false positives. Your work on developing optimization techniques, such as incremental mutation testing and parallel mutant evaluation, is very important for making mutation testing more practical and useful in the real world.

I also agree that combining static analysis techniques with mutation testing can be a useful approach, especially for developers who are trying to improve the quality of their test suites. By using static analysis tools during development to identify potential issues, and then using mutation testing to confirm the effectiveness of the test suite, developers can get a more comprehensive understanding of the quality of their code.

Thank you for sharing your work on mutation testing for smart contracts. I think it is an important area of research, and I hope that your contributions will help to make mutation testing more widely adopted and effective in practice.

I agree, mutation testing can be a useful tool for evaluating the quality of a test suite and for identifying weaknesses in the testing process. It is especially useful for finding tests that are insufficiently isolated or that have insufficient coverage of the code.

One potential way to address the scalability issues associated with mutation testing is the mutant sampling mentioned above: generate a smaller number of mutants and use them to estimate the overall mutation score of the test suite, by running the suite against the sampled mutants and calculating the proportion that are detected.

Another approach to improving the scalability of mutation testing is to use tools that automate the generation and execution of mutants. There are several tools available that can help to streamline the mutation testing process and make it more practical to use in large codebases.

It’s also important to keep in mind that mutation testing is just one tool among many that can be used to evaluate the quality of a test suite. Other approaches, such as code coverage analysis and static analysis, can also be useful for identifying weaknesses in the testing process and for ensuring that the code is thoroughly tested.
