Research Summary: TAP: Transparent and Privacy-Preserving Data Services

TLDR

  • Users today expect more security from services that handle their data. In addition to traditional data privacy and integrity requirements, they expect transparency, which means that the service’s processing of the data is verifiable by users and trusted auditors.
  • Naïve solutions can provide either privacy or transparency, but not both: for example, making all data public provides transparency but not privacy, whereas relying on a trusted server to store and process data provides privacy but not transparency.
  • In our design, we combine an efficient tree structure with zero-knowledge cryptography to provide privacy, integrity, and transparency for a rich set of database operations, while achieving practical performance at scale.

Core Research Question

Can we achieve the potentially conflicting design goals of privacy and transparency using zero-knowledge cryptography, while maintaining practical performance for a rich set of database operations?

Citation

Daniël Reijsbergen, Aung Maw, Zheng Yang, Tien Tuan Anh Dinh, and Jianying Zhou. “TAP: Transparent and Privacy-Preserving Data Services.” To appear at USENIX Security 2023.

Background

  • Data Services: In our setting, a data service collects data from users and allows them to perform a variety of SQL queries that compute aggregates over subsets, such as SUM, MIN/MAX, and QUANTILE queries.
  • Binary Trees: A layered data structure consisting of nodes, such that each node points to at most two child nodes in the lower layer.
  • Merkle Trees: A Merkle tree is a type of binary tree in which the hashes of individual data values are stored in the leaf nodes of the tree, and the value in each intermediate node equals the hash of the data in two underlying nodes. Given a leaf node and the root of a tree, it is easy to prove that the leaf node is indeed included in the tree through a Merkle inclusion proof.
  • Cryptographic Commitments: A cryptographic commitment c(x,r) of a data value x and a random value r allows someone to commit to a value x without revealing it. Without knowledge of r, the commitment c(x,r) reveals nothing about x. A commitment scheme is called additively homomorphic if c(x, r) + c(x’, r’) = c(x+x’, r+r’).
  • Zero-knowledge Range Proofs In our setting, a zero-knowledge range proof system allows someone to prove that the underlying value x in a commitment c(x,r) is within some range [a,b]. Importantly, the proof does not reveal anything about the underlying value x apart from a <= x <= b.

Summary

  • We present TAP, a multi-user data service that protects data privacy and integrity while providing transparency to a wide range of operations.

  • We illustrate TAP using three use cases: smart grids, congestion pricing, and digital advertising.

  • The system model of TAP consists of a server, users, auditors, and a bulletin board.

  • The server runs the data service by collecting user data, building and maintaining a secure data structure on top of the data, and responding to user queries.

  • Users upload one data entry per time slot and perform queries to compute aggregates over subsets of the data. Users also monitor the data structure to verify that their data has been included correctly.

  • Auditors have a powerful computer and verify some structural properties of the data structure maintained by the server.

  • The bulletin board (for example, a public blockchain) allows users to detect equivocation, which means that the server cannot present different versions of the data structure to different users.

  • Each data entry consists of a User ID, a time slot, a data value, and some attributes (e.g., neighborhood and household size in a smart grid).

  • We assume the following threat model: the server wants to convince users of incorrect query results, and honest-but-curious users want to use proofs and query results to learn the privacy-sensitive data of other users.

  • Our goal is to design a transparent data service that guarantees the following properties under the threat model:

    1. Rich operations for multiple users: the system supports a wide range of queries on the aggregate data generated by multiple independent users.

    2. Data privacy: a user can only learn a limited number of other users’ values by performing queries.

    3. a. Data integrity: the server cannot change user data without being detected.
      b. Transparency: for each supported query, the server cannot convince the user to accept incorrect results computed from incorrect, incomplete, or artificial data.

    4. Efficiency: the computation, storage, and network costs at the server and the user’s client are small. Query overheads grow sublinearly with the number of users and epochs.

  • Various existing approaches can provide some, but not all of these properties.

Method

  • TAP uses a new tree structure to support efficient queries and proofs, which extends, and combines elements from, previous approaches such as Proofs-of-Liabilities and Merkle².
  • User values are stored in sum trees, which are an extended version of Merkle trees that not only store the hashes of user values in the leaves, but also the cryptographic commitments of these values. Furthermore, the intermediate nodes do not only store hashes but also the sums of the commitments in the child nodes. (This is only possible if the commitment scheme is additively homomorphic.)
  • Each sum tree corresponds to a unique combination of data attributes, and the various sum trees are combined in a prefix tree that allows users to specify the range of attributes over which they perform the query; for example, over which neighborhoods they want to compute the average energy consumption.
  • In TAP, the prefix tree is append-only and chronological (batches of new leaf nodes are always added to the right of the existing leaf nodes), and the leaves in sum trees are sorted in terms of their values.
  • The above properties can be verified by auditors without revealing privacy-sensitive user data: it is easy to verify that the prefix tree is append-only and chronological, and auditors can verify that the leaf values are sorted correctly using zero-knowledge range proofs.
  • Each type of supported query – inclusion, sums, minima/maxima, and quantiles – has an associated proof that allows users to verify that the computation has been performed correctly.
  • For example, the proof for a MAX query shows that among all sum trees, the rightmost leaf in each sum tree in the desired range is at most equal to the MAX value, and that the value of at least one of the rightmost nodes is equal to it.
  • TAP also supports sums of higher (statistical) moments of the data values, which, for example, enables users to compute sample standard deviations over subsets of the data.

Results

  • TAP achieves all of the main properties of a transparent data service.
  • Rich operations, including sums, sums over higher moments, minima/maxima, and quantiles, for multiple independent users.
  • Privacy, because queries and proofs do not reveal user values, but only commitments. However, in some cases individual measurements may be revealed - for example, a min/max query will reveal a single data value. However, this value cannot be trivially linked to a user’s identity.
  • Integrity, because if the server adds incorrect user values, then this is detected by the users who monitor their data through inclusion proofs, and modifying or removing data is detected by auditors.
  • Transparency, because the server is not able to generate valid proofs for incorrect computations.
  • Efficiency, as witnessed by a numerical evaluation that investigates end-to-end performance, a comparison against the most closely related baselines (although these satisfy different design goals), and scalability experiments.
  • A minority of compromised users can manipulate query results, but some query types are more robust than others. (For example, to change a median, the server would need to compromise more than 50% of the users in the system).

Discussion and Key Takeaways

  • Our scalability experiments demonstrate that TAP can handle realistically-sized systems, such as smart grids with more than a million users, congestion pricing systems with dozens of gantries and hundreds of thousands of vehicles, or digital advertising platforms with hundreds of advertisers and tens of thousands of websites.
  • Server misbehavior can be proven and disseminated through the bulletin board, for example through a blockchain smart contract.
  • TAP’s security relies on users who monitor the validity of their data, for which they have an incentive if their bills/payments depend on the data.
  • Our experiments are fully reproducible: the code is available on GitHub and as a virtual machine image on Amazon Web Services.

Implications and Follow-Ups

  • Query results can reveal some user results: for example, a min/max query on power consumption in a neighborhood will reveal the exact power consumption of the smallest/biggest user, which may violate their privacy. However, TAP can be extended to guarantee an even stronger notion of privacy, namely differential privacy.
  • There may be room for performance optimization in specific contexts, for example when the data entry for most users is zero in each time slot.
  • TAP can also be extended to other types of queries, e.g., Spearman correlation.

Applicability

  • TAP is generally applicable to data systems where a server performs aggregate queries on user data, and in which a fixed set of users upload several data values per day.
  • The operations supported by TAP allow users to compute a wide range of aggregates, such as averages, standard deviations, medians, percentiles, etc.
  • Three application areas are discussed in detail in the paper: smart grids, congestion pricing, and digital advertising.
  • However, new use cases for transparent data services are likely to emerge as consumers demand greater transparency for operations on their data.
  • Transparent data services can be integrated with blockchains, as smart contracts can be used to verify structural properties of the tree (such as whether it is append-only and chronological) or to prove server misbehavior.
10 Likes

Well articulated i must say…I couldn’t help wonder, Is it possible for a multiuser system to provide both privacy and transparency?:thinking:

1 Like

Hi Jas_mine: yes, we can use zero-knowledge cryptography to prove that calculations were correctly performed on a dataset, without revealing the actual data values.

Users then only have to check the correctness of their individual data values. If each user does this, then the data must be correct and the server’s calculation on it must also be correct because of the zero-knowledge proof.

So this approach provides transparency in the sense that if the server cheats, a user will detect it; but it also provides privacy because users do not learn the data of other users.

2 Likes

Hello @dreijs , nice work, I just wanted to express my understanding of your paper.TAP, is a multi-user data service that provides data privacy, integrity, and transparency for user queries. TAP is designed to store data from multiple users, thus meeting the first requirement. It protects data privacy. it helps in storing cryptographic commitments instead of
raw data, and by publishing zero-knowledge proofs. TAP’s Merkle tree structure ensures data integrity and allows users to verify the
correctness of a broad range of queries. To reason about the security of TAP, the author states that the following are crucial;

  1. that each user adds one data entry per time slot.

  2. that the set of
    users (but not their data) at each time slot is known to a superauditor (e.g., a regulator or watchdog).

3.that the fraction of
adversarial users is bounded.

For each type of query supported by TAP, I think the user can verify that the result is correct. For look-up queries, the user is provided with inclusion and non-inclusion proofs. For sum
queries, the user is given inclusion and completeness proofs of the relevant nodes in the prefix tree, with which it can verify the sum by exploiting the additively homomorphic property of commitments
in TAP.

For future purpose, I think there is to add support for additional queries to TAP, e.g., Spearman rank
correlation between a recent set of measurements and another set
of aggregates.

1 Like

Hey @dreijs, outstanding research you have here!

According to your research, auditors verify some structural properties of the data structure maintained by the server.

During this verification, the auditors are exposed to sensitive information. Is this not a sort of privacy breach?

My concern is, what is the relationship between zero- knowledge proof and auditing? Because it seems privacy can’t be achieved if auditing is involved.

Hope I’m not mixing things up.

4 Likes

Hi @Lanedot, thanks for the summary!

Just to give some additional info: the main reasoning behind the security assumptions is that if the server is able to create an unlimited number of fake/dummy users, then the server can also arbitrarily distort query results. However, if the fraction of compromised users is bounded, then some types of queries are still robust: for example, to distort the median, the server would need to compromise more than 50% of the users.

Regarding future work, there is also the implementation of a version of TAP with differential privacy, and the potential for performance optimizations in certain settings.

1 Like

Hi @Yeoriton56, auditors need to request the tree itself, but not the values. The nodes in the tree only contain counters, hashes, and cryptographic commitments, which do not reveal sensitive data.

The structural properties of the tree can be verified without revealing data values. For example, to show that the leaf values x_1, …, x_n in a sum tree are sorted, the server can provide n - 1 zero-knowledge range proofs, namely:

  • x_2 - x_1 >= 0,
  • x_3 - x_2 >= 0,
  • …,
  • x_n - x_{n-1} >= 0.

So it is possible for the auditor to verify that the data values x_1, …, x_n are stored in a particular way, without revealing any of the individual data values.

3 Likes

@dreijs thanks for your detailed explanation. I understand better now. This was helpful.