- Users today expect more security from services that handle their data. In addition to traditional data privacy and integrity requirements, they expect transparency, which means that the service’s processing of the data is verifiable by users and trusted auditors.
- Naïve solutions can provide either privacy or transparency, but not both: for example, making all data public provides transparency but not privacy, whereas relying on a trusted server to store and process data provides privacy but not transparency.
- In our design, we combine an efficient tree structure with zero-knowledge cryptography to provide privacy, integrity, and transparency for a rich set of database operations, while achieving practical performance at scale.
Can we achieve the potentially conflicting design goals of privacy and transparency using zero-knowledge cryptography, while maintaining practical performance for a rich set of database operations?
Daniël Reijsbergen, Aung Maw, Zheng Yang, Tien Tuan Anh Dinh, and Jianying Zhou. “TAP: Transparent and Privacy-Preserving Data Services.” To appear at USENIX Security 2023.
- Data Services: In our setting, a data service collects data from users and allows them to perform a variety of SQL queries that compute aggregates over subsets, such as SUM, MIN/MAX, and QUANTILE queries.
- Binary Trees: A layered data structure consisting of nodes, such that each node points to at most two child nodes in the lower layer.
- Merkle Trees: A Merkle tree is a type of binary tree in which the hashes of individual data values are stored in the leaf nodes of the tree, and the value in each intermediate node equals the hash of the data in two underlying nodes. Given a leaf node and the root of a tree, it is easy to prove that the leaf node is indeed included in the tree through a Merkle inclusion proof.
- Cryptographic Commitments: A cryptographic commitment c(x,r) of a data value x and a random value r allows someone to commit to a value x without revealing it. Without knowledge of r, the commitment c(x,r) reveals nothing about x. A commitment scheme is called additively homomorphic if c(x, r) + c(x’, r’) = c(x+x’, r+r’).
- Zero-knowledge Range Proofs In our setting, a zero-knowledge range proof system allows someone to prove that the underlying value x in a commitment c(x,r) is within some range [a,b]. Importantly, the proof does not reveal anything about the underlying value x apart from a <= x <= b.
We present TAP, a multi-user data service that protects data privacy and integrity while providing transparency to a wide range of operations.
We illustrate TAP using three use cases: smart grids, congestion pricing, and digital advertising.
The system model of TAP consists of a server, users, auditors, and a bulletin board.
The server runs the data service by collecting user data, building and maintaining a secure data structure on top of the data, and responding to user queries.
Users upload one data entry per time slot and perform queries to compute aggregates over subsets of the data. Users also monitor the data structure to verify that their data has been included correctly.
Auditors have a powerful computer and verify some structural properties of the data structure maintained by the server.
The bulletin board (for example, a public blockchain) allows users to detect equivocation, which means that the server cannot present different versions of the data structure to different users.
Each data entry consists of a User ID, a time slot, a data value, and some attributes (e.g., neighborhood and household size in a smart grid).
We assume the following threat model: the server wants to convince users of incorrect query results, and honest-but-curious users want to use proofs and query results to learn the privacy-sensitive data of other users.
Our goal is to design a transparent data service that guarantees the following properties under the threat model:
Rich operations for multiple users: the system supports a wide range of queries on the aggregate data generated by multiple independent users.
Data privacy: a user can only learn a limited number of other users’ values by performing queries.
a. Data integrity: the server cannot change user data without being detected.
b. Transparency: for each supported query, the server cannot convince the user to accept incorrect results computed from incorrect, incomplete, or artificial data.
Efficiency: the computation, storage, and network costs at the server and the user’s client are small. Query overheads grow sublinearly with the number of users and epochs.
Various existing approaches can provide some, but not all of these properties.
- TAP uses a new tree structure to support efficient queries and proofs, which extends, and combines elements from, previous approaches such as Proofs-of-Liabilities and Merkle².
- User values are stored in sum trees, which are an extended version of Merkle trees that not only store the hashes of user values in the leaves, but also the cryptographic commitments of these values. Furthermore, the intermediate nodes do not only store hashes but also the sums of the commitments in the child nodes. (This is only possible if the commitment scheme is additively homomorphic.)
- Each sum tree corresponds to a unique combination of data attributes, and the various sum trees are combined in a prefix tree that allows users to specify the range of attributes over which they perform the query; for example, over which neighborhoods they want to compute the average energy consumption.
- In TAP, the prefix tree is append-only and chronological (batches of new leaf nodes are always added to the right of the existing leaf nodes), and the leaves in sum trees are sorted in terms of their values.
- The above properties can be verified by auditors without revealing privacy-sensitive user data: it is easy to verify that the prefix tree is append-only and chronological, and auditors can verify that the leaf values are sorted correctly using zero-knowledge range proofs.
- Each type of supported query – inclusion, sums, minima/maxima, and quantiles – has an associated proof that allows users to verify that the computation has been performed correctly.
- For example, the proof for a MAX query shows that among all sum trees, the rightmost leaf in each sum tree in the desired range is at most equal to the MAX value, and that the value of at least one of the rightmost nodes is equal to it.
- TAP also supports sums of higher (statistical) moments of the data values, which, for example, enables users to compute sample standard deviations over subsets of the data.
- TAP achieves all of the main properties of a transparent data service.
- Rich operations, including sums, sums over higher moments, minima/maxima, and quantiles, for multiple independent users.
- Privacy, because queries and proofs do not reveal user values, but only commitments. However, in some cases individual measurements may be revealed - for example, a min/max query will reveal a single data value. However, this value cannot be trivially linked to a user’s identity.
- Integrity, because if the server adds incorrect user values, then this is detected by the users who monitor their data through inclusion proofs, and modifying or removing data is detected by auditors.
- Transparency, because the server is not able to generate valid proofs for incorrect computations.
- Efficiency, as witnessed by a numerical evaluation that investigates end-to-end performance, a comparison against the most closely related baselines (although these satisfy different design goals), and scalability experiments.
- A minority of compromised users can manipulate query results, but some query types are more robust than others. (For example, to change a median, the server would need to compromise more than 50% of the users in the system).
- Our scalability experiments demonstrate that TAP can handle realistically-sized systems, such as smart grids with more than a million users, congestion pricing systems with dozens of gantries and hundreds of thousands of vehicles, or digital advertising platforms with hundreds of advertisers and tens of thousands of websites.
- Server misbehavior can be proven and disseminated through the bulletin board, for example through a blockchain smart contract.
- TAP’s security relies on users who monitor the validity of their data, for which they have an incentive if their bills/payments depend on the data.
- Our experiments are fully reproducible: the code is available on GitHub and as a virtual machine image on Amazon Web Services.
- Query results can reveal some user results: for example, a min/max query on power consumption in a neighborhood will reveal the exact power consumption of the smallest/biggest user, which may violate their privacy. However, TAP can be extended to guarantee an even stronger notion of privacy, namely differential privacy.
- There may be room for performance optimization in specific contexts, for example when the data entry for most users is zero in each time slot.
- TAP can also be extended to other types of queries, e.g., Spearman correlation.
- TAP is generally applicable to data systems where a server performs aggregate queries on user data, and in which a fixed set of users upload several data values per day.
- The operations supported by TAP allow users to compute a wide range of aggregates, such as averages, standard deviations, medians, percentiles, etc.
- Three application areas are discussed in detail in the paper: smart grids, congestion pricing, and digital advertising.
- However, new use cases for transparent data services are likely to emerge as consumers demand greater transparency for operations on their data.
- Transparent data services can be integrated with blockchains, as smart contracts can be used to verify structural properties of the tree (such as whether it is append-only and chronological) or to prove server misbehavior.