Privacy-preserving computation for data mining

Brickell, Justin Lee

Privacy-preserving computation for data mining

Date

2009-05

Authors

Brickell, Justin Lee

Abstract

As data mining matures as a field and develops more powerful algorithms for discovering and exploiting patterns in data, the amount of data about individuals that is collected and stored continues to rapidly increase. This increase in data heightens concerns that data mining violates individual privacy. The goal of data mining is to derive aggregate conclusions, which should not reveal sensitive information. However, the data-mining algorithms run on databases containing information about individuals which may be sensitive. The goal of privacy-preserving data mining is to provide high-quality aggregate conclusions while protecting the privacy of the constituent individuals. The field of "privacy-preserving data mining" encompasses a wide variety of different techniques and approaches, and considers many different threat and trust models. Some techniques use perturbation, where noise is added (either directly to the database that is the input to the algorithm or to the output of queries) to obscure values of sensitive attributes; some use generalization, where identifying attributes are given less specific values; and some use cryp- tography, where joint computations between multiple parties are performed on encrypted data to hide inputs. Because these approaches are applied to different scenarios with different threat models, their overall e ectiveness and privacy properties are incomparable. In this thesis I take a pragmatic approach to privacy-preserving data mining and attempt to determine which techniques are suitable to real-world problems that a data miner might wish to solve, such as evaluating and learning decision-tree classifiers. I show that popular techniques for sanitizing databases prior to publication either fail to provide any meaningful privacy guarantees, or else degrade the data to the point of having only negligible data-mining utility. Cryptographic techniques for secure multi-party computation are a natural alternative to sanitized data publication, and guarantee the privacy of inputs by performing computations on encrypted data. Because of its heavy reliance on public-key cryptography, it is conventionally thought to be too slow to apply to real-world problems. I show that tailor-made protocols for specific data-mining problems can be made fast enough to run on real-world problems, and I strengthen this claim with empirical runtime analysis using prototype implementations. I also expand the use of secure computation beyond its traditional scope of applying a known algorithm to private inputs by showing how it can be used to e ciently apply a private algorithm, chosen from a specific class of algorithms, to a private input.