Differential privacy is a mathematically rigorous privacy model introduced in 2006. It works by deliberately injecting random noise into the results of analyses run on pseudonymized data sets, which helps prevent the reidentification of consumers during analysis.
Why differential privacy exists
Here’s an example: In 2006, Netflix released 100 million records detailing users’ movie ratings in an effort to get outside researchers to analyze and improve their movie recommendations. The data records had been scrubbed of any directly identifiable personal data, such as user names, email addresses, or billing information.
Shortly after the data was released, researchers found they were able to reidentify individuals associated with these records by comparing Netflix ratings for obscure movies with ratings for the same movies entered into the IMDb movie database, where more personal information was available. The set of obscure movies watched by an individual formed a unique “fingerprint” that could be used to work out which individuals were associated with the “anonymized” records. Researchers were successful 84% of the time for users who had both Netflix and IMDb accounts.
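The fingerprinting idea can be illustrated with a toy sketch. Everything here is hypothetical: the record IDs, profile names, and film titles are invented, and the matching rule (an exact subset match on non-popular titles) is a simplification chosen for illustration, not the method the researchers actually used.

```python
# Toy linkage sketch. All data below is invented for illustration.
anonymized = {
    "record_17": {"Obscure Film A", "Obscure Film B", "Popular Film"},
    "record_42": {"Obscure Film C", "Popular Film"},
}
public_profiles = {
    "alice": {"Obscure Film A", "Obscure Film B"},
    "bob": {"Popular Film"},
}
POPULAR = {"Popular Film"}  # titles too common to identify anyone


def link(anonymized, public_profiles, popular):
    """Map each anonymized record to a public profile when exactly one fits."""
    matches = {}
    for record_id, titles in anonymized.items():
        # The "fingerprint" is the set of obscure titles in the record.
        fingerprint = titles - popular
        candidates = [
            name
            for name, seen in public_profiles.items()
            if fingerprint and fingerprint <= seen
        ]
        # Only a unique candidate counts as a reidentification.
        if len(candidates) == 1:
            matches[record_id] = candidates[0]
    return matches


matches = link(anonymized, public_profiles, POPULAR)
```

In this sketch, a record is linked only when its obscure titles all appear in exactly one public profile, which is why rare viewing choices are more identifying than popular ones.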
While revealing movie ratings may seem harmless, exposing the full set of privately watched media for an individual could be key to inferring political affiliation, gender identity, or other protected data classes. Differential privacy makes it possible for companies to collect and share aggregate information about users while maintaining the privacy of individual users. The primary mechanism is to add random noise to the aggregate results, calibrated so that the presence or absence of any one individual changes the output very little.
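As a minimal sketch of adding noise to an aggregate, the example below applies the Laplace mechanism, one common way to achieve differential privacy, to a simple count. The data set, the function names, and the epsilon value of 0.5 are assumptions made for illustration; a production system would normally rely on a vetted differential privacy library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Return a count with Laplace noise (hypothetical helper)."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
# Invented data: 1,000 users, every third one liked the movie.
ratings = [{"user": i, "liked": i % 3 == 0} for i in range(1000)]
released = noisy_count(ratings, lambda r: r["liked"], epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate count and weaker privacy. Only the noisy value in `released` would be shared, never the exact count.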
Why would someone use differential privacy?
Differential privacy is used to protect consumer privacy and allow companies to collaborate under information security policies that might otherwise prohibit data sharing.
Today, many companies want to find safe, secure ways to both maintain control of their data and share it with permissioned parties. Differential privacy can help data owners preserve the privacy of their individual records by exposing only aggregate data, and thereby protect the value of their data and their customer relationships.
When would data collaborators choose to use differential privacy?
Differential privacy may be of special interest to parties that wish to collaborate but do not have a high level of trust in each other. It is well suited to use cases that need directional insights rather than a high degree of precision, because the added noise makes results approximate.
For example, if you are a retailer and want to collaborate with your suppliers, you may not want or need to use differential privacy to analyze shopper insights, as both parties are invested in protecting their shared consumer information and have long-standing relationships with similar data privacy controls in place. However, if you are a financial institution that must comply with industry-specific data regulations, you may want to use differential privacy when collaborating with an airline to find the joint customers most likely to respond to an offer for a co-marketed credit card.
Differential privacy—not a silver bullet
Differential privacy techniques do not, on their own, prevent consumer reidentification in an open environment unless every possible combination of queries is factored into calculating query outputs. If the same query can be repeated many times against the same database, with fresh noise added each time, an attacker can average the answers to estimate the noise and effectively cancel it out. Each answered query consumes part of what practitioners call the “privacy budget,” and once that budget is spent, further answers erode the privacy guarantee. This represents the limit of differential privacy techniques against arbitrary data query attacks.
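The effect of repeated queries can be sketched in a few lines. The example below assumes Laplace noise, a hypothetical true count of 334, and an epsilon of 0.5 per query, all chosen for illustration. It also assumes basic sequential composition, under which the epsilon values of successive queries add up.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # Difference of two exponential draws with mean `scale` gives Laplace noise.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_answer(true_value, epsilon, rng):
    """One noisy answer to a count query with sensitivity 1 (hypothetical)."""
    return true_value + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # seeded only so the sketch is repeatable
TRUE_COUNT = 334          # hypothetical value the noise is meant to hide
EPSILON_PER_QUERY = 0.5   # assumed privacy cost of each answer

# An attacker repeats the same query and averages the answers.
answers = [noisy_answer(TRUE_COUNT, EPSILON_PER_QUERY, rng) for _ in range(10_000)]
averaged = statistics.fmean(answers)

# Under basic sequential composition, the privacy cost of the queries adds up.
total_epsilon_spent = EPSILON_PER_QUERY * len(answers)
```

Here `averaged` lands very close to the true count even though each individual answer is noisy, while `total_epsilon_spent` grows with every query. A system that enforces a budget would stop answering once a chosen cap on the total is reached.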
To prevent reidentification after an arbitrary number of queries, data collaborators must go beyond differential privacy and create transparent and auditable collaborative environments to ensure that their deidentified data cannot be queried against any personally identifiable information. It’s also important to talk with your technology providers about specific use cases and make sure differential privacy is the right technique to use.