
BigQuery differential privacy: A Powerful Tool for Protecting Privacy

Google Cloud announced the public preview of BigQuery differential privacy, SQL building blocks that analysts and data scientists can use to anonymize their data.

Introduction

In today's data-driven world, it is more important than ever to protect the privacy of individuals. As more and more data is collected and stored, it becomes increasingly easy for companies and governments to track our movements, monitor our activities, and even predict our future behavior.

One way to protect privacy is differential privacy: a mathematical framework that adds carefully calibrated noise to aggregate results so that the overall statistical properties of the data are preserved, while making it very difficult to determine whether any individual's record contributed to a result.

Differential privacy has been used in a variety of applications, including:

  • Healthcare research: Researchers can use differential privacy to analyze medical records without compromising patient confidentiality. For example, a study might examine the relationship between certain genetic markers and the risk of a specific disease.
  • Targeted marketing: Marketers can use differential privacy to perform data analysis on customer behavior and preferences without revealing sensitive information. For example, they could use differential privacy to identify customer segments interested in a particular product category.
  • Census data analysis: Governments can use differential privacy to analyze census data accurately while respecting citizens' privacy. By anonymizing data contributions, governments can extract valuable demographic information without compromising personal identities.

How Differential Privacy Works

Differential privacy works by adding random noise to aggregate results so that the overall statistics are preserved but individual records cannot be singled out. The amount of noise depends on the query's sensitivity, that is, how much a single record can change the result, and on the privacy budget, usually called epsilon: the lower the epsilon, the stronger the privacy guarantee and the more noise is added. For example, a sum over medical billing amounts can be shifted by one record far more than a simple count can, so it needs more noise.

[Image credit: Harvard University]

There are a number of different ways to add noise to data. One common method is the Laplace mechanism: noise drawn from a Laplace distribution with a mean of zero and a scale proportional to the query's sensitivity divided by epsilon.
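
As a rough, back-of-the-envelope illustration (not how BigQuery implements it internally), the sketch below adds Laplace noise to a count by hand using the inverse-CDF trick. The hard-coded count of 42, the sensitivity of 1, and the epsilon of 1 are hypothetical values chosen only for the example.

  -- Illustrative only: add Laplace noise to a count manually.
  -- Scale b = sensitivity / epsilon; here sensitivity = 1 (a count) and epsilon = 1.
  -- Inverse-CDF sampling: noise = -b * SIGN(u) * LN(1 - 2 * ABS(u)), u ~ Uniform(-0.5, 0.5).
  SELECT
    true_count,
    true_count - (1.0 / 1.0) * SIGN(u) * LN(1 - 2 * ABS(u)) AS noisy_count
  FROM (
    SELECT 42 AS true_count, RAND() - 0.5 AS u
  );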

How does it work with Queries?

Here is a chart that shows guests arriving at a restaurant at different hours of the day (courtesy of the Google BigQuery documentation):

[Image: guests arriving at different hours of the day]

While this works fine for most hours, at 1 AM there is just one guest. That is not a great picture from a privacy perspective, and it is exactly where you may want to add noise.

[Image: noise added to the original table]

To avoid this kind of privacy issue, you can add random noise to the results by using differential privacy. In the comparison above, the results are anonymized and no longer reveal individual contributions.
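
In BigQuery, this kind of anonymized aggregation is written with the differential privacy clause. The sketch below is a minimal example, assuming a hypothetical restaurant_visits table with guest_id and arrival_hour columns; the epsilon and delta values are illustrative, not recommendations.

  SELECT WITH DIFFERENTIAL_PRIVACY
    OPTIONS (epsilon = 1.0, delta = 1e-5, privacy_unit_column = guest_id)
    arrival_hour,
    COUNT(*) AS noisy_guest_count  -- noisy count of guests per hour
  FROM `my_project.my_dataset.restaurant_visits`
  GROUP BY arrival_hour;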

GoogleSQL for BigQuery applies differential privacy to protect the privacy of individuals when you query data in BigQuery. When you query a dataset with the differential privacy clause, GoogleSQL for BigQuery will:

  • Compute per-entity aggregations for each group.
  • Limit the number of groups each entity can contribute to.
  • Clamp each per-entity aggregate contribution to be within a certain range.
  • Aggregate the clamped per-entity aggregate contributions for each group.
  • Add noise to the final aggregate value for each group.
  • Compute a noisy entity count for each group and eliminate groups with few entities.

The final result is a dataset where each group has noisy aggregate results and small groups have been eliminated. This protects the privacy of individuals while still allowing you to query the data and get meaningful results.
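
To make those steps concrete, here is a hedged sketch of how they surface as options in the differential privacy clause; again, the table and column names are hypothetical and the parameter values are purely illustrative.

  SELECT WITH DIFFERENTIAL_PRIVACY
    OPTIONS (
      epsilon = 1.0,                   -- privacy budget: lower epsilon means more noise
      delta = 1e-5,                    -- small probability of exceeding that budget
      max_groups_contributed = 1,      -- limit the groups each entity can contribute to
      privacy_unit_column = guest_id   -- the entity whose privacy is being protected
    )
    arrival_hour,
    -- clamp each guest's per-group contribution to [0, 10] before aggregating
    SUM(party_size, contribution_bounds_per_group => (0, 10)) AS noisy_total_party_size
  FROM `my_project.my_dataset.restaurant_visits`
  GROUP BY arrival_hour;

Groups whose noisy entity count falls below the threshold are then dropped from the result automatically.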

What can I do with BigQuery differential privacy?

  • Anonymize results with individual-record privacy
  • Anonymize results without copying or moving your data, including data from AWS and Azure with BigQuery Omni
  • Anonymize results that are sent to Dataform pipelines so that they can be consumed by other applications
  • Anonymize results that are sent to Apache Spark stored procedures
  • [Coming soon] Use differential privacy with authorized views and authorized routines
  • [Coming soon] Share anonymized data with BigQuery Data Clean Rooms

Here are some examples of how differential privacy has been used in practice:

  • In 2017, the US Census Bureau announced that it would use differential privacy to protect the privacy of individuals in the 2020 census.
  • In 2018, Google announced that it would use differential privacy to protect the privacy of users in its Search results.
  • In 2019, Apple announced that it would use differential privacy to protect the privacy of users in its Health app.

As the use of big data continues to grow, differential privacy will become an increasingly important tool for protecting privacy.

As Google Cloud continues to bring more of these features into its data clean room solutions, I will share more information here.

Learn more about BigQuery differential privacy in the Google Cloud documentation.

Please comment if you want to learn more about this interesting feature!