Analyzing Data while Protecting Privacy – A Differential Privacy Case Study

Abstract

Analysis and sharing of aggregate data (e.g., the number of users whose favorite color is blue) is crucial to understanding patterns in the population being studied. The problem is that even summary data can expose the individuals who make up that data. We are therefore studying privacy preserving data analysis techniques that protect the individual while still allowing analysts to understand general trends. We believe that, although these techniques perturb the data in distinctive ways, analysts can still make use of the data while protecting the users behind it. The case study applies these algorithms to MSN reporting data. One group focused on event usage (e.g., the number of page views on Moneycentral.com) while another focused on service usage (e.g., the number and frequency of users of MSN Messenger). After four months of analysis, the first group did not accept the effects of privacy preserving data analysis. In contrast, the second group is prepared to use this type of reporting in production. Follow-up discussions and inquiries revealed that the first group had perceived data trust issues that conflicted with the use of the algorithms, whereas the second group trusted the data and was already familiar with privacy issues, hence the successful outcome. We conclude that when users trust the data, this privacy preserving algorithm makes it possible to analyze data while protecting privacy.

Additional Resources

Analyzing Data while Protecting Privacy via Differential Privacy: a Clinical Researcher’s Perspective

Protecting patient data confidentiality using differential privacy [CMU Mindswap for Privacy]

Differential Privacy: Case Studies

Privacy Preserving Data Analysis Poster board

Introduction

Based on an analysis of the 1990 US Census, 87% of the United States population is uniquely identifiable by the three attributes of zip code, date of birth, and gender [1]. Latanya Sweeney's breakthrough research involving the case of Governor William Weld clearly demonstrates that masking or hiding publicly available information does not adequately protect the privacy of an individual. Sweeney obtained masked medical data from the Group Insurance Commission of Massachusetts, which is responsible for state employees' health insurance. The masked medical data contained only supposedly non-identifiable information such as ethnicity, visit date, diagnosis, procedure, medication, total charge, zip code, date of birth, and gender. For a cost of $20, Sweeney then obtained the publicly available Cambridge, MA voter list.

Referring to the Venn diagram below, each circle represents one of the two data sources, and each by itself reveals little. The masked medical data on the left provides only medical information, with only gender, date of birth, and zip code identified. The voter list on the right provides personal information, but limited to voting information, address, date of birth, and gender. Recall, however, that 87% of the US population is uniquely identifiable by only zip code, date of birth, and gender. In the case of the Massachusetts governor, only six people in Cambridge, MA shared his date of birth, and only three of them were men. Of those three men, only he lived in his particular zip code. By joining the two data sources on nothing more than gender, date of birth, and zip code, Sweeney was able to identify and reveal the medical records of then-Governor William Weld.

[Figure 1: Venn diagram of the masked medical data and the voter list, overlapping on zip code, date of birth, and gender]

As one can see, this scenario stresses the inadequacy of ensuring privacy merely by masking medical data, and it requires us to change our perception of what data privacy really entails. It takes only a small amount of information to reveal the individual underneath the data.

Many current privacy solutions involve masking the data so that personally identifiable information remains hidden. Regulations such as HIPAA provide very explicit instructions on what information can and cannot be revealed in different scenarios [2]. Businesses such as Microsoft® provide detailed statements (e.g., the "Microsoft Online Privacy Statement") describing the personal information collected, the uses of that information, and the choice to opt out [3]. Academic institutions, such as Oregon Health & Science University, have requirements in addition to the HIPAA standards stating the permitted uses and disclosures associated with protected health information [4]. Yet, as the William Weld case shows, masking and explicit statements of use and disclosure do not necessarily ensure privacy. There are many other masking and data perturbation solutions, as can be seen in J.C. Cannon and A. Cavoukian's book "Privacy: What Developers and IT Professionals Should Know" [5]. The potential weakness of these approaches is that they may be too complicated to implement, or that modifying the data changes its meaning. This latter point is especially important because of its effect on analysis and research: if statistical analyses ranging from mean/median to advanced data mining techniques produce inconsistent values, the results and conclusions based on those statistics may be incorrect.

How, then, is it possible to protect an individual's privacy in a relatively simple manner while ensuring consistent statistical analysis for research purposes? This is the purpose of "Privacy Preserving Data Analysis," which is based on the work of Microsoft Research (MSR) researchers Cynthia Dwork and Frank McSherry (as well as S. Chawla, K. Talwar, A. Blum, K. Nissim, and A. Smith) [6, 7, 8]. The basic premise of their research is that adding exponentially distributed noise to the data can protect the individuals underneath the aggregate data. Recall that to attack aggregate data, one need only ask enough questions to drill down to a specific individual with very distinct attributes. In the Governor Weld case, the first question would be the number of people who had his date of birth (answer: 6). The next question is how many of those people were male (answer: 3). The final question is how many of those people lived in his 5-digit zip code (answer: 1). In this example, it took Sweeney only three questions to drill down to Governor Weld and his medical records [6]. But if we were to add exponential noise (some integer value between −∞ and +∞) to consistently change the above values, it would be impossible to drill down to this one person. For example, if the noise values applied to these questions were (−2, +1, +6), then the results of the above questions would be:

Question | True answer | Noise | PPH answer
People with the governor's date of birth | 6 | −2 | 4
…of those, who are male | 3 | +1 | 4
…of those, who live in his 5-digit zip code | 1 | +6 | 7

Looking at the PPH-applied answers, the added noise has made it impossible to drill down to a single individual; the data is therefore "privacy preserving." In this particular example the noise is larger than necessary; the Methods section describes how to calibrate it. Ultimately, the key is to add enough noise to the system so that individual answers change but the overall statistics do not. To prevent attacks on the noise algorithm itself, more noise needs to be added as more questions are asked. The Background section of this paper provides more details on how this works. The efficacy of the methodology is not in question for this case study, as it has already been amply researched and verified. The purpose of this case study is instead to determine the applicability of this noise algorithm in an actual reporting environment.

Background

Privacy preserving data analysis involves the addition of noise to the data in order to protect the individuals underneath the data. This case study used the privacy preserving histogram (PPH), which uses exponentially distributed noise (the algorithm is described in more detail in the Methods section). These concepts are based on the research in "Practical Privacy: The SuLQ Framework" [7]. The next few background sections describe both the sanitization concept and the privacy mechanism theorem associated with it.

Sanitization Concept

To protect the individuals comprising the data, we will mask these individuals within the data by creating a sanitization point between the user interface and the data.

[Figure 2: the interactive sanitizer К sitting between the database and the result set returned to the user interface]

The green shape on the left represents the database, while the right-most circle represents the answer or result set provided to the user interface. The middle section is the interactive sanitizer, denoted К, which introduces random noise to produce uncertainty and hence privacy. The magnitude of the noise is given by the following theorem:

If many queries f1, f2, … are to be made, noise proportional to Σi Δfi suffices. For many sequences, we can often use less noise than Σi Δfi. Note that ΔHistogram = 1, independent of the number of cells.

The sanitizer requires a carefully designed algorithm, discussed in more detail in the differential privacy section below. A safe choice for the amount of noise is a standard deviation equal to the total number of queries; if the number of queries is not known in advance, the standard deviation can instead be made proportional to the square of the number of queries asked so far. The noise itself should be freshly drawn (newly seeded) for each query. A small sketch of these two budgeting rules appears below.
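As a rough illustration of the two rules just described, the following sketch (with purely illustrative values and function names, not the case study's configuration) computes the noise scale per query for a known budget and for an open-ended sequence. In the open-ended case, letting the scale for the t-th query grow with t² keeps the accumulated privacy loss bounded, because the per-query loss shrinks roughly like 1/t² and that series converges.

#include <stdio.h>

/* Known total: every query gets the same noise scale, on the order of the
   total number of queries (for histograms, each Δf = 1). */
double noise_scale_fixed(int total_queries)
{
    return (double)total_queries;
}

/* Unknown total: the t-th query gets a scale growing with t^2, so the total
   privacy loss (which adds up roughly as the sum of 1/scale) stays bounded. */
double noise_scale_open(int t)
{
    return (double)t * (double)t;
}

int main(void)
{
    printf("fixed budget of 100 queries: scale %.0f for each query\n",
           noise_scale_fixed(100));
    for (int t = 1; t <= 5; t++)
        printf("open-ended sequence, query %d: scale %.0f\n",
               t, noise_scale_open(t));
    return 0;
}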

With this sanitizer in place, the algorithm is able to address these attacks: for each person, the increase in the probability of that individual (or anyone else, for that matter) being identified due to the contribution of their data is nominal. The drill-down example given earlier is foiled for two reasons: (a) the added noise (formally) complicates reconstruction of the underlying values, and (b) the number of queries is limited by the degree of privacy guaranteed, and the N queries such an attack requires is generally going to be far too many.

Mathematically, at this sanitization point, the PPH applies a small amount of noise to each cell of the histogram. Descriptively, if the result set from the database is a report with rows and columns, a small amount of error is added to the original number in every cell. Provided that the sanitization point limits the number of questions asked and/or adds more noise as more questions are asked, the PPH can guarantee the privacy of the individuals who make up these aggregates.

Differential Privacy

Our privacy mechanism, К, gives ε-differential privacy if, for all transcripts t, all databases DB, and all data items (rows in the DB) Me, the following ratio (in either direction) is at most e^ε, which for small ε is approximately 1 + ε:

Pr[К(DB + Me) = t] / Pr[К(DB − Me) = t] ≤ e^ε

The differential privacy theorem is a simple mathematical statement of the fact that essentially nothing more can be learned about an individual when her information is in the DB (DB + Me) than when it is not (DB − Me). This matters in the context of joining one set of data that is in the DB (e.g., the masked medical data) with another set of data that is not (e.g., the Cambridge, MA voter list): if there is no noticeable difference between f(DB − Me) and f(DB + Me), then there is no perceptible risk in joining the two data sets together. To achieve this differential privacy, we add symmetric noise whose density is proportional to exp(−|x|·ε/Δf).

[Figure 4: the noise distributions for f(DB − Me) (black) and f(DB + Me) (red); increasing R flattens both curves]

The parameter R, which sets the spread of possible responses, is defined as R = Δf / ε. The black curve represents f(DB − Me) while the red curve represents f(DB + Me). Increasing R flattens the curves, and the flatter the curves, the more privacy is provided, because no response is much more likely in one case than in the other. A small sampling sketch follows.
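To make the noise distribution concrete, here is a minimal sketch of drawing symmetric noise with density proportional to exp(−|x|·ε/Δf), i.e., a Laplace distribution with scale Δf/ε, using inverse-transform sampling. This is a continuous-valued illustration only; the case study itself used the integer-valued, bit-by-bit sampler listed in the Appendix, and the function name here is an assumption.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Draw noise with density proportional to exp(-|x| * epsilon / delta_f),
   i.e., Laplace noise with scale b = delta_f / epsilon. */
double scaled_symmetric_noise(double delta_f, double epsilon)
{
    double b = delta_f / epsilon;
    /* u is uniform in (-0.5, 0.5), endpoints excluded */
    double u = ((double)rand() + 1.0) / ((double)RAND_MAX + 2.0) - 0.5;
    /* inverse-transform sampling for the Laplace distribution */
    return (u < 0 ? 1.0 : -1.0) * b * log(1.0 - 2.0 * fabs(u));
}

int main(void)
{
    srand((unsigned)time(NULL));
    /* example: a counting query (delta_f = 1) answered at epsilon = 0.1 */
    for (int i = 0; i < 5; i++)
        printf("noisy count: %.1f\n", 100.0 + scaled_symmetric_noise(1.0, 0.1));
    return 0;
}

With Δf = 1 and ε = 0.1, the noise scale is 10, roughly matching the noise magnitude used later in the case study.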

The question, then, is what difference the noise must obscure, that is, by how much f(DB + Me) can differ from f(DB − Me). This is captured by the sensitivity term in the differential privacy formula above:

Δf = max over all DB and Me of |f(DB + Me) − f(DB − Me)|

where the noise is calibrated to this sensitivity of the system. For a histogram or counting query, Δf = 1, since adding or removing a single person's row changes any cell by at most one.

[Figure 5: an example response where the true value f(DB − Me) = 104 and the noisy value f(DB + Me) = 105.2]

This example shows a true value of f(DB − Me) = 104 and a noisy value of f(DB + Me) = 105.2. The difference between the two seems insignificant, but whether it matters depends on the sensitivity of one's statistics. If the data is highly sensitive, where a 1.2 difference results in a $1.2 billion change in sale price, then any change to the data is unacceptable. If the data is relatively insensitive (and most data analyses are), this slight difference is well within an acceptable error margin. A key presumption of this design is that the noise must be calibrated to ε and to the number of queries; more explicitly, as the number of queries increases, more noise must be added to ensure that the noise mechanism itself can withstand an attack. The vast majority of analytical workloads do not require a substantial increase in noise because relatively few questions are asked. As seen below, the noise is independent of the database size, so privacy is ensured while accuracy varies: the larger the database, the higher the accuracy.

[Figure 6: the noise is independent of database size, so accuracy improves as the database grows while the privacy guarantee stays fixed]

Case Study Data

This case study used MSN visitor data, to which the privacy preserving noise algorithms were applied to determine reporting efficacy. MSN has 550 million Passport users (as of February 2006), whose profiles contain self-reported data such as gender, birth date, occupation, country, and zip code. The web traffic data associated with MSN adds further attributes, including IP address, pages viewed, page view duration, browser, and operating system. MSN uses this data to provide customized experiences for its users and to better understand how visitors use the various services. At the same time, there are major privacy risks, including identity theft, fraud, and bad press (e.g., AOL released search engine queries that ended up revealing some of its users). If user expectations are not satisfied, customers will no longer trust the services provided. As data accumulates, it becomes easier to segment the population and potentially identify individual users without directly using personally identifiable information. Hence, there was interest within MSN in using the MSR research and determining its applicability to all reporting.

Methods

We devised a way to mask the individuals within aggregate data using the privacy preserving histogram without losing the meaning of the data. To test the applicability of this technique within a reporting environment, we used MSN visitor data and had two different analytics groups analyze it. The "Sampled Users Web Analytics" group is a set of analysts who want to understand which MSN web sites people are using (e.g., the MSN home page, Moneycentral, MSN Autos). The "Customer Churn Analysis" group is a set of analysts who are trying to better understand how customers use the features of MSN (e.g., Messenger, Search, Hotmail). A considerable amount of time was spent with both groups to determine the reporting features and metrics they wanted. Once the requirements were determined, an Analysis Services database was created so customers could easily pivot through all of the information and perform ad hoc queries. Once they were happy with the reports, the PPH was introduced so they could view the reports and provide feedback on the differences between the original reports and the reports with PPH applied.

User Groups

Sampled Users Web Analytics Group

To provide the reports this group desired, we built a new reporting solution based on an existing web analytics solution. After loading the data into a SQL Server 2005 database for filtering and transformation, an Analysis Services 2005 cube (OLAP cube) was built. It provided the analysts with the ability to view the data from multiple dimensions using a custom web UI. We went through multiple iterations with this group to determine the exact reports they would like to view (by event, demographics, time period, etc.). In this particular case, the reporting was specific to the Channel, Search, and Money teams.

Customer Churn Analysis Group

To provide the reports this group desired, we built a brand new reporting solution. The customers were familiar with the data because the solution was built on an existing targeted marketing system. As with the "Sampled Users Web Analytics Group," a SQL Server 2005 database was created to initially filter and transform the data. Upon processing completion, an OLAP cube and a custom web UI were built on this data to provide the reporting interface. The new reporting system allowed the analysts to understand how MSN services (Messenger, Mail, Search, Spaces, etc.) were being used. Multiple iterations were also performed with this customer group to ensure they received the data they desired. A key difference between the groups is that while the first group could fall back on legacy reporting systems, the second group did not have access to any reporting at all. Within a few weeks of their request, these customers received a working beta version that they could use and provide feedback on.

 

PPH: Generating the Noise

In both types of reporting solutions (and, for that matter, in many reporting solutions), we are primarily concerned with counts and summations. Specifically, these reporting systems involve unique visitor counts, event transaction counts, and page view summations. Because we are dealing with discrete rather than continuous data, the basic PPH algorithm to implement is:

prg(R, Seed for A) = (n1, n2, …, nn)

where

  • prg(): This is the pseudo-random number generator (RNG), for example the C rand() function seeded with srand().
  • R: This is the magnitude of the noise to be applied to the system. The larger the number, the more error is added. More error means higher privacy, but it can skew the meaning of the numbers if the value is too high.
  • Seed: This is the starting value for the RNG; the importance of seeding is discussed further in the Effect on Data section.

The routine, prg(), is based on the property that for any interval [x,z] of the density function

p(x) ∝ exp(−|x| / R)

we can analytically find the point y such that p([x, y]) = p([y, z]). A single random bit can tell us on which side of y our sample should be drawn, and we can recursively apply the technique to the appropriate subinterval (i.e.: either one of [x, y] or [y, z]).

To generate the noise, the RNG will create a stream of numbers, for example:

 

0 0 1   1 1   …   1 0 0 0 0 0 1

The resulting exponential distribution translation of this stream is:

−2   +1   …   +6

Recall that the point of the PPH is to generate noise that is applied to the result set data. The noise generation uses the pseudo-random number generator described above (software cannot produce truly random numbers, so a pseudo-random generator is used) to create the stream of 0s and 1s, which is then translated into noise values. For example, looking at the right-most group of bits (starting after the …), you will see a 1, then five 0s, then a final 1. The leading 1 denotes a positive value (versus negative), the five 0s represent a count of five, and the final 1 is the full stop, which itself counts as one; hence these seven bits represent +6. For a code example, please refer to the Appendix; a simplified decoding sketch also follows.
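The following is a minimal sketch of the simplified bit-stream reading just described (sign bit, a run of 0s for the magnitude, and a terminating 1 that itself counts as one). It only illustrates the encoding in the text; the actual generator used in the case study is the intnoise() routine in the Appendix, and the function name decode_noise() is our own.

#include <stdio.h>

/* Decode one noise value from the bit stream, advancing *pos past it. */
int decode_noise(const int *bits, int len, int *pos)
{
    int sign = bits[(*pos)++] ? +1 : -1;  /* leading bit: positive or negative */
    int magnitude = 1;                    /* the terminating 1 counts as one */
    while (*pos < len && bits[*pos] == 0) {
        magnitude++;                      /* each 0 adds one to the count */
        (*pos)++;
    }
    (*pos)++;                             /* consume the terminating 1 */
    return sign * magnitude;
}

int main(void)
{
    /* the example stream above (with the elided middle omitted): -2, +1, +6 */
    int bits[] = {0, 0, 1,   1, 1,   1, 0, 0, 0, 0, 0, 1};
    int len = sizeof(bits) / sizeof(bits[0]);
    int pos = 0;

    while (pos < len)
        printf("noise: %+d\n", decode_noise(bits, len, &pos));
    return 0;
}

Running this prints -2, +1, and +6, matching the noise values used in the next section.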

Effect on Data

This stream of noise values is then applied directly to the results. Recall that the noise values were −2, +1, …, +6 and the original values were A = 36, B = 22, …, N = 102.

[Figure 7: the noise stream applied to the original result set]

The noise added to the system produces a new result of A = 34, B = 23, …, N = 108. With this noise in place, it is not possible to drill down to a value of 1, which would expose a single individual (e.g., A = 1). After all, if you can drill down to one individual, even through what are perceived as non-identifiable attributes (e.g., birth date, zip code, and gender), it becomes possible to identify that individual. The added noise never lets you reliably reach a single individual for any attribute (or combination of attributes): if the numeric answer to your question is 1 (e.g., A = 1), you cannot be sure whether it truly represents an individual or is simply an artifact of the noise.

One concern is that if we apply too much noise (e.g., +1000) we risk changing the meaning of the data completely, while if we apply too little (e.g., +0, +1, −0, −1) we risk not protecting the data very well. Hence, we followed the safe rule of tying the noise magnitude supplied to the RNG to the standard deviation of the total number of queries. In our case, the total number of queries per user ranged from 90 to 110, so the standard deviation was 10 and the magnitude applied to the RNG was R = 10. We also limited the number of questions asked of the system; otherwise it would have become possible to break the noise pattern. Because the noise generated by the RNG requires a seed value (i.e., a starting point), asking enough questions makes it mathematically possible to determine how the RNG is generating its values, and once that is determined it becomes possible to reverse engineer the system and recover the original values. Hence the importance of either limiting the number of questions asked (e.g., to about a hundred) or adding more noise as more questions are asked, preventing an attack of this nature. A sketch of this per-cell perturbation appears below.
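As a rough sketch of how the result set is perturbed cell by cell, the following example adds noise of magnitude R = 10 to each value of the A, B, …, N report shown earlier. The noise() stand-in here is a crude uniform placeholder purely so the sketch compiles on its own; in the case study the exponential generator intnoise() from the Appendix would be used instead, and the names apply_pph() and noise() are our own.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder noise source: a uniform integer roughly in [-R, +R].
   Swap in the Appendix's intnoise(R) for the real exponential noise. */
static int noise(double R)
{
    return (int)((rand() / (double)RAND_MAX) * 2.0 * R - R);
}

/* Add independent noise of magnitude R to every cell of the result set. */
static void apply_pph(long *cells, int num_cells, double R)
{
    for (int i = 0; i < num_cells; i++)
        cells[i] += noise(R);
}

int main(void)
{
    srand((unsigned)time(NULL));

    long report[] = {36, 22, 102};   /* the A, B, N values from the example */
    apply_pph(report, 3, 10.0);      /* R = 10, as in the case study */

    for (int i = 0; i < 3; i++)
        printf("noisy cell %d: %ld\n", i, report[i]);
    return 0;
}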

Results

The effect of the noise on the customer data is reflected in the tables below. In Figure 8, the left and right tables show the same query executed a few seconds apart, each time with the PPH noise applied.

[Figure 8: the same query run twice, a few seconds apart, with PPH noise applied]

For example, reviewing the country Afghanistan, the "Unknown" value is 121561 in one run and 121599 in the other. Because of this random, exponentially distributed noise, we do not know what the "real" value is. The visual effect on the data is minimal, as depicted in the image below.

[Figure 9: the reported data with PPH noise applied; the visual difference is minimal]

Sampled Users Web Analytics Group

We first received feedback from the "Sampled Users Web Analytics Group," which wanted to understand what visitors were using on the MSN.com and Windows Live web sites. The concept of PPH was introduced to this group after we built a new reporting solution, specific to their business requirements, on top of an existing web analytics solution. In addition, an Analysis Services database was built for this new reporting solution to give the group the ability to view the data from multiple dimensions, something they had wanted for quite some time.

[Figure 10: the Sampled Users Web Analytics reporting view built on the Analysis Services database]

Unfortunately, the resulting feedback was negative; the basic problem was that this group could not accept any amount of error in their data. For example, an initial query provided the number of visitors per country, where the total number of US visitors was 202. A subsequent query that broke out the same US visitors by gender would yield a total of 203. Although the added noise explains these differences, this group insisted that the numbers had to match at all times. It did not matter that these numbers were not used for financial reconciliation and were intended only for analysis.

From further discussions with this group, it appeared that they had previously relied on reporting systems with perceived accuracy issues. It is this perception that drove the group's negative reaction to any additional noise being added to the resulting data.

Customer Churn Analysis Group

Based on this feedback, we were originally a little disheartened about the prospects of applying this type of privacy functionality to reporting systems. But after conversations with other analysts (including some in the healthcare field), it appeared that they did not have issues with the effects of additional noise used to ensure individual privacy.

Therefore, we proceeded with the next case study group: the Customer Churn Analysis group. After determining the group's reporting requirements, an Analysis Services database was built on top of an existing targeted marketing system. As with the first group, there were multiple iterations of this database so that the analysts could better understand how MSN services such as Messenger, Mail, Search, and Spaces were being used. The major difference between the two groups was that the "Customer Churn Analysis Group" had never had access to reporting data before. To their surprise, within a few weeks of their initial request they were able to interact with, validate, and provide feedback on a working reporting solution. Most importantly, they were able to analyze and re-analyze the data to assess its precision and accuracy.

To our delight, once we introduced PPH into their churn reports (with the additional noise visible in the result set), they were satisfied with the results. It even reached the point where customers viewing the reports forgot that a privacy algorithm had been applied, even though the numbers in the reports changed between runs (due to the added noise) and the statement "Privacy Preserving Histogram Applied" appeared directly on the reports. Based on conversations and surveys, it was very apparent that the collaborative effort between us and the customer led to the customer trusting the data, which is a key difference from the first group. Because of this trust, the small amount of error introduced to ensure customer privacy was well within a tolerable error margin. It also helped that this is a direct marketing group, so they were already familiar with the concept of customer privacy.

Discussion

Ultimately, trust in the data proved essential to acceptance of the privacy preserving histogram. If customers trust and understand the data, the side effects of the PPH algorithm are well within a tolerable error margin. Customers who regularly deal with personally identifiable information are more interested in this concept and more willing to accept the additional noise. Further analysis will be required to confirm that its applicability is not limited to these types of reporting scenarios, but the case study provides enough positive feedback to warrant that future work.

References

[1]. Sweeney L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002: 10 (5), pp. 557-570.

[2] US Department of Health and Human Services (Office for Civil Rights), HIPAA Administrative Simplification [Regulated Text]. February 16, 2006. Available from: http://www.hhs.gov/ocr/AdminSimpRegText.pdf

[3] MSN, Microsoft Online Privacy Notice. C2007. Available from: http://privacy.microsoft.com/

[4] OHSU Healthcare System (Administrative Policy Manual), Permitted Uses & Disclosures of Protected Health Information (ADM 04.22). April 14, 2004. Available from: http://www.ohsu.edu/cc/hipaa/docs/04-22.shtml

[5]. Cannon JC, Cavoukian A. Privacy: What Developers and IT Professionals Should Know. Boston, MA. Addison-Wesley Professional. 2004.

[6]. Dwork C, McSherry F, Nissim K, Smith A. Calibrating Noise to Sensitivity in Private Data Analysis. The Third Theory of Cryptography Conference (TCC 2006); 2006 March 4-7; New York, NY: International Association for Cryptologic Research; 2006. pp. 265-284.

[7]. Blum A, Dwork C, McSherry F, Nissim K. Practical Privacy: The SuLQ Framework. The 24th Symposium on Principles of Database Systems (PODS 2005); 2005 June 13-16; Baltimore, MD: Association for Computing Machinery; 2005. pp. 128-138.

[8]. Chawla S, Dwork C, McSherry F, Talwar K. On the Utility of Privacy-Preserving Histograms [link on the internet]. The 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005); 2005 July 26-29; Edinburgh, Scotland: Association for Uncertainty in Artificial Intelligence; 2005. http://research.microsoft.com/research/sv/DatabasePrivacy/cdmt.pdf

[9]. HIMSS Privacy and Security Toolkit [homepage on the internet]. Chicago, IL. C2007. Available from: http://www.himss.org/ASP/privacySecurityTree.asp?faid=78&tid=4

Appendix

Code Example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

/* The following routine is based on the happy property that for any
   interval [x,z] of the density function

       p(x) ~= exp(-|x|/R)

   we can analytically find the point y such that p([x,y]) = p([y,z]).
   A single random bit can tell us on which side of y our sample should be
   drawn, and we can recursively apply the technique to the appropriate
   subinterval (i.e., either [x,y] or [y,z]). */

/* The way we get at random bits.
   This can be extended/altered for an arbitrary pseudorandom generator. */
int randombit() { return rand() % 2; }

/* Produces integer-valued noise of magnitude R */
int intnoise(double R)
{
    double tally = 0.0;
    /* double tally = 0.5 - R * log((exp(1.0/R) + 1.0)/2.0); */

    /* stage 1: flip a lot of coins */
    while (randombit() != 0)
        tally += R * log(1.0/2.0);

    /* stage 2: now sample from this interval */
    double x = R * log(1.0/2.0);

    /* loop until the integer part is fixed */
    while (round(tally) != round(tally + x))
    {
        double midpoint = R * log((1.0 + exp(x/R))/2.0);

        /* grab one random bit to pick a subinterval */
        if (randombit())
        {
            tally += midpoint;
            x = x - midpoint;
        }
        else
            x = midpoint;
    }

    /* a final random bit decides the sign of the noise */
    return randombit() ? (int)round(tally) : -(int)round(tally);
}

/* First argument is the magnitude R of the noise.
   Second argument is the number of trials to conduct. */
int main(int argc, char** argv)
{
    if (argc < 3)
    {
        printf("usage: %s <noise magnitude R> <number of trials>\n", argv[0]);
        return 1;
    }

    /* Sets the random seed to a "random" value. This should be done
       carefully in production, but it is ok for testing. */
    srand((unsigned)time(NULL));

    for (int i = 0; i < atoi(argv[2]); i++)
        printf("noise: %03d\n", intnoise(atof(argv[1])));

    return 0;
}
