Differential privacy: An illustrated primer

I recently mistakenly titled a fairly advanced post "Differential privacy for dummies", and at least a few folks expected introductory helpful material and instead found ... something else, let's say.

So let's try and explain differential privacy for non-experts, somewhat clearly. Even for experts, there may be a few important points in here that you haven't seen before. We'll also work through some examples, including one way to evaluate the correlation coefficient of two datasets, which I know some authors are eager to see.

Differential privacy

There are a large number of definitions of privacy out there. Try and write down what privacy means to you, and you start to understand that it might be a little subtle to really nail down precisely. Some of these privacy definitions are philosophical, for example Warren and Brandeis ("the right to be let alone") or Tore Dalenius ("nothing new should be learned about you specifically when one analyzes your data"). Some are more technical, and have markedly less concise quotes describing their intent. Each definition attempts to describe and motivate a certain flavor of privacy, and while this is an excellent scholarly pursuit, it doesn't often translate into immediately helpful advice for the concerned data subject.

Differential privacy is yet another privacy definition. It is certainly both philosophical and technical, but more importantly it is actionable: it directly attacks the specific concern data subjects have: what will be the consequences of participation in this data set?

Differential privacy guarantees that nothing that could not happen without access to your data will happen with access to your data.

In fact it makes a stronger quantitative guarantee: The chance that any specific thing happens (really, anything at all) with access to your data is at most a multiple X of the chance it would happen without your data. That "multiple" X is part of the guarantee and it determines how much privacy you get: a value of 1.0 would be perfect privacy (which by definition ignores your data), small values like 1.01 are pretty great, whereas values like 10.0 are less amazing but still non-trivial.

What sorts of things does differential privacy ensure don't happen now that couldn't happen before? Really, anything:

  • Are you worried that by participating you might get more junk mail and phone calls? The expected number will increase by no more than a factor of X.

  • Are you worried that your embarrassing on-line searches might show up on the front page of the New York Times? It is only X times more likely than if you hadn't even made them (probably pretty unlikely in that case, right?)

  • Are you worried that by participating in a phone survey, your favorite team will lose the Superbowl? You just can't shift the odds that much, sorry.

This is a fairly strong guarantee. In fact, it is a lot stronger than most people realize.

Differential privacy even protects you from privacy non-experts

Almost all of the existing privacy definitions frame some sort of "privacy violation", and then argue one way or the other that such a violation couldn't happen. For example, "re-identification" is when someone (typically a smart privacy expert) is able to link your private data back to your public identity. That could be a problem, but it isn't the only problem. Other definitions protect you from a hypothetical perfect mathematical reasoner with outside information who wants to know your private data (differential privacy does this too, for example). That is also good to guard against, but it isn't the only problem. Generally speaking, smart, well-informed, statistically minded people are not the problem (in life, generally).

Rather, here is a problem that almost no privacy definitions guard against: you are a brown person living in rural America, and perhaps you happen to follow a conservative interpretation of Islam. You get a totally anonymous form to fill out, indicating yes or no about Islamism. There is no way to link back to who you are, no personally identifying information, nothing. How do you fill out the form?

If the results are published, just the totals with no identifying information, and the number of Islamists in your small town is greater than zero, you can bet that as the brownest person around you are going to get beat on by some good ol' boys right quick. They don't have a fancy education, they didn't re-link the published data with auxiliary side information, they didn't consult with the newest research on re-identification. They just saw some number they didn't like and figured they should do something about it.

So, privacy accomplished? Which privacy philosopher do you have to thank for your new busted-ass condition?

Differential privacy's strong guarantees protect you even from "non-experts". Your participation is protected from the malicious boss who needs any sort of evidence to can you, from the abusive spouse who is just looking for something to set them off, from the law enforcement officer who is going to raid whichever apartment seems the most suspicious. Differential privacy also protects you from that weird-ass thing you had no idea would be a problem because really wtf who knew about that 20 years ago? It is pretty serious business.

Differential privacy can't promise you that you will have a long, happy, and uneventful life, but it can promise that participating in differentially private computations will not be the reason shitty things happen to you.

Too good to be true?

It certainly sounds good. There are going to be some trade-offs to discuss, but let's first talk through the formal definition and see how it is possible to provide these sorts of guarantees.

Differential privacy is applied to computations, things you do with data. The important observation (assumption?) is that the rest of the world just reacts to what the computation produces, it doesn't otherwise notice anything about your data. If the computation produces outputs similarly with and without any one individual input datum, it might be a candidate for differential privacy.

The formal version says that a randomized computation has epsilon-differential privacy if: the addition or removal of any one input record from any possible input dataset changes the probability the randomized computation produces any possible output by at most a factor of exp(epsilon).
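Written as a formula, with M standing for the randomized computation, and data and data' standing for any pair of input datasets that differ in one record, the guarantee reads:

Pr[M(data) = output]  <=  exp(epsilon) * Pr[M(data') = output]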

There are a lot of "any"s up there. This is differential privacy being careful, and making sure that the randomized computation covers all the cases. Let's go through each of them just to be sure:

  1. "any one input record": We don't know what your secrets are, and what data you might contribute. You want guarantees no matter what your data are.

  2. "any possible input dataset": You don't know what the rest of the dataset looks like, and want guarantees independent of what it looks like.

  3. "any possible output": We don't know which outputs you are worried about making more or less likely. You probably don't know which outputs you are worried about making more or less likely. Best to be sure.

There is only one thing that doesn't have an "any" in front of it, but it is a bit hidden in the definition. A randomized computation gets provided with random bits as input (imagine a random number generator) to help it make random choices. The randomized computation is actually a deterministic computation applied to input data and input "random" bits. The computation is allowed to produce different outputs for different input data and bits. However, the number of distinct random bit strings that lead to any given output cannot change too much when a record is added or removed; that number is what determines the probability that the randomized computation produces the output.

Perhaps at this point you start to see the core issue with differential privacy: randomness. All non-trivial differentially private computations involve randomness, and the amount of randomness is probably (yup) going to increase as we want better and better privacy guarantees. Enormous gobs of randomness are great for privacy, but they are probably going to antagonize the nice folks who wanted to see the results of the computation in the first place. "Too bad", you might say, but if a privacy control isn't serving both parties, producers and consumers, you probably shouldn't have even bothered in the first place.

How big are these gobs, then?

Let's work through an example differentially private computation, probably the easiest one out there, and get a sense for how much damage differential privacy might do. Importantly, we are going to see examples where differential privacy introduces some amount of error, but not proofs that differential privacy must introduce that much error.

Imagine we want to count the number of records in our input dataset. That seems pretty primitive, but we'll build up from this. Counting and reporting the number of records in the input dataset does not by itself provide differential privacy, because if you add or remove a record you absolutely change the output. The computation has no randomness, so this just isn't going to work.

A simple remedy is to add some random error to the reported count. To satisfy differential privacy, we need to add random error with the property that if I add or remove a record, changing the true answer by at most one, I don't cause any output to become more or less likely by more than a factor of exp(epsilon). Some answers can become more likely, just not by too much.

The probability that we see any specific output val from our count is the probability that the error happens to be val - count. This probability could be a few things, but we need to make sure that it isn't too far from the probability that the error is val - (count + 1), because that is the probability we see val after we add a record to the dataset. Likewise, it can't be too far from the probability that the error is val - (count - 1).

Pr[error = val] < Pr[error = val + 1] * exp(epsilon)
Pr[error = val] > Pr[error = val + 1] / exp(epsilon)

It turns out that these are the only sorts of constraints, and if we just try and make the probabilities get as small as possible for larger and larger error, we arrive at something called the Laplace distribution.

The Laplace distribution describes some random error that drops off exponentially quickly as we move away from zero. This "exponentially quickly" means that it is often quite small, but it also happens to exactly match the requirements we had up above for differential privacy. More specifically, we want its probabilities to drop off at a rate of exp(epsilon) for each unit from zero. The Laplace distribution with parameter 1.0/epsilon does exactly that!
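Concretely, the chance of each integer amount of error drops off like

Pr[error = val]  ~  exp(-epsilon * |val|)

which is exactly the exp(epsilon) ratio between adjacent error values that we required above.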

So, do we get accurate counts or what? Not exactly accurate, as mentioned, but the probability that the count is wrong by more than c/epsilon is bounded by something like exp(-c). So maybe you don't get to know the count to within +/- 1.0/epsilon, but the bounds aren't much fuzzier than that. If epsilon is 1/10, the noise has scale 10, and being off by more than 100 has probability around exp(-10); you should expect the count not to be more than 100 wrong.

100 is enormous! So many, um. units!

This is where the randomness bites you, and you need to figure out if you can deal with it. In many cases, if you are analyzing a large population, having your counts off by 100 isn't a big deal. If you are doing surveys of large populations, you often find more error in the random subset of people you picked as subjects. Actually, let's look at that in more detail.

Imagine we have a large population where some of the people are A and some of the people are B. We want to get accurate information about what fraction are A and what fraction are B, so we ask a lot of people; maybe 1,000 people. Let's pretend that 40% of the population are A: if we do this experiment many times we can get different answers; we won't always get 400 people of type A.

One easy way to visualize a distribution is just to draw a bunch of samples from it, then sort and plot them. This is an empirical form of the cumulative density function, which is a fancy way of saying "what fraction of samples land below various thresholds". We expect to see curves that start at zero (because no samples are smaller than the smallest value), end up at one (because all samples are smaller than the largest value), and only go upwards from left to right. The shape of the curve tells us where we find most of the samples: a very steep segment indicates that lots of samples suddenly came into play, a flat segment means no samples, and a sloped segment indicates a roughly uniform distribution of samples.
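If you want to make plots like the ones below, a minimal sketch of that recipe looks like the following (my illustration, not code from the original measurements; the name and signature are just for convenience): draw the samples, sort them, and pair each with the fraction of samples at or below it.

// draws `trials` samples from `sample`, sorts them, and returns
// (value, fraction at or below) pairs tracing out an empirical CDF.
fn empirical_cdf<F: FnMut() -> f64>(mut sample: F, trials: usize) -> Vec<(f64, f64)> {
	let mut draws: Vec<f64> = (0..trials).map(|_| sample()).collect();
	draws.sort_by(|a, b| a.partial_cmp(b).unwrap());

	draws.iter()
		 .enumerate()
		 .map(|(index, &value)| (value, (index + 1) as f64 / trials as f64))
		 .collect()
}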

So let's look at the cumulative density functions for those A and B people. We'll sample 1,000 people and compute the A count, repeating 1,000 times so that we can draw a smooth looking line.

[figure: empirical CDF of type-A counts across 1,000 samples of 1,000 people]

As expected, we see a sharp increase at around 400, which is good because we really don't expect to see values too far away from that value.

Let's take a different view and ask about the distribution of error. When we draw a sample, how different is it from 400? This is just taking those numbers up there and sliding the x-axis around, which isn't all that hard.

[figure: empirical CDF of the sampling error, centered at zero]

Now let's do the same thing with the Laplace distribution with parameter 10.0. This corresponds to an epsilon value of 0.1, which is not such a horrible privacy guarantee (pretty good, actually). To make the point clearer, I've plotted these samples on top of the plot just above, which showed the inherent randomness of the sampling process:

[figure: empirical CDFs of sampling error and Laplace(10.0) noise, overlaid]

So, the error we are introducing for 0.1-differential privacy isn't actually worse than the error inherent in this particular measurement process. In this particular case, it seems like it might actually be better. The concentration is tighter near the correct answer, and a bit looser out in the tails (which should make sense to folks familiar with the Laplace and Gaussian distributions).

So we can count things, and get answers that are probably within the allowed error tolerance for people who count things.

Something more sophisticated

Counting seems pretty trivial. In fact, lots of statistics basically boils down to counting. A distribution is just a statement about how many counts of things you have (or expect to have) at various locations. Ok, statistics is actually a bit more complicated than this.

What you do often see in statistics are accumulations (sums, often) where the contribution from each individual sample is bounded. This makes some sense: if one respondent could blow the statistic out of the water, any number of reasons could cause the result to be tainted: surprising outliers, data corruption, mischief generally.

Fortunately, the Laplace noise addition above applies to any summation where each datum contributes at most one to the sum. A count has that property, right? It turns out that this relaxed property is good enough (actually, we can relax it even more). Let's think about some summations where each datum contributes at most one to the sum.

Important: for these experiments, all of our data are going to use numbers between -1 and 1. It is important that they lie in this range, because if not we must either: i. clamp them to the range, or ii. give worse privacy guarantees for large entries. Now, you might say "that's a bit of a cheat!" but if your numbers lie between -b and b, mentally divide them by b. If you can't think of such a number b, wait for a section or two (when we do median).

So, things we could sum up:

  1. The number of data elements is just the count, which we already did.

  2. The average (or mean, or expectation) of the numbers is their sum, divided by their count. We already have the count! The sum of numbers from -1 to 1 is one of these sums that can change by at most one when we add or remove a record.

  3. The second moment (no other name, sorry) of the numbers is the sum of their squares divided by their count. The square of a number that lies between -1 and 1 is also at most one, so we can sum them up.

    The second moment is interesting (perhaps) because you can combine it with the mean to determine the variance of the data.

    variance(data) = 2nd moment(data) - mean(data)^2
    

    The variance is interesting because its square root is the standard deviation, a measure of how scattered the data are around their mean.

Let's take a moment to look at some measurements. Imagine we have data between -1 and 1. Perhaps this is some metric for how predisposed you are towards chocolate. So, let's say that two thirds of the data are uniform on [0, 1] and one third are uniform on [-1, 0]. Apparently people like chocolate.

We have already seen how the inherent randomness in counts relates to the randomness introduced by differential privacy. Let's look at how the inherent randomness in the standard deviation relates to the standard deviation measured by i. computing the second moment with noise, and ii. computing, squaring, and subtracting the mean of the data.

This raises an important point about differential privacy: as you ask more questions, the privacy guarantee degrades. If you ask two epsilon-differentially private queries, the two together only guarantee 2 * epsilon-differential privacy. It's annoying that privacy degrades, but it makes some sense. What is appealing (to me at least) is that it degrades in a controlled fashion, as opposed to suddenly collapsing entirely.

Sequential composition: If you perform two (or more) differentially private computations, they collectively provide differential privacy with parameter bounded by the sum of the parameters you used.

To do our experiment we will want to compute the count, the mean, and the second moment each with privacy parameter epsilon/3, which triples the amount of noise we will introduce. Oh no! Well, let's see how it looks, anyhow. Here are 1,000 samples of the standard deviation, where I have subtracted off the mean value to center the error at zero, and 1,000 of 0.1-differentially private measurements of the standard deviation, also with the true value subtracted off.

[figure: empirical CDFs of error in the standard deviation, with and without differential privacy]

Because I am brave, I am writing the text before doing experiments. I will say that while the contrast isn't as clear as with our counts up above, the error from differential privacy is not drastically worse than the inherent randomness of the standard deviation statistic. TODO: UPDATE; APPEND-ONLY.

Having done the measurements, I think it is fair to say that the error from differential privacy is absolutely worse than the inherent randomness of the standard deviation statistics. But, that's fine. We are providing good privacy (0.1) on a small-ish dataset (1,000 records), and the error isn't yet a complete disaster (coming up!). Instead, let's use this as a good example that if you have good privacy and limited data and don't try too hard, you get to eat some error. Omnomnomnom.

Let's do a final sum, so that we can reach an interesting goal: the Pearson correlation coefficient.

  1. The product moment of a collection of pairs of numbers is the sum of their products, divided by their count.

    The product moment is interesting, because if you subtract the product of the means of the coordinates, you get the covariance. The variance up above was just a special case where the two numbers in the pair were the same. Also, if you divide the covariance by each of the standard deviations (one for each coordinate) you get the Pearson product-moment correlation coefficient.
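    In formulas, echoing the variance identity above (with x and y the two coordinates of each pair):

    covariance(data)  = product moment(data) - mean(x) * mean(y)
    correlation(data) = covariance(data) / (stddev(x) * stddev(y))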

We'll look at some data for the correlation coefficient, but let's put it together as an actual routine.

Can you show me how to do this?

Let's look at some code! Everyone likes code. It reveals all the dark, horrible secrets, like whether ln or log is the right logarithm to use (I can never remember).

First, let's write something to generate Laplace noise.

extern crate rand;
use rand::Rng;

// generates a sample from the Laplace distribution by taking
// the ln of a (0, 1] random number, then choosing a random sign.
fn laplace(scale: f64) -> f64 {
	let mut rng = rand::thread_rng();
	scale * (1.0 - rng.next_f64()).ln() * if rng.gen() { 1.0 } else { -1.0 }
}

Let me put on "serious face" for a moment and say: "this code is for demonstration purposes only". Ilya Mironov had some good points about how generating noise too casually doesn't really fill in all the bit patterns you need to fill in, and computers are pretty good at looking at low-order bits, even if humans are not so good at seeing them when we plot things. Use fixed-precision arithmetic, and generate Laplace random variables using independent Bernoulli variables for each bit.

Now let's write a helper routine to make taking noisy sums easy!

trait NoisySum {
	fn noisy_sum(self, epsilon: f64) -> f64;
}

impl<I: Iterator<Item=f64>> NoisySum for I {
	// computes a sum with epsilon-differential privacy,
	// thresholding elements as appropriate.
	fn noisy_sum(self, epsilon: f64) -> f64 {
		let sum = self.map(|x| if x < -1.0 { -1.0 } else { x })
					  .map(|x| if x >  1.0 {  1.0 } else { x })
					  .sum::<f64>();

	  	sum + laplace(1.0/epsilon)
	}
}
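As a quick example of how these noisy sums compose, here is a hypothetical noisy_mean (my sketch, not part of the code above) that spends half of its privacy budget on the count and half on the sum, giving epsilon-differential privacy in total by sequential composition:

// an epsilon-differentially private mean: epsilon/2 for the count,
// epsilon/2 for the sum, epsilon in total by sequential composition.
fn noisy_mean(data: &[f64], epsilon: f64) -> f64 {
	let count = data.iter().map(|_| 1.0).noisy_sum(epsilon / 2.0);
	let sum = data.iter().cloned().noisy_sum(epsilon / 2.0);
	sum / count
}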

This method takes an iterator over f64 values, presumed and enforced to be between -1 and 1. It sums them up and adds Laplace noise with parameter 1.0/epsilon. Notice that we are taking action to ensure that the terms are bounded, because otherwise we don't have differential privacy. We do not want to rely on assumptions about the data to get our guarantees. Instead, faulty assumptions turn into poorer accuracy, not poorer privacy (you can fix accuracy; you can't fix privacy).

What might we do with such a delightful subroutine? Well, how about the Pearson correlation coefficient up above?

// The Pearson product-moment correlation coefficient of two distributions X and Y
// can be computed as:
//
//   (E[XY] - E[X]E[Y]) / (E[X^2] - E[X]^2)^{1/2} (E[Y^2] - E[Y]^2)^{1/2}
//
// which we do here with epsilon-differential privacy.
fn ppmcc(data: &[(f64, f64)], epsilon: f64) -> f64 {

	// count the number of records, for the denominator of our expectations.
	let n = data.iter().map(|_| 1.0).noisy_sum(epsilon/6.0);

	// determine first moments by summing terms (with noise) and dividing by n.
	let ex = data.iter().map(|&(x,_)| x).noisy_sum(epsilon/6.0) / n;
	let ey = data.iter().map(|&(_,y)| y).noisy_sum(epsilon/6.0) / n;

	// determine second moments by summing products (with noise) and dividing by n.
	let ex2 = data.iter().map(|&(x,_)| x * x).noisy_sum(epsilon/6.0) / n;
	let ey2 = data.iter().map(|&(_,y)| y * y).noisy_sum(epsilon/6.0) / n;

	// determine product moment by summing products (with noise) and dividing by n.
	let exy = data.iter().map(|&(x,y)| x * y).noisy_sum(epsilon/6.0) / n;

	// return the Pearson product-moment correlation coefficient.
	(exy - (ex * ey)) / ((ex2 - (ex * ex)) * (ey2 - (ey * ey))).sqrt()
}
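To try it out, here is a hypothetical way to cook up input data for the experiments below (my sketch; the 40/60 split and the perfectly anti-correlated chocolate scores are just for illustration), encoding A people as -1.0 and B people as 1.0:

// generates (type, chocolate) pairs: type is -1.0 (A) or 1.0 (B); chocolate
// is either uniform in [-1, 1] or exactly the opposite of the type value.
fn test_data(people: usize, anti_correlated: bool) -> Vec<(f64, f64)> {
	let mut rng = rand::thread_rng();
	(0..people).map(|_| {
		let kind = if rng.next_f64() < 0.4 { -1.0 } else { 1.0 };
		let choc = if anti_correlated { -kind } else { 2.0 * rng.next_f64() - 1.0 };
		(kind, choc)
	}).collect()
}

// for example: println!("{}", ppmcc(&test_data(10_000, true), 0.1));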

That was so easy! Except for the parts about statistics, which whatever, right?

Let's take a peek! Let's take our population of A and B people, and indicate A people by a value of -1 and B people by a value of 1. Now let's randomly give each person some predispositions for chocolate, independent of their type. We should see no correlation.

[figure: empirical CDF of noisy correlation estimates, chocolate independent of type]

If instead A people always like chocolate and B people never like chocolate, we should see something else: a strong negative correlation (between "being B" and "liking chocolate").

[figure: empirical CDF of noisy correlation estimates, chocolate anti-correlated with type]

Ok, we don't really see that, do we? We have enough questions here (six) that each one only gets a privacy parameter of 0.1/6 if we want 0.1-differential privacy overall, which means quite a bit more noise per query. We could let the privacy parameter drift upwards, or survey some more people. Let's say we ask 10,000 people instead. The privacy guarantee for each respondent stays the same, but the accuracy improves (here both distributions are on the same plot, because they don't interfere as much).

[figure: empirical CDFs of noisy correlation estimates with 10,000 respondents, both cases on one plot]

Other computations

Adding up numbers is fun, but we can do a lot better if we think a bit harder. In this section, we'll look at a totally different way of doing differentially private computations, and find that we can get pretty amazing concentration for some statistics. Specifically, we are going to look at computing the median using something called "the exponential mechanism".

The exponential mechanism is actually pretty easy to describe. Analyzing its behavior can be a bit harder, but we can always just run it and see what happens.

To whip up an instance of the exponential mechanism, you need to define two things:

  1. You must identify a possible set of output values.

    In this case, let's use "32 bit integers" to represent the range from -1 to 1 (so we are mentally dividing our numbers by 2^31).

  2. You must provide a scoring function from (dataset, output value) to the real numbers with the property that it changes by at most one when you add or remove an input element.

    In this case we will use the function that measures how evenly the value partitions our data.

    score(data, val) = -||data > val| - |data < val||
    

    Notice that negative sign; the score penalizes val by the discrepancy in partitioned sizes. Also notice that adding or removing a record can only change this score function by at most one, because the record lands in one of the two parts, and changing either's size by one changes the score by at most one.

With these two features defined, the exponential mechanism selects an output val with probability proportional to

Pr[exp_mech(data, epsilon) = val] ~ exp(epsilon * score(data, val) / 2)

Run this way, the exponential mechanism provides epsilon-differential privacy. This is almost directly true: we rig the probability of picking each output so that it can only increase or decrease by a factor of exp(epsilon / 2) when a record is added or removed. You might wonder why we couldn't drop that / 2 up there; the reason is that the distribution needs to get normalized by something, and that normalization can also change by a factor of exp(epsilon / 2) when we add or remove a record, for exp(epsilon) in total.

At the same time, the exponential mechanism does a great job of picking outputs that score well. The probability it picks any one output whose score is less than the best score by diff is at most exp(-epsilon * diff / 2). For example, in the case of the median, the probability it produces any one output that splits the data into parts differing in size by more than 100 is less than exp(-epsilon * 50).

Now, there are probably lots of values that split the data badly, and each of them gets an equal shot, so we also need to worry about how many of them there are. However, their number bumps up against an exponential discount on probability, which means it only competes with diff after having ln applied to it.

In the case of the score function up above, it is really easy to evaluate the exponential mechanism as well. The probabilities are piecewise constant: we just sort the data, and every output value within an interval between adjacent data points gets exactly the same probability. We just need to figure out the probability for each interval, multiply by the number of outputs in the interval, and then sample an interval followed by a point within it.
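To make this concrete, here is a sketch of the exponential mechanism for the median where the candidate outputs are handed in explicitly (a grid of values, say) rather than being all 2^32 possible i32 outputs; the interval trick just described is how you would handle the full range efficiently. The function name and signature are just for illustration.

// picks a candidate median with probability proportional to
// exp(epsilon * score / 2), where score penalizes imbalanced splits.
fn exp_mech_median(data: &[f64], candidates: &[f64], epsilon: f64) -> f64 {
	let mut rng = rand::thread_rng();

	// score(data, val) = -||data > val| - |data < val||
	let scores: Vec<f64> = candidates.iter().map(|&val| {
		let above = data.iter().filter(|&&x| x > val).count() as f64;
		let below = data.iter().filter(|&&x| x < val).count() as f64;
		-(above - below).abs()
	}).collect();

	// weights proportional to exp(epsilon * score / 2); shift by the best
	// score so the exponentials don't all underflow to zero.
	let best = scores.iter().cloned().fold(std::f64::NEG_INFINITY, f64::max);
	let weights: Vec<f64> = scores.iter().map(|&s| ((s - best) * epsilon / 2.0).exp()).collect();

	// sample a candidate with probability proportional to its weight.
	let total: f64 = weights.iter().sum();
	let mut draw = rng.next_f64() * total;
	for (index, &weight) in weights.iter().enumerate() {
		if draw < weight { return candidates[index]; }
		draw -= weight;
	}
	candidates[candidates.len() - 1]
}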

Let's take 1,000 points uniform between -1 and 1, represented as i32 values, and look at the distribution of outputs returned by the exponential mechanism with epsilon = 0.1.

[figure: empirical CDF of exponential-mechanism medians for 1,000 uniform points, epsilon = 0.1]

This looks pretty good. Again our measurement introduces error that is roughly the same as the inherent randomness of the statistic itself, so we don't feel too bad.

The measurement curve looks a bit herky-jerky though. What is going on?

Moar concentration

It turns out that the exponential mechanism is roughly following the contours of the data. Output values become less likely as we have more data points, rather than just because we move away from the correct answer. This seems ... irrelevant perhaps, until we think about much more tightly concentrated data.

Let's put all of our 1,000 samples in the interval [0.123, 0.124] and use the same mechanism; we won't tell it that the data are concentrated, so it still acts as if they could be anywhere in -1.0 to 1.0. The median should be around 0.1235. We have to zoom in to see the detail, so pay attention to the offset and scale of the x-axis.

[figure: empirical CDF of exponential-mechanism medians for data concentrated in [0.123, 0.124]]

Wow. Quite concentrated.

The samples mostly all lie within an interval about 0.00006 wide. The data were only concentrated in an interval 0.001 wide, and we took the measurement as if the data could be anywhere in the -1.0 to 1.0 interval. And we maintained 0.1-differential privacy, using only 1,000 samples. Pretty neat.

Even moar concentration

Let's look at another example, where the points are all exactly one value (maybe 0.4). In this case, every output other than exactly 0.4 splits the data very badly, and incurs a penalty of -|data|. If we have 1,000 samples, and that 1,000 goes up in an exponent, we could imagine something being very unlikely. Let's count it out:

There are 2^32 possible output values, which is about exp(22.18). All but one of these (0.4) get a probability proportional to exp(-epsilon * 1,000 / 2) which with epsilon = 0.1 is exp(-50). So the cumulative probability for non-0.4 outputs looks like exp(-27.82) at the moment. Technically, it needs to be this divided by one plus that, for normalization, but ... that is basically just the same number.
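Writing out that arithmetic:

ln(2^32) = 32 * ln(2) = 22.18 (or so)
exp(22.18) * exp(-0.1 * 1,000 / 2) = exp(22.18 - 50) = exp(-27.82)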

I'm afraid I didn't bother to plot the cdf for this one.

So, an exp(-27.82) chance of producing anything other than the true median. That's crazy concentrated! It gets a bit more concentrated if there are a few different nearby numbers (the 2^32 term hurts less), but whatever. It is really very concentrated.

Generalizing

There is a neat generalization of what we did above to a larger class of statistics. It doesn't always work as easily; median happened to be easy to compute, and computing things can be harder than defining them. That being said, for any statistic you have in mind, you can define an instance of the exponential mechanism using:

  1. The set of possible output values as the possible values of the statistic (whatever range it takes).

  2. The score function as

    score(data, val) = -"how few records in data must change to make stat(data) = val?"
    

This score function penalizes output values that would require large changes to the data to make the statistic actually take that value. In the case of the median, we need to change large amounts of data to get a far-away median. Other statistics are less robust, but you still get useful computations.

For example, you can do "average of -1 to 1 values", like we did up above with Laplace noise, but here get much tighter concentration around the limits: if the average value is currently -0.99 due to all points themselves being -0.99, then you may have to change a very large number of points to get the average to -0.999; at least 10% of them. To get the value to -1.0 you'd have to change all the points, which ends up with even more penalty. Despite -0.999 and -1.0 being somewhat close to -0.99, they end up with exponentially small probabilities in the size of the dataset (in this case).

It is possible, though I don't know the answer, that you could apply this to the problem of the Pearson correlation coefficient that we studied above. It has a valid range of -1.0 to 1.0, and if you impose a bound on the magnitude of the records, you should also get concentrated results that are especially concentrated at the limits of its valid range. Plus you bypass the factor of six we ate indirectly computing all those moments.

The hard part for general statistics, unlike median and average, is determining how few records you need to change to reach a given value of the output. Is it piecewise linear and monotonically increasing as you move away from the true value? Can the break-points be easily determined? I don't know the answers to these, even for the correlation coefficient, but they seem like good questions to ponder!

Research, ho!