Protecting Privacy With MATH (Collab With The Census)

Disclaimer: This video was produced in collaboration
with the US Census Bureau and fact-checked by Census Bureau scientists; any opinions
and errors are my own. Every ten years the US Census Bureau surveys
the American population – the ambitious goal is to count every person currently living
in the entire United States of America and collect information about them like age, sex,
race and ethnicity. The whole purpose of doing surveys like the
census (and many other big medical or demographic surveys) is to be able to get an overall,
quantitative picture of a particular population – how many people live in Minnesota? Or Mississippi? What’s their average age? And how do these things differ in different
places, or by sex, or race? The results of the US Census are of particular
political relevance since they’re used to determine the numbers of seats that different
states get in the US House of Representatives as well as the boundaries of legislative districts
from Congress down to city councils, but big surveys are also useful for understanding
lots of other issues, too. The problem, of course, is that the Census (like many other
medical and demographic studies) is supposed to be private. Like, no one outside the Census Bureau is
supposed to be able to look at just the published statistics about the US population demographics
and definitively figure out that there’s a white married male 31-year-old with no kids
living in my neighborhood (that’s me). The Census Bureau is supposed to keep my information
confidential. And they’re supposed to keep the information
of every single other person living in the United States confidential, too. Which is a tall order, because how can you
keep everyone’s information entirely confidential while still saying anything at all based on
that information? The short answer is that you can’t. There’s an inherent tradeoff between publishing
something you learn from a survey and maintaining the privacy of the participants. It might seem like you could just remove people’s
names from the spreadsheet, or only publish summaries like averages and totals. But it’s easy to reconnect names to datasets
using powerful computers, and there’s a mathematical theorem that guarantees that
if you do a study, every single piece of accurate information that you release, however small
it seems, will inherently violate the privacy of the participants in that study to some degree. And the more information you publicly release,
the more you violate the individual privacies of the participants. But how do you quantitatively measure something
nebulous like loss of privacy, and then how do you protect it? To understand how to measure privacy, it’s
helpful to start by imagining how somebody would try to use published results (from a study)
and piece together the private information of the people surveyed. They could just try to steal or gain direct
access to the private information itself, which, of course, can’t be protected against mathematically
– it requires good computer security, or physical defenses, so we won’t consider it here! The kind of privacy attack we can defend against
mathematically is an attack that looks at publicly published statistics and then applies
brute force computational power to imagine all possible combinations of answers the participants
could have given to see which ones are the most plausible – that is, which ones fit the
published statistics the best. Imagine checking all possible combinations
of letters and numbers for a password until one of them works, except instead of letters
and numbers it’s checking all possible “combinations-of-the-answers-that-330-million-people-could-give-on-their-census-questionnaires” to see which combinations come closest to
the publicly published figures for average age, racial breakdown, and so on. The more closely a potential combination of
answers matches the published figures, the more promising a candidate it is (from the
attacker’s perspective). The more poorly it matches, the lower their
level of certainty. As a small example, if there are 7 people
living in a particular area and you tell me that four are female, four like ice cream,
four are married adults, three of the ice cream lovers are female, and if you also give
me the mean and median ages for all of these categories, then I can perfectly reconstruct
the exact ages, sex, and ice cream preference of everyone involved. I would start with the 3 ice cream loving
females; even though there are hundreds of thousands of possible combinations of ages
for three people, only a small fraction of those – 36, in fact – are plausible – they’re
in the right combination to give a median age of 36 and a mean age of 36 and two thirds. And the same thing works for the four females
overall – there are almost 10 million possible combinations of ages they could have, but
only 24 age combinations that are consistent with a median of 30, a mean of 33.5, AND with
at least one of the plausible age combinations for the three ice-cream lovers. Continuing on with this kind of deduction
leads to a single plausible (and perfect) reconstruction of all of the ages, sexes,
and ice-cream preferences of the people involved; a 100% violation of privacy. If, however, you didn’t list how many of
the ice cream lovers were female, there would instead be two plausible possibilities, so
I would be less certain which was the true combination of ages and genders and ice cream
preferences. And the potential level of certainty of an
attacker is precisely how we measure the loss of privacy from publishing results of a study. If all possible combinations of ages and sexes
and so on are similarly plausible, then an attacker can’t distinguish between them
very well and so privacy is well protected. But if a small number of the possibilities
are significantly more plausible than the rest, they stand out – and precisely because
they stand out on plausibility, they’re also likely to be close to the truth. So to protect privacy, all possibilities need
to seem similarly plausible, or at least there can’t be plausibility peaks that are too
conspicuous. The potential for plausibility peaks is quantified
mathematically by measuring the maximum slope of the plausibility graph – if the slope never gets too
steep, then you can’t have any sharp peaks of highly plausible possibilities that stand
out. But how do we publish statistics in a way that limits the maximum slope (and possible
peaks) on the plausibilities plot? In practice, the best way to limit an attacker’s
ability to confidently choose one scenario over the other is to randomly change, or “jitter”,
the published values. Like, for example, rolling a die and adding
that number to the average age reported for ice-cream lovers. Jittering the published results in a mathematically
rigorous way puts a limit on the slope of the plausibility graph, and thus makes it
harder for any particular possibilities to stand out above the rest. Jittering results might also seem like lying,
but as long as the size of the adjustment isn’t big enough to make any significant
changes to conclusions people draw from the survey, then it’s considered worth it for
the privacy protection. For example, imagine I want to give you a
sense of my age while keeping my true age secret. If I just told you my age, obviously there’s
just one plausible possibility – 31! But suppose instead that I secretly pulled
a number between minus 5 and 5 out of a hat and added it to my age before telling you.
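This hat trick can be sketched in a few lines (a minimal illustration with made-up function names, assuming whole-number noise drawn uniformly from -5 to 5):

```python
import random

def jitter_age(true_age, spread=5):
    # Report the age after adding uniform whole-number noise in [-spread, +spread].
    return true_age + random.randint(-spread, spread)

def plausible_true_ages(reported_age, spread=5):
    # Every true age that could have produced the reported value;
    # to an attacker, all of these are equally plausible.
    return list(range(reported_age - spread, reported_age + spread + 1))

reported = jitter_age(31)
candidates = plausible_true_ages(reported)  # 11 equally plausible ages
```

Widening the hat to plus-or-minus 10 roughly doubles the number of equally plausible candidates (more privacy), while narrowing it to plus-or-minus 1 shrinks the set to three (more accuracy).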
In this case, all you know is that my true age is somewhere within 5 years of the number
I told you, but you don’t know my age exactly. My privacy has been preserved, though only
to a certain degree, because you can be confident I’m not 20 and not 40. To protect my age more, I’d have to pull
a number between, say, -10 and 10 out of a hat and add it to my age – this increases
the number of plausible possibilities – that is, the possible true ages that COULD have
resulted in the number I told you. It also increases your uncertainty about my
actual age – the tradeoff for privacy is inaccuracy. If I wanted you to know my age within a year,
I could only pull a number between -1 and 1 out of the hat. In general, the idea is this:
more privacy means you get less accuracy. Less privacy means you can have more accuracy. When
you publish results, hopefully there’s a sweet spot where you can share something useful
while still sufficiently maintaining people’s privacy. And simultaneously maintaining decent privacy
and decent accuracy gets easier and easier with larger datasets. Like how as I add more noise to this image,
you can still get the general picture even once you’ve lost any hope of telling the
true original value of a particular pixel. So, to protect people’s privacy, we can
and should randomly jitter published statistics (which the US Census, for example, has been
doing since the 1970s). However, there’s a subtlety – you can’t
just add any old random noise however frequently you want – if I simply add different random
noise to this picture a bunch of different times, once you take the average of all of
the noisy images you basically get back the original clean image – you don’t want this
happening to your data. So, there’s a whole field of computer science
dedicated to figuring out how to add the least possible amount of noise to get both the most
privacy and the most accuracy, and to future-proof the publication of data so that when you publish
multiple jittered statistics about people, those statistics can’t be combined in a
clever way to reconstruct people’s data. But up through the 2010 census, the Census
Bureau couldn’t promise this – sure, they were jittering data published in Census Bureau
tables and charts, but not in a mathematically rigorous way, and so the Census Bureau couldn’t
mathematically promise anything about how much they were protecting our privacy (or
say how badly it’s been violated). Until now! The US 2020 Census will, for the first time,
be using mathematically rigorous privacy protections. One of the biggest benefits of the mathematically
rigorous definition of privacy is that it reliably compounds over multiple pieces of
information – like, if we have a group of people and publish both their average age
and median age, each with a privacy loss factor of 3, then the privacy loss factor for having
released both pieces of information is at most 6. So you can decide on a total cumulative amount
of privacy loss you’re willing to suffer, and then decide whether you want to release,
say, 10 pieces of information each with 1/10th that total privacy loss (and less accuracy),
or if you want to release 1 piece of information with the full privacy loss and a higher level
of accuracy. But how much privacy we need is a really hard question to answer. First, it involves weighing how much we as
society collectively value the possible benefits from accurately knowing stuff about the group
we’re surveying vs the possible drawbacks of releasing some amount of private information. And second, even though those benefits and
drawbacks can be mathematically measured as “accuracy” and “privacy loss”, we
still have to translate the mathematical ideas of “accuracy” and “privacy loss” into
something that’s understandable and relatable to people in our society. That’s partly a goal of this video, in fact! So let’s give it one more shot at a translation. First
and foremost: it is in principle impossible to publish useful statistics based on private
data without in some way violating the privacy of the individuals in question. And if you want to provide a mathematically
guaranteed limit on the amount of privacy violation, you have to randomly jitter the
statistics to protect the private data. The accuracy of the information after being jittered
is generally described probabilistically, by saying something like “if we randomly
jittered the true population of this town a bunch of times, 98% of the time our jittered
statistic would be within 10 people of the true value.” So accuracy has two components: how close
you want your privacy-protected statistic to be to the real answer, and how likely
it is to be that close. The loss of privacy due to the publication
of information is described in terms of how confidently an attacker would be able to single
out a particular possibility for the true data – that is, by the plausibility of different possible
true values for the underlying data. Given the published information, are there
just a few possibilities for the true data? Or are there many, many, plausible possibilities
for what the true data might be? Essentially, loss of privacy is measured by
the prominence of peaks on the plausibility plot. And so the protection of privacy requires
policing the possibility for such peaks. If we individuals are going to willingly participate
in scientific or other studies and surveys or use services where we reveal potentially
sensitive personal information, we should really demand that the researchers or organizations
utilize a mathematically robust way of protecting our privacy. Simply put, if they can’t guarantee there
won’t be a peak in plausibility, then we shouldn’t agree to give them a peek at our data.
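The seven-person reconstruction described earlier can be sketched directly. This is a minimal illustration of the brute-force step for the three ice-cream-loving females, assuming whole-number ages between 1 and 115; a mean of 36 and two thirds over three people means the ages sum to 110:

```python
from itertools import combinations_with_replacement
from statistics import median

# Enumerate every sorted triple of ages and keep only those consistent
# with the published statistics: median age 36, ages summing to 110.
plausible = [
    ages
    for ages in combinations_with_replacement(range(1, 116), 3)
    if median(ages) == 36 and sum(ages) == 110
]
print(len(plausible))  # 36 plausible triples, from (1, 36, 73) up to (36, 36, 38)
```

Each extra published statistic (like the cross-tabulation of sex and ice-cream preference) prunes this candidate list further, which is exactly how the full reconstruction narrows everything down to a single possibility.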
Thanks to the U.S. Census Bureau for supporting this video. The founders of the US understood that an
accurate and complete population count is necessary for the fair implementation of a
representative democracy, so a regular census is enshrined in the US Constitution. The US 2020 Census will be the first anywhere
to use modern, mathematically guaranteed privacy safeguards to protect respondents from today’s
privacy threats. These new safeguards will protect confidentiality
while allowing the Census Bureau to deliver the complete and accurate count of the nation’s
population. They will also give those who rely on census
data increased clarity regarding the impact that statistical safeguards have on their
analyses and decision-making. In short, the Census Bureau views the adoption
of a mathematical guarantee of privacy as a win-win. Here’s how the chief scientist
at the Census Bureau thinks about it: there is a real choice that every curator of confidential
survey data has to make. If they want the respondents to trust them
to protect confidentiality, then the curator has to be prepared to give (and implement)
mathematically provable guarantees of privacy. Unfortunately, this means there’s a constraint
on the amount of information you can publish from confidential data. It’s mathematically impossible to provide
perfectly accurate answers for as many questions or statistics as you want while also protecting
the privacy of respondents. So curators need to do two things: understand
the needs and desires of the people who provided data and the people who want to use the data
in order to determine precisely what balance of accuracy vs privacy to choose, and then
not waste that limited privacy budget by publishing accurate answers to unimportant questions.
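The accuracy-for-privacy trade and the budget arithmetic described above can be sketched with the standard Laplace mechanism from differential privacy. This is a simplified illustration, not the Census Bureau's actual implementation; the statistics, sensitivity, and epsilon values here are made up:

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Add Laplace noise with scale sensitivity/epsilon; a Laplace draw
    # is the difference of two independent exponential draws.
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_value + noise

# Sequential composition: privacy losses add across releases, so releasing
# two statistics at epsilon = 3 each costs at most 3 + 3 = 6 in total.
total_budget = 6.0
noisy_mean = laplace_mechanism(33.5, sensitivity=1.0, epsilon=total_budget / 2)
noisy_median = laplace_mechanism(30.0, sensitivity=1.0, epsilon=total_budget / 2)
```

Halving each epsilon would let a curator release four statistics under the same total budget, at the cost of noisier (less accurate) answers for each one.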