How to protect privacy in a world awash in data
This week’s big news had to do with a heat map published back in November by a fitness tracking application called Strava. A 20-year-old in Australian noticed that the running data from U.S. military personnel indicated where clandestine bases were in Syria. His insights percolated through security analysts on Twitter, and then to the U.S. Department of Defense.
Now the DOD is re-evaluating its policies around wearables and mobile phones, and will likely look at the social media habits of its soldiers as well. What happened with Strava is nothing new, exactly. On a smaller scale, hackers and spies have used public social media profiles to get all kinds of information on targets.
But there are two things that are different about the Strava case—and worth noting. The first is the scale of it. The second is how two types of data were combined to create new insights. Strava helpfully showed data from more than a billion activities which, when combined with the map, created a clear picture for those who knew what they were looking for, and disclosed more than Strava intended.
Inadvertently disclosing new information will be the new challenge of our age as we connect ourselves and our things to the internet. Each of us will leave ever-larger digital footprints, which can be combined in various ways to provide new information, all of which will be searchable to anyone with an internet connection and an interest.
Short of hiding in a bunker, wrapping your phone in foil, and ditching social media, what is a person — or a concerned employer — to do? The short answer is we don’t know. Even fully grasping the problem is tough. There are several aspects to it.
Most importantly, there’s an increasing amount of data about individuals online that’s fairly easy to get. Then there’s an increasing amount of data about that data, so-called metadata, that’s also easy to find (or subpoena). For example, if your tweets are data, then the location data attached to them are metadata. And this data can now be combined in new ways. In this week’s podcast, privacy analyst Chiara Rustici called this a “toxic combination.”
Finally, once data is out there, it can be reused, repurposed, and reformulated to help draw new conclusions and meanings that were never intended. Imagine if that permanent record your teachers threatened you with back in school were real. In this new era it effectively is.
That’s just the data challenge. There’s also an economic challenge. Data is incredibly cheap. Which means getting data and metadata and creating these toxic combinations is also incredibly cheap. It’s also seen as incredibly valuable to corporations, which is why everything from your toothbrush maker to your coffeepot is trying to snag as much information as it can.
Data may be cheap to get and hold economic value, but it’s also expensive and difficult to secure, which means bad actors can get a hold of your social security number and credit cards with what feels like relative ease. And yet, when data breaches happen the individual is left to pay the inevitable costs as they try to restore their credit, deal with financial fallout, or recover embarrassing secrets.
There’s a link from Strava’s disclosure of military secrets to revenge porn, and it runs through the internet and its ability to make getting information easier than ever. And it relies on our increasing ability to digitize anything from our running routes to our photos.
We’re intellectually aware of all this, but whenever it comes time to do something about it, we throw up our collective hands and keep snapping our naked pics. There are few existing weapons to solve this problem, so let’s take a look at what they are and where they fall short.
Opt-ins and transparency: Many of our apps and devices come with a variety of privacy settings that can range from simple — share or do not share — to byzantine. Strava’s were apparently byzantine, which didn’t help folks that wanted to stay off the heat map. But good privacy settings can only go so far. They don’t stop hackers from accessing data and they also don’t stop toxic combinations of data.
Differential Privacy: Apple made this privacy concept famous. Essentially all data collected gets anonymized and injected with random noise to make it hard to recombine it and determine to whom the data refers. This is good for individuals, but it requires technical overhead and that the company do it correctly. Apple’s talked a good game, but researchers looking at its implementation say it left a lot to be desired. The other challenge is that you can still glean a lot of information from anonymized data. Note that none of the Strava folks were identified.
Collect only what you need: This idea is simple. If you are making a device or app, don’t collect more data than you need. For example, the Skybell doorbell doesn’t keep a user’s Wi-Fi credentials after getting set up on the network because it’s not information the company needs. Most other connected devices don’t share that view, however, which led to LIFX bulbs leaking a bunch of Wi-Fi credentials a few years back. Whoops.
This is a tough issue because in many cases companies collect all this extra data in case they might need it someday. And thanks to improvements in machine learning, they may not be wrong. Applying machine learning to random data sets can yield new insights that could improve the service.
Regulations: All of the above are voluntary things that companies can do as a step toward protecting user privacy, or letting users have more say in how their data is used. But the strongest tools to protect privacy will come from regulatory pressure. This year, the world is about to get a massive amount of regulatory pressure in the form of the General Data Protection Regulation. This regulation was passed by the EU in 2016 and goes into effect in May. It acts as a safeguard for data. It enshrines some of the above items, such as needing a reason to collect a piece of data and providing transparency, but it also goes a lot further.
For example, it allows an individual to ask what a company knows about them, forces the company to correct wrong information, and requires the company to dump the user’s data upon request. It also prohibits profiling on the basis of data. These are only some of the regulation’s provisions, but in my conversation with Rustici, it became clear that the GDPR is so forward-looking that from a technical standpoint, we don’t have ways to actually implement some of these provisions yet.
For example, the ability to retract your permission to use data sounds good, but once that data is sold to a third party or combined to create new insights, how can that data be controlled? How can the new knowledge go away?
So while privacy is a huge challenge and one that we’re still wrapping our arms around, we also need to build tools to track each piece of data about us. Maybe even each piece of metadata. Then we need ways to claw that data back. All of this has to be scalable, which leads me to look to something like the blockchain as a way to track data.
We also need to develop a far more sophisticated understanding of what is known about us and how that knowledge can be applied. Which means that companies creating fun blog posts or heat maps based on a wide array of anonymized data should carefully consider how that information could be used.
We keep saying that data is the new oil, but oil is not a wholly harmless substance. We need to accept that data isn’t, either.