Clean Insights: Privacy-Preserving Measurement

About this Episode

Nathan talks about Clean Insights, the old/new work getting underway on privacy-preserving measurement.

Clean Insights gives developers a way to plug into a secure, private measurement platform. It is focused on assisting in answering key questions about app usage patterns, and not on enabling invasive surveillance of all user habits. Our approach provides programmatic levers to pull to cater to specific use cases and privacy needs. It also provides methods for user interactions that are ultimately empowering instead of alienating.

Follow along with Nathan as he talks through the Clean Insights Overview Presentation

Music courtesy of MusOpen - Delius “2 Pieces for Small Orchestra”

Automated Transcript

This is an automatically produced transcript so apologies for any typos, glitches or other unexpected outcomes!

01:18 Hi, this is Nathan Freitas, and welcome to another edition of En Garde, the Guardian Project podcast. This is our place to share the work we have going on: new ideas, updates of old ideas, progress we plan to make in the coming days, weeks, and months, and generally everything we’re thinking about when it comes to mobile security, privacy, human rights, workers’ rights, journalists’ rights, and the ability to have dignity in your own communication when you’re using your personal portable computing device.

01:51 It’s May 1st, so happy May Day to all of you international essential workers out there and everyone who’s just trying to make it by another day through these difficult times. We’re persevering here as well, trying to keep on track with some of the very important, and what we feel is essential within our own realm, work that we’ve already had underway and want to continue.

02:18 One of these areas of work is around what we call privacy-preserving measurement. This might be considered part of a larger field called ethical analytics, which could fall under the idea of big data. It’s a tricky subject for us because, in large part, we don’t do analytics or big data or measurement: we barely log our website traffic, and we actively work against the idea of tracking users in any way.

02:55 We especially work against gathering raw data in the way that most commercial analytics platforms do. So it’s taken us a while to figure out an approach to understanding how well we’re doing in the world and impacting people, one that we like, that could be useful, and that could be compatible with our worldview around privacy and security.

03:21 Now, about three years ago, I was part of a twelve-week hackathon of sorts called Assembly, run by the Berkman Klein Center at Harvard, where I’ve been a fellow and am now an affiliate fellow. Assembly is a time when we bring together people from industry and academia, open source hackers, and makers, focus on a problem for an extended period, and try to find some new approaches.

03:56 The year that I was in Assembly, there was a focus on security, on the kind of mass insecurity of the Internet of Things, mobile devices, and personal devices, and on what we could do to address it. That was fairly broad, but one thing my group landed on was the idea that a lot of the insecurity was related to surveillance that was happening on users.

04:21 That surveillance seemed necessary from a commercial product standpoint, but it was actually causing a lot of harm. Out of that realization we created something called Clean Insights, focused on this concept of privacy-preserving measurement. That’s what I’m sharing with you today in this talk slash podcast.

04:45 There’s a slide presentation linked from the podcast that you can hopefully follow along with as I’m talking. Getting into the first couple of slides, the core problem is set forth with Guardian Project, which is: we think we’re a success.

05:02 We know how many people download our software, because the app stores tell us. We hear anecdotes that people like what we do, but we don’t have any other way of measuring positive, effective uses. We have ways to interview users, to run consensual focus groups, to talk to organizations, but through the technology itself we don’t have any way to measure.

05:31 Is this feature working? Are people successful in connecting? Are people using the app for a certain amount of time, and where are they getting stuck? Of course, there are commercial packages that enable all of those things, but we don’t feel comfortable dropping in toolkits that log tons of raw data and move it to a third party for hosting.

05:51 We do know that decision makers, developers, data scientists, and people who make products need to understand the effectiveness of their products and user happiness through multiple methods. But we don’t want to do that at the cost of privacy, security, and trust. In the slides there are a few examples. My team at Assembly was a bunch of great folks; I’ll link to our Assembly project there, and hopefully you’ll get to meet some of them in the coming days as they get more involved in this project again.

06:27 We looked at Meitu, which at the time was a photo editing app that needed a ton of permissions, really for selling the user’s data. Why would a photo editing app need to know about your SIM card, your country, and when your device booted? It was clear they had some kind of tracking going on for monetization that went beyond what the app itself did.

06:48 We also saw that many times you could weaponize surveillance infrastructure through JavaScript, by replacing JavaScript files, and cause distributed denial-of-service attacks at Baidu scale, so something like a billion people in China could be weaponized through insecure JavaScript. We saw that Tesla was using the way they measure their cars and driving, which they say is to improve Autopilot, to cover their tracks when there were incidents and they were taken to court: they could say, no, we can see everything that you did. They even did this with a journalist who was reporting on the car, saying the journalist’s test drive was wrong.

07:36 It wasn’t us, they said. And it’s not just technology outside of you; it can be technology inside of you where this sort of measurement goes wrong. Someone with a pacemaker that was logging data about their heart rate was trying to commit fraud, which is not a great thing to do, by lighting their house on fire, and the police forced them to extract the data from the pacemaker to look at the heart rate and see that it didn’t match up with the person’s story about when they said they woke up and got out of the house.

08:09 So data from their own body was used against them. I’m sure whoever logged that data thought of it as being for beneficial purposes, but didn’t understand the potential harm that could be caused to its owner. Again, don’t commit fraud and don’t burn down your own house, but we also don’t want our bodies to be used against us.

08:29 Libraries have also been interested in sharing more about what you do, because it’s interesting to know which books you like, but then all of that data about what you’re reading can be used against you by federal agents under the Patriot Act. This is one reason it’s hard to see what your friends are watching on Netflix and to have a kind of social Netflix as a key feature: video rentals have long been legally protected in this way, and measuring what interests you have in video rentals extends to other areas.

09:02 Now, fortunately, there’s a ton of great existing work on how to do privacy-preserving measurement. Professor Cynthia Dwork is one of the leaders in this area; she created something called differential privacy and has published this work. The idea is that if you take a ton of measurements across many people and many things you’re measuring, and you introduce a certain amount of noise through this algorithm, then no single measurement can necessarily be said to be correct, because you don’t know where the noise has been applied and how it has been applied.

09:44 In aggregate, the statistical outcomes are still accurate, but you can’t reverse any single reading you’ve taken into a specific “let me see what this person or this reading was.” This is an important way to have aggregate understanding, or insights, without exposing people to individual risk.
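To make that concrete, here is a minimal randomized-response sketch in Kotlin. It is a toy illustration of the general technique Nathan is describing, not the Clean Insights implementation: each answer is randomly flipped before it leaves the “device,” so no single report can be trusted, but the true rate can still be estimated from the aggregate.

```kotlin
import kotlin.random.Random

// Toy randomized response: with probability p the true answer is reported,
// otherwise a random answer is reported. No single report is trustworthy,
// but the population-level rate can be recovered from the aggregate.
fun randomizedResponse(trueAnswer: Boolean, p: Double = 0.5): Boolean =
    if (Random.nextDouble() < p) trueAnswer else Random.nextBoolean()

fun main() {
    val truth = List(100_000) { Random.nextDouble() < 0.30 }  // 30% "yes" in reality
    val reports = truth.map { randomizedResponse(it) }

    // De-bias the aggregate: observed = p * actual + (1 - p) * 0.5
    val p = 0.5
    val observed = reports.count { it }.toDouble() / reports.size
    val estimated = (observed - (1 - p) * 0.5) / p

    println("observed rate = $observed, estimated true rate = $estimated")
}
```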

10:15 Apple famously applied this to understanding which emojis people were typing the most. They didn’t want to log every tap on a keyboard and everything that you wrote, but they wanted to know which emojis were popular and how they might surface or promote different emojis, so they used differential privacy to ensure the data was randomized and not associated with an Apple ID, along with a number of other pioneering privacy-preserving measurement techniques. Apple should get great credit for this.

10:36 Google has also done this with RAPPOR, for many years now, almost six, by measuring Chrome users in the same way. They want to understand how Chrome is being used without compromising the privacy of their users; they don’t want to track every single Chrome user in the world who isn’t logged into a Google account. It’s tricky with Google, because clearly they do a lot of tracking, but there are cases like this where they just want to understand certain things.

11:02 Things like the performance of the browser, without tying them to user data. A privacy-preserving reporting algorithm called RAPPOR, based on similar concepts of differential privacy, enabled them to do this.

11:19 So these things exist, but even now, five years after they came out, they’re not really available for developers to just utilize and plug in; it’s still really hard. In the human rights and open source space, Tor has really led with the most practical, public, and longest-running work on privacy-preserving measurement. Funders of Tor wanted to understand where it is being used, and Tor themselves wanted to understand which countries it is used in and where there might be spikes, to detect what would be called censorship events.

12:00 So the Tor Metrics system adds measurement throughout the global volunteer onion routing network, but adds it in the right way and in the right places, so that you’re not leaking information about individual users: you’re just measuring certain traffic at, say, the country level, or you’re measuring throughput and bandwidth, not specific information about specific nodes in the Tor network.
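As a rough sketch of that style of aggregation (illustrative only, not Tor’s actual metrics code), the idea is to keep only coarse per-country counts and to suppress anything small enough to point at individuals; the data class and threshold below are made up for the example.

```kotlin
// Hypothetical country-level aggregation: individual connection records are
// reduced to per-country counts, and small counts are suppressed so that
// rare values can't single anyone out.
data class Connection(val countryCode: String)

fun aggregateByCountry(connections: List<Connection>, minCount: Int = 10): Map<String, Int> =
    connections
        .groupingBy { it.countryCode }
        .eachCount()
        .filterValues { it >= minCount }   // drop counts too small to publish safely

fun main() {
    val sample = listOf("US", "US", "DE", "DE", "DE", "IR").map { Connection(it) } +
            List(25) { Connection("BR") }
    println(aggregateByCountry(sample, minCount = 3))  // {DE=3, BR=25}; US and IR suppressed
}
```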

12:26 So the Tor Metrics site is really a great gold standard to emulate when it comes to the human rights and internet freedom space. We also have Matomo, which used to be called Piwik, a fully open-source, self-hostable analytics system that has SDKs for websites and mobile apps, and you have things like ACRA for doing your own crash handling.
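For reference, recording a single event against a self-hosted Matomo instance over its HTTP tracking endpoint might look roughly like the sketch below. The endpoint path and parameter names follow Matomo’s Tracking API as we understand it (worth double-checking against your Matomo version), and the host and site ID are placeholders to replace with your own.

```kotlin
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

// Sketch: send one event to a self-hosted Matomo server via its tracking
// endpoint (matomo.php). The host and idsite are placeholders.
fun trackEvent(category: String, action: String) {
    fun enc(s: String) = URLEncoder.encode(s, "UTF-8")
    val query = listOf(
        "idsite=1",                  // your Matomo site ID
        "rec=1",                     // required: record the request
        "e_c=${enc(category)}",      // event category
        "e_a=${enc(action)}"         // event action
    ).joinToString("&")

    val url = URL("https://analytics.example.org/matomo.php?$query")
    val conn = url.openConnection() as HttpURLConnection
    conn.requestMethod = "GET"
    println("Matomo responded with HTTP ${conn.responseCode}")
    conn.disconnect()
}

fun main() {
    trackEvent(category = "onboarding", action = "completed")
}
```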

12:53 Right, so there are open source solutions that exist today. They’re fairly straightforward in terms of how they’re implemented, and thus don’t offer a lot of privacy protection other than “trust me, because you’re using my app or website.” So it’s kind of a trust-us model.

13:10 Though you only have to trust the people whose service you’re using, and not a third party, which is a great step, and we really like what Matomo and other projects have done there.

13:22 What we heard, as we started the Clean Insights project, is that developers of things like secure messaging apps have questions. They don’t necessarily want a raw dump of data. We’re one of these developers: we want to understand whether users like a change in the interface, what the battery impact is, or how many conversations people tend to have open, and we want to understand this because otherwise we’re just guessing, without this kind of insight into the way our apps are being used.

13:54 We’ve also surveyed a bunch of users about what their concerns would be around analytics: concerns about misuse, concerns about law enforcement suddenly having access to this data if there’s a subpoena, concerns about data not really being anonymous, and questions like, could we really self-host this, and would that be any more secure?

14:16 So people are interested in this, and there are a lot of concerns. There are concerns from users about more tracking, rightly so; even if they trust you, it’s, well, why are you tracking me? You’re supposed to be the good guys, don’t track me as well. But they are interested in the value of having the apps improved, so we have to find a way.

14:41 We have to walk this tightrope: finding ways to understand how to improve our applications without exposing the user to harm or crossing a line where we would lose their trust. So that brings us to Clean Insights, which is that we do want developers to have a means to understand how to improve, but to do it in a way that respects privacy and security and keeps the trust.

15:05 That’s clear. Part of the way we do this is to really think about threat modeling and understand all of the vulnerable assets, the things we might gather, what we don’t want to gather, and what we don’t want to know. Going through all the places and ways the data could be attacked in an analytics and measurement system, or wherever you’re storing data, is key. This is a whole long discussion to have, but there are a lot of potential mitigations to the harms.

15:38 Things like differential privacy, with cryptographic randomness and noise; data minimization; edge-based data processing on the device itself, so you’re not just sending raw data; hardening the network transports and how data is being sent; using open source, self-hostable code; having limited retention policies, so you’re only retaining data for a week or two or some other period needed to gain understanding; and doing GeoIP averaging to country or regional levels, like Tor has done.
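A hedged sketch of what a couple of those mitigations could look like in practice (the names here are illustrative, not the Clean Insights API): coarsen timestamps to the day before storing anything, keep only a country code rather than an IP address, and enforce a short retention window whenever stored measurements are read back.

```kotlin
import java.util.concurrent.TimeUnit

// Illustrative data minimization: store only a coarse timestamp (day, not
// millisecond) and a coarse location (country code, not IP), and drop
// anything older than the retention window before it is ever reported.
data class Measurement(val event: String, val dayEpoch: Long, val countryCode: String)

fun minimize(event: String, timestampMs: Long, countryCode: String): Measurement {
    val day = TimeUnit.MILLISECONDS.toDays(timestampMs)   // round to the day
    return Measurement(event, day, countryCode)
}

fun applyRetention(stored: List<Measurement>, nowMs: Long, retentionDays: Long = 14L): List<Measurement> {
    val today = TimeUnit.MILLISECONDS.toDays(nowMs)
    return stored.filter { today - it.dayEpoch <= retentionDays }
}

fun main() {
    val now = System.currentTimeMillis()
    val m = minimize("app_opened", now, countryCode = "DE")
    val kept = applyRetention(listOf(m), now)
    println("kept ${kept.size} of 1 measurement(s): $kept")
}
```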

16:13 There’s so much that can be done; it’s just still hard to do. So with Clean Insights, there’s this idea that companies treat this sort of data like gold, like it’s the really valuable thing, but we believe it’s toxic: it’s a toxic byproduct that maybe you need, but you don’t want to handle more of it than necessary, you have to be careful with it, and you want to minimize it.

16:39 So we really want to find ways to serve everyone who feels they need these kinds of insights, as toxic as they could be, to make their products better, and maybe over time they’re seen as less toxic. But we really don’t want to just be vacuuming up everything; we want to be picky and choosy, find the questions we want answered, find ways to answer those questions, and provide technology to answer them.

17:04 We want to answer those questions using measurement while making sure that data is kept secure on the device, over the network, and at the service host. So the three tenets of Clean Insights are: providing hardened security, which means thinking about threat modeling; providing good network security, supporting Tor onion routing for transporting measurement data; and supporting a kind of SDK that can be incorporated off the shelf, with things like data batching for combining and aggregating measurements on the device, smart thresholds where you only measure what you actually need and only once you’ve hit a threshold, no permanent cookies, and rotating IDs for measurement.
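To picture what that SDK behaviour could look like, here is a hypothetical on-device batcher; the class and method names are invented for illustration and are not the actual Clean Insights SDK surface.

```kotlin
import java.util.UUID

// Hypothetical on-device batcher: events are aggregated locally under a
// rotating ID, and only flushed once a threshold is reached, so the app is
// never streaming raw, per-tap data to a server.
class MeasurementBatcher(private val flushThreshold: Int = 50) {
    private var rotatingId: String = UUID.randomUUID().toString()
    private val counts = mutableMapOf<String, Int>()

    fun record(eventName: String) {
        counts[eventName] = (counts[eventName] ?: 0) + 1
        if (counts.values.sum() >= flushThreshold) flush()
    }

    private fun flush() {
        // In a real system this batch would go out over a hardened transport
        // (e.g. onion routing); here we just print and reset local state.
        println("dispatching batch for session $rotatingId: $counts")
        counts.clear()
        rotatingId = UUID.randomUUID().toString()   // no permanent identifier
    }
}

fun main() {
    val batcher = MeasurementBatcher(flushThreshold = 5)
    repeat(7) { batcher.record("message_sent") }    // flushes once at 5, keeps 2 pending
}
```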

17:55 And then there are these advanced anonymity techniques around differential privacy, randomized response, and other things as they become available; Private Join and Compute is a newer one. Being able to easily incorporate these things where they are possible and relevant is key. A final piece beyond all of that is thinking about consent. Right now in measurement and analytics you basically have a binary opt-in or opt-out; there’s some question whether it’s opt-in first or opt-out, but it’s a one-time thing and it’s a simple yes or no. We think there’s a lot more potential for granularity here, like saying: hey, it looks like something is going wrong in the app, mind if we measure for the next five minutes to get some data?

18:45 Or users in a similar geographic area could be sent a prompt: we’ve heard there might be some issues in this area, can we run one quick measurement to see if reachability to the server is fine? Or measuring app network health, like a speed test, and then over time giving some feedback to the user about how well the app is performing for them and allowing them to tell us the same.

19:12 This sort of time-bound measurement, threshold- or performance-based measurement, or waiting until a user is really invested in an app to say, wow, you seem to really like this feature, maybe you can help us make it better. These thresholds are really key, and finding ways to make users our partners in gaining these insights, instead of guinea pigs or lab rats that we’re just watching from afar, is also key.
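One way to picture that kind of granular, time-bounded consent (again a hypothetical sketch, not the Clean Insights API) is a consent record with an expiry that is checked before every measurement:

```kotlin
// Hypothetical time-bounded consent: the user grants measurement for a
// specific campaign and a limited window, and every measurement call checks
// that the grant is still valid before recording anything.
data class ConsentGrant(val campaign: String, val expiresAtMs: Long)

class ConsentStore {
    private val grants = mutableMapOf<String, ConsentGrant>()

    fun grant(campaign: String, durationMs: Long) {
        grants[campaign] = ConsentGrant(campaign, System.currentTimeMillis() + durationMs)
    }

    fun isGranted(campaign: String): Boolean {
        val g = grants[campaign] ?: return false
        return System.currentTimeMillis() < g.expiresAtMs
    }
}

fun main() {
    val consent = ConsentStore()
    // "Mind if we measure for the next five minutes?" -> user taps yes.
    consent.grant("connectivity-check", durationMs = 5 * 60 * 1000L)

    if (consent.isGranted("connectivity-check")) {
        println("measurement allowed for this window")
    } else {
        println("no active consent; nothing is measured")
    }
}
```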

19:38 So, we’ve shipped code; I mean, that’s what you do at Guardian Project. We ship things. We make apps and solutions and SDKs and libraries and push them out into the world, and we make lots of different things. What we’ve already done with Clean Insights is, a few years ago, create a beta, a kind of MVP, of what we thought this would look like. It works with Matomo, formerly Piwik, integrates with our certificate pinning and onion routing code from the NetCipher SDK, and has all of these things implemented around thresholds of time and space.

20:13 That allows you to store measurements, batch them up, and then dispatch them at a certain time once you’re ready, so you’re not constantly sending blips back to a server but instead sending, say, a weekly batch. We’ve also implemented some of the RAPPOR code, the randomized encoder.

20:34 So you can add that noise into your data before it’s sent to the server at all, and be part of a differential-privacy-enabled data set. And we thought really hard about the threat model, all the different things you might want to measure at each step, and what kind of security is needed along the way.

20:52 These things have come together in the Clean Insights SDK beta for Android that we have. That was developed a few years ago, and we’ve been trying to get more funding since then to move it forward, and we have now: we’ve gotten a small grant, and we have funding to do a symposium meeting, which we’ll be sharing more about soon.

21:14 We’re also working to bring our concepts to Android, JavaScript, and Python as well, hopefully. So I’m really excited to share that cleaninsights.org is now available as a website, that we have more content coming on this project, and that we’ll be sharing it through this podcast, through the Clean Insights website, and through our symposium extraordinaire, which will be happening in the coming weeks.

21:42 So thanks for listening to today’s podcast, and check out cleaninsights.org. And, yeah, the door just opened, which means my children are here, homeschooling is in effect, and I’ve got to get back to parenting. I hope all of you are well out there. Have a great weekend.

22:03 Happy May Day, and keep your insights clean. All right, bye, take care.

Keywords - privacy-preserving measurement, ethical analytics