A central focus of the Tracking the Trackers project has been to find simple ways to detect whether a given Android APK app file contains code which tracks the user. The ideal scenario is a simple program that can scan the APK and tell a non-technical user whether it contains trackers, but as decades of experience with anti-virus and malware scanners have clearly demonstrated, scanners will always contain a large degree of approximation and guesswork. Tracking the Trackers grew out of experiments in using machine learning to detect malware. This provided the spark to apply this to privacy issues.
The malware research clearly demonstrates that network domain names and code signatures are quite reliable techniques for identifying malware. This also applies to tracking, since the majority of tracking happens via tracking companies’ SDKs which send data to specific domain names. The hard part is that code signatures and domain names are not easy to reliably extract, and are often easy to obfuscate when someone is looking to hide what an app is actually doing. This is common in malware, and we are also starting to see obfuscation in the world of tracking.
Android gives us a break with its AndroidManifest.xml. It is a hard requirement for Android apps, so it is always there; it contains key declarations that set up how the code is run; and it is easy to extract and parse. So we put extra effort into thinking about the data contained in the AndroidManifest.xml.
Towards the goal of simple scanners for tracking, we are excited by two new data sources that we found in the AndroidManifest.xml that are useful signals for automatically detecting tracking in Android apps: API Key Identifiers and BroadcastReceiver Declarations.
API Key Identifiers
Tracking services provide their customers with servers to submit the
data for processing and analytics. These are usually part of the
service’s API. A common pattern for publicly accessible network APIs
is to require the use of an API Key. This key grants access to the
service and provides a unique identifier for the customer so that the
submitted data goes to the right place. In order to submit the key to
the API, the key data must be identified to the server somehow. That
is the API Key Identifier. This is generally something that never
changes, since changing it could mean locking out all customers. For
example, Google Firebase has used
ga_trackingId as its API Key Identifier for many years. API
Key Identifiers are a great way to track trackers. They are tiny and
easy to extract. Most services require them. The entire set that we
have found is small enough to fit into a single machine learning
search space. And it is quite unlikely that an app would include them
by accident or without having set up a tracking service.
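As a rough illustration of how little work this kind of extraction takes, here is a minimal sketch in Python. It assumes the AndroidManifest.xml has already been decoded from the APK's binary XML into plain text by some other tool, and that the identifiers of interest show up as `<meta-data android:name="...">` entries. The sample manifest and the `com.example.tracker.API_KEY` name are hypothetical placeholders, not entries from any real list.

```python
import xml.etree.ElementTree as ET

# Attribute names in the manifest live in the Android XML namespace.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = "{%s}name" % ANDROID_NS

# Hypothetical lookup set; a real scanner would load a curated list.
KNOWN_API_KEY_IDS = {"ga_trackingId", "com.example.tracker.API_KEY"}

# Hypothetical, already-decoded manifest used only for this example.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.app">
  <application>
    <meta-data android:name="com.example.tracker.API_KEY" android:value="abc123"/>
    <meta-data android:name="org.example.app.THEME" android:value="dark"/>
  </application>
</manifest>"""


def meta_data_names(manifest_xml):
    """Return the set of android:name values of all <meta-data> elements."""
    root = ET.fromstring(manifest_xml)
    return {
        el.get(NAME_ATTR)
        for el in root.iter("meta-data")
        if el.get(NAME_ATTR)
    }


def find_api_key_ids(manifest_xml, known=KNOWN_API_KEY_IDS):
    """Return the known API Key Identifiers declared in the manifest, sorted."""
    return sorted(meta_data_names(manifest_xml) & set(known))
```

Calling `find_api_key_ids(SAMPLE_MANIFEST)` returns only the names that appear in both the manifest and the lookup set, so unrelated `<meta-data>` entries are ignored.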
We also found some evidence of obfuscated API Key Identifiers, although the source has not yet been identified. We found many API Key Identifiers that were not identical but matched a common pattern. This pattern looks like it could be encoding some information.
BroadcastReceiver Declarations
In Android, apps and the system
can publicly broadcast events, and any app can listen for these
events. Some of these events contain detailed information, such as
which song is currently playing. Charging and battery status
can be used to
These broadcast events are generic Android
Intents, for which an app
registers a receiver by name in order to get the info when it is sent.
The specific pieces of interest are the
Like other bits in the AndroidManifest.xml, the BroadcastReceiver Declarations are easy to extract. Unfortunately, BroadcastReceiver Declarations are not nearly as definitive when it comes to marking tracking. They are still worth including, since they are easy to extract, and the whole set of unique, extracted names is small enough to be used as a search space for the machine learning.
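To make the extraction concrete, here is a minimal Python sketch that pulls the declared receivers and their IntentFilter Action names out of a manifest. It assumes the manifest has already been decoded to plain-text XML; the sample manifest and the `.PowerReceiver` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute names in the manifest live in the Android XML namespace.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = "{%s}name" % ANDROID_NS

# Hypothetical, already-decoded manifest used only for this example.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.app">
  <application>
    <receiver android:name=".PowerReceiver">
      <intent-filter>
        <action android:name="android.intent.action.ACTION_POWER_CONNECTED"/>
        <action android:name="android.intent.action.BATTERY_LOW"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""


def receiver_actions(manifest_xml):
    """Map each declared <receiver> name to the action names it filters on."""
    root = ET.fromstring(manifest_xml)
    result = {}
    for receiver in root.iter("receiver"):
        name = receiver.get(NAME_ATTR, "")
        result[name] = [
            action.get(NAME_ATTR)
            for action in receiver.iter("action")
            if action.get(NAME_ATTR)
        ]
    return result


def unique_action_names(manifest_xml):
    """Return the sorted set of action names across all declared receivers."""
    names = set()
    for actions in receiver_actions(manifest_xml).values():
        names.update(actions)
    return sorted(names)
```

Collecting the output of `unique_action_names` across many apps gives the set of unique names described above.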
The scope of how apps can receive data via BroadcastReceivers was also recently narrowed to a large degree by Google, due to privacy concerns. The upside is that apps cannot receive most system-wide broadcasts unless they are already running. The downside is that receivers for those broadcasts are now registered in code at runtime rather than declared in the AndroidManifest.xml, so scanners have to do static code analysis, and perhaps even dynamic analysis, in order to see which BroadcastReceiver IntentFilter Action names an app listens for.
The possibility of false positives is still there. For example, if someone makes a “build flavor” that builds without tracker SDKs but forgets to exclude the API Key Identifiers, then a simple scanner will flag the app as tracking even though it cannot track: the tracker SDK, which is the code that gathers and uploads the tracking data, is not included. In this example, the developer can easily fix it after a scanner flags the app as a tracker, by moving the API key configuration out of the “build flavor”.
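The “build flavor” case can be illustrated with a small Python sketch that cross-checks two signals instead of relying on one. This is only an illustration, not how any existing scanner works. It assumes the list of class names in the app is available from some other tool, and the `com.example.tracker.` package prefix and the verdict labels are hypothetical.

```python
def sdk_present(class_names, sdk_prefixes):
    """True if any class name starts with one of the tracker SDK package prefixes."""
    return any(
        name.startswith(prefix)
        for name in class_names
        for prefix in sdk_prefixes
    )


def review_verdict(api_key_ids_found, class_names, sdk_prefixes):
    """Combine two signals into a coarse label for a human reviewer."""
    has_ids = bool(api_key_ids_found)
    has_sdk = sdk_present(class_names, sdk_prefixes)
    if has_ids and has_sdk:
        return "likely tracker"
    if has_ids:
        # Identifier left behind without the SDK code: possible false positive.
        return "identifier only, needs review"
    if has_sdk:
        return "sdk only, needs review"
    return "no signal"
```

For the flavor described above, `review_verdict(["com.example.tracker.API_KEY"], ["org.example.app.MainActivity"], ["com.example.tracker."])` yields the "identifier only" label rather than a definite tracker verdict.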
A trickier case to review is when an app includes opt-in tracking. We believe that opt-in tracking and data reporting should not be flagged as a tracker, especially when the opt-in user experience makes it clear to the user what data is being gathered, and under what conditions it is being sent. Even in that case, a simple scanner will still flag the app, since it contains the API Key Identifier.
This is why we think that machine learning is very promising for tracking apps that track us. There are many good signals, but none of them definitely mark an app as a tracker. They must always be considered as a group with the whole picture, and given well-labeled data, machine learning can do this kind of task quite accurately.
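One common way to hand a group of signals to a learning algorithm is to encode each app as a fixed-length binary vector over the set of all extracted names. The Python sketch below shows only that encoding step, under the assumption that each app has already been reduced to a set of feature strings. The `meta:` and `action:` prefixes and the app names are hypothetical, and no particular classifier is implied.

```python
def build_vocabulary(apps):
    """Return the sorted list of every feature string seen across all apps."""
    vocabulary = set()
    for features in apps.values():
        vocabulary.update(features)
    return sorted(vocabulary)


def to_vector(features, vocabulary):
    """Encode one app as 0/1 flags, one position per vocabulary entry."""
    return [1 if name in features else 0 for name in vocabulary]


# Hypothetical extracted features for two apps.
APPS = {
    "org.example.a": {
        "meta:com.example.tracker.API_KEY",
        "action:android.intent.action.BOOT_COMPLETED",
    },
    "org.example.b": {
        "action:android.intent.action.BOOT_COMPLETED",
    },
}

VOCABULARY = build_vocabulary(APPS)
VECTORS = {app: to_vector(features, VOCABULARY) for app, features in APPS.items()}
```

Because every app is encoded against the same vocabulary, the vectors can be compared position by position, and labeled examples can then be passed to whichever classifier is being evaluated.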
Join the Hunt!
Finding API Key Identifiers is work that can be done in bite-sized
pieces, by people in their spare time. Many if not most tracker SDKs
require API keys in order to use their service, so start by looking
through ETIP for
entries that are missing
Api_key_ids values. Usually, the API Key Identifier is
documented in the SDK's developer documentation. There are also many
SDKs which set the API Key via a method
rather than a declaration in an XML file. In that case, the API Key
Identifier might be found by reading the strings out of the JAR
file. We also welcome more information about BroadcastReceiver
declarations. We are tracking new data sources and approaches in our
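For SDKs that set the API Key in code, reading the strings out of the JAR file can be sketched in a few lines of Python. This is a rough approach under stated assumptions: it scans each `.class` entry for runs of printable ASCII rather than parsing the class file format properly, and the keyword hints used to narrow down candidates are hypothetical guesses, not a vetted list.

```python
import re
import zipfile

# Runs of at least six printable ASCII bytes; shorter runs are mostly noise.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def jar_strings(jar):
    """Return the set of printable strings found in all .class entries.

    `jar` may be a file path or a binary file-like object.
    """
    found = set()
    with zipfile.ZipFile(jar) as archive:
        for entry in archive.namelist():
            if entry.endswith(".class"):
                data = archive.read(entry)
                found.update(
                    run.decode("ascii") for run in PRINTABLE_RUN.findall(data)
                )
    return found


def candidate_key_ids(strings, hints=("key", "token", "app_id")):
    """Keep only strings containing one of the (hypothetical) keyword hints."""
    return sorted(
        s for s in strings if any(hint in s.lower() for hint in hints)
    )
```

The output is only a list of candidates; each one still needs to be checked by hand against the SDK's documentation before being recorded as an API Key Identifier.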
For any kind of mass scanning to be usable, future work should focus on expanding the set of easy-to-extract features and finding which of those are useful. Complicated and resource-intensive extractions like domain names, code signatures, and source/sink tracing still hold promise for delivering high accuracy, but would likely remain useful only when scanning individual apps or small sets of apps.
(This work was supported by NLnet’s NGI Zero PET fund.)