Free Software Tooling for Android Feature Extraction


As part of the Tracking the Trackers project, we are inspecting thousands of Android apps to see what kinds of tracking we can find. We are looking at both the binary APK files and the source code. Source code is of course easy to inspect, since it is already in a form that is meant to be read and reviewed by people. Android APK binaries are a very different story. They are first and foremost a machine-executable format. On top of that, many developers deliberately obfuscate as much as possible in the APK to resist inspection.

That means inspection requires tools that can look into the binary APK format. A massive amount of work already goes into inspecting APKs, because this is required in order to do useful malware analysis. For the most part, these inspection techniques are the malware companies’ “special sauce”, so they are proprietary and generally kept secret. On top of that, malware companies keep secret many of the conclusions they reach about what is useful data to collect, and what should be ignored.

One key piece of the Tracking the Trackers project is to make all of the research, tooling, and conclusions free, open, and publicly available. First and foremost, that means the tools must be free software. They should also be easily installable so the barrier to entry for new inspectors is as low as possible. We focus on getting software into Debian, because once it is there, many people have access to those packages: Ubuntu, Kali, and many other GNU/Linux distros are based on Debian.

What is available in Debian already

Our work with the Debian Android Tools Team and Debian Java Team over the years means many key tools are already included in Debian and its derivatives, including:

  • key Android SDK components like apksigner, dx and android.jar
  • apktool
  • dexdump/dexlist
  • enjarify
  • LibScout
  • libsmali
  • procyon

Tools we are using

One key aspect of our research is working with terabytes of APKs, which is necessary to be able to spot and map out as many trackers as possible. Since feature extraction can be a slow and resource-intensive process, we needed tools that emphasize speed over flexibility. Even with fast extraction tools, we still have to build up tailored processes to speed things up. Some of these straightforward feature extraction processes would take months to run on ~3TB of APKs on a 32-thread machine with 144GB of RAM.
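To illustrate the kind of tailored, parallel process described above, here is a minimal sketch in Python. It is not the project's actual pipeline: the function names `extract_features` and `extract_all`, the chosen features, and the output layout are all assumptions made for illustration. It relies only on the fact that an APK is a ZIP archive, reads just the archive's directory (which is cheap), and emits one JSON object per APK so the results stay machine readable.

```python
import json
import zipfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def extract_features(apk_path):
    """Return a few cheap features read from the APK's ZIP directory.

    Hypothetical feature set, chosen only for illustration.
    """
    features = {"apk": str(apk_path), "error": None}
    try:
        with zipfile.ZipFile(apk_path) as apk:
            names = apk.namelist()
    except (zipfile.BadZipFile, OSError) as exc:
        # Record the failure instead of aborting a run over thousands of files.
        features["error"] = str(exc)
        return features
    features["entry_count"] = len(names)
    features["dex_files"] = sorted(
        n for n in names if n.startswith("classes") and n.endswith(".dex")
    )
    features["native_libs"] = sorted(
        n for n in names if n.startswith("lib/") and n.endswith(".so")
    )
    return features


def extract_all(apk_dir, out, workers=None):
    """Process every *.apk under apk_dir in parallel, one JSON line per APK."""
    apks = sorted(Path(apk_dir).rglob("*.apk"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches the work to reduce inter-process overhead.
        for features in pool.map(extract_features, apks, chunksize=16):
            out.write(json.dumps(features, sort_keys=True) + "\n")
```

Calling `extract_all` with a directory and an open text file writes the results as JSON lines, which can then be filtered or aggregated without re-reading the APKs. Heavier features, such as anything that requires decoding the DEX code, would slot into `extract_features` in the same way but cost far more per file.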

apkverifier, apkparser, and droidlysis are generally useful, but not yet in Debian. So we packaged them to make them easily available. They are currently in the Debian NEW queue, awaiting final review before inclusion.

These tools have been assembled into scripts that run the actual feature extraction processes. The scripts are maintained in the https://gitlab.com/trackingthetrackers/extracted-features repo. When the generated data is small enough and there are no copyright conflicts, the data is also included there. Mostly, the data sets are too large and sometimes touch on copyright restrictions, so they are unfortunately not publicly available.

There are lots of other tasks, including managing large APK collections, gathering data to generate statistics about the features, and downloading publicly available tracker SDKs. The scripts for those tasks are maintained in https://gitlab.com/trackingthetrackers/scripts.

Gradle Plugins

When working with source code, other kinds of analysis become possible. Most Android apps are built with the Gradle tool, so we reviewed a wide range of Gradle plugins and found three that were useful in our investigations.

Tools we reviewed

We looked at quite a few existing tools, and found many interesting and useful ones. While they all produced useful output, many were not a good fit for this project because they were tailored to the use case of a person inspecting a small set of apps. For example, they were too slow or did not produce machine-readable output suitable for working with large APK collections.

  • android_permissions_harvester - for finding which permissions are used based on method calls
  • droidlysis - cryptax’s (aXelle’s) tool: “DroidLysis is a property extractor for Android apps”. See also her talk at hacklu 2019
  • APKiD - “In addition to detecting packers, obfuscators, and other weird stuff, it can also identify if an app was compiled by the standard Android compilers or dexlib”[1]
  • redex - “taking advantage of Redex allows us to normalise the applications prior to analysis”[1]
  • kaitai_struct_formats - generic binary struct parser tool, useful for directly parsing Android classes.dex files.
  • binaryanalysis-ng - a framework for unpacking files recursively and running checks on the unpacked files. Great for someone who needs to inspect small sets of a wide variety of file types.
  • redexer - infer with which parameters the app uses certain permissions (we name this feature RefineDroid)
  • apk-static-xref - statically generate a cross-reference graph (XRG) of a component (e.g., Service) of an Android APK file
  • smalisca - Static Code analysis tool that generates call graphs
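As a small illustration of what directly parsing a classes.dex file involves, the sketch below reads the fixed 112-byte DEX header with Python's `struct` module. It is a hand-written stand-in, not the kaitai_struct_formats parser or any of the tools listed above; the names `DexHeader` and `parse_dex_header` are hypothetical. The field order follows the header layout in the published Dalvik executable format, and the sketch assumes the standard little-endian encoding.

```python
import struct
from collections import namedtuple

# 8-byte magic, uint32 checksum, 20-byte SHA-1 signature, then 20 uint32 fields.
DEX_HEADER = struct.Struct("<8sI20s20I")

DexHeader = namedtuple("DexHeader", [
    "magic", "checksum", "signature",
    "file_size", "header_size", "endian_tag",
    "link_size", "link_off", "map_off",
    "string_ids_size", "string_ids_off",
    "type_ids_size", "type_ids_off",
    "proto_ids_size", "proto_ids_off",
    "field_ids_size", "field_ids_off",
    "method_ids_size", "method_ids_off",
    "class_defs_size", "class_defs_off",
    "data_size", "data_off",
])

ENDIAN_CONSTANT = 0x12345678  # value of endian_tag in a little-endian DEX file


def parse_dex_header(data):
    """Parse the header at the start of a classes.dex byte string."""
    if len(data) < DEX_HEADER.size:
        raise ValueError("data is shorter than a DEX header")
    header = DexHeader._make(DEX_HEADER.unpack_from(data))
    # The magic is b"dex\n" + three version digits + a NUL byte.
    if not (header.magic.startswith(b"dex\n") and header.magic.endswith(b"\x00")):
        raise ValueError("not a DEX file")
    if header.endian_tag != ENDIAN_CONSTANT:
        raise ValueError("unexpected endian tag")
    return header
```

Counts such as `method_ids_size` and `class_defs_size` are available from the header alone, so they can be collected across a large APK collection without decoding any bytecode.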

(This work was supported by NLnet’s NGI Zero PET fund.)