Why AVF?
AVF and pKVM: next-generation Trustworthy Execution Environment for connected products
This post takes a closer look at why Google built the Android Virtualization Framework (AVF), one of the most exciting advances in connected product systems software and security. A good starting point for the impetus behind AVF can be found here, explaining at a high level the importance of trustworthy (vs. just trusted) TEEs. Useful diagrammatic architectural views can be found here. AVF is a joint development between Google’s Android Security and Android Systems teams, with significant contributions from device partners and the open source community.
Why do we need TEEs in Android?
TEEs protect highly sensitive functionality that is expected to withstand compromise of the primary consumer OS, e.g. Android, Linux, iOS, Windows. These OSes contain millions of lines of code and undergo rapid development; though they include sophisticated defenses, they rely on security updates to address vulnerabilities and to raise the cost of developing and scaling exploits. While timely patching protects against common threats, exploitation by attackers with high attack potential is still possible (for a discussion of attack potentials, see this part of my blog on security labels).
Some existing examples of Android TEE components, known as trusted applications (TAs), that are expected to be resilient against compromise of the primary OS:
KeyMint: protecting critical cryptographic execution and material (e.g. private keys)
Widevine: protecting high value media content
Biometrics: protecting biometric data and matching functions
Isolated Compilation: protecting integrity of run-time compiled code
But there are many other potential use cases of mobile hypervisors; for example:
Personal AI: as connected products increasingly leverage generative AI, such as private assistants that grow to understand you better based on your long-term device-based activity, on-device models are imbued with incredibly sensitive and valuable data that must be protected as strongly as a biometric (or stronger!)
Counter-abuse: today, most sophisticated counter-abuse functions run in the cloud due to resource requirements and to protect them from being compromised by the very attackers they seek to counter; deploying counter-abuse tech in a TEE can provide enhanced visibility and detection while still guarding against tampering and evasion. Also, performing counter-abuse on the device is better for privacy since sensitive intelligence signals need not be sent to the cloud.
Heterogeneous compute: as on desktop and in the cloud, connected mobile devices can benefit from the ability to concurrently host multiple operating system environments, such as one for smartphone use and a different one for laptop/desktop use or for real-time workloads in a car.
Manufacturer use cases: in addition to standardized Android TAs, manufacturers may deploy device-specific TAs for custom biometrics or other security functions, proprietary AI services, or really any other firmware component that would previously be deployed via TrustZone (but now realized in a more efficient architecture).
Note that there are different types of TEEs for different threat models: those that run on the same application processor as the primary OS and rely on the processor’s memory protection to defend against software-based attacks (e.g. from apps that exploit vulnerabilities to gain the full privileges of the primary OS), and those that can defend against sophisticated physical attacks. This blog is mostly concerned with the former, given its broad range of user journeys.
Problems with Legacy TEEs
TEEs within connected devices have long been realized through SoC-specific firmware running in ARM TrustZone. Android ecosystem policies did not include stringent independent evaluation requirements on TEEs, and some of them have suffered from serious security issues. Some don’t support basic CPU virtual memory protections. Some effort has been made to create a standard evaluation scheme for TrustZone TEEs, but unfortunately the current standard does not reach moderate attack potential, which is the bare minimum that service providers expect in this type of TEE.
Another problem with legacy TEEs is high maintenance cost due to hardware and software incompatibilities. There are many TrustZone OS solutions in the market, some developed by SoC manufacturers and others offered by third parties. That means that a device manufacturer must manage a complex supply chain, along with the testing and compatibility requirements that come with it, as it deploys products on multiple different SoCs within a product family and over time. Software tools and APIs for TAs are also not well standardized, so a device manufacturer can’t simply implement a single KeyMint TA and have it run on any TEE; it must continuously port the TA using whatever APIs and tooling are provided by each TEE software vendor.
TrustZone also presents production cost and flexibility issues, due to its requirement to statically carve out RAM and peripheral access between the TrustZone and non-TrustZone environments. The inability to dynamically allocate resources for maximum efficiency (within security policy guardrails) is akin to reverting decades of progress in virtual memory operating systems. In addition, because TrustZone software is often provided by SoC manufacturers, OEMs have had limited control and flexibility over the use of the TEE for their use cases.
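To make the cost of a static carve-out concrete, here is a back-of-the-envelope sketch. It is only an illustrative model: the workload names and memory sizes are made-up assumptions, not measurements from any real device, and real allocation policies are far more nuanced.

```python
# Illustrative model only: workload names and sizes are hypothetical,
# not measurements from any real device or TEE.
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    peak_mib: int   # memory the workload needs while it is running
    active: bool    # whether it is running right now


def static_reserved(workloads):
    """Static carve-out: every workload's peak is set aside at boot and never returned."""
    return sum(w.peak_mib for w in workloads)


def dynamic_reserved(workloads):
    """Dynamic assignment: only workloads that are running hold protected memory."""
    return sum(w.peak_mib for w in workloads if w.active)


WORKLOADS = [
    Workload("key-management", 16, True),
    Workload("media-drm", 64, False),
    Workload("biometrics", 32, False),
]

if __name__ == "__main__":
    print("static carve-out:", static_reserved(WORKLOADS), "MiB")
    print("dynamic assignment:", dynamic_reserved(WORKLOADS), "MiB")
```

With these made-up numbers, the static scheme holds memory for all three workloads at all times, while the dynamic scheme holds only what the one running workload needs and leaves the rest to the primary OS.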
AVF: the next-generation Android TEE
AVF is designed to address all these problems:
Assurance: pKVM, the reference hypervisor portion of AVF, is designed to meet high assurance requirements and in 2024 is undergoing high attack potential security evaluation by independent security experts. Manufacturers may use other hypervisors as long as they meet the assurance and compatibility requirements defined in current and future revisions of the Android Compatibility Definition Document (CDD) and Google Mobile Services requirements (for Google-certified Android implementations).
Maintenance: AVF’s standardized interfaces enable a pVM (protected virtual machine) to execute on any compatible hypervisor, with standardized Android-compatible APIs for IPC and virtual memory management. AVF goes through the same rigorous API development and standardized life-cycle process as Android itself, so component and device manufacturers can rely on it (and contribute to it!). pKVM is designed to run on any ARM64-based connected device that can run Linux, regardless of SoC choice.
Efficiency: AVF, like hypervisor technologies in the cloud, enables secure dynamic assignment of resources to workloads, improving production cost and unlocking new use cases.
Flexibility: pVMs are abstracted from SoC-specific hardware details, freeing device manufacturers from dependencies that have traditionally limited their control and flexibility in leveraging TEEs for their specific use cases. OEMs can also control the use of pVMs to ensure that workloads are constrained to system requirements and not exposed to risky third party customizations.
pKVM deserves a little more discussion. As someone who’s worked on hypervisors for a good chunk of his career, I can’t stress enough how amazing pKVM is. A hypervisor for connected devices should have these attributes:
Open source and binary transparent
High assurance (per above), which implies a Type-1 design, where workload memory isolation is enforced at a layer below the primary OSes, which are de-privileged
Supported by a global developer community to ensure long-term efficiency, quality, maintainability, viability, and trust
Standalone Type-1 hypervisors have not been able to meet all these goals. Some are proprietary. Some are open source but were designed for use in a single hardware product line and lack the global community to drive adoption and long-term viability.
Protected KVM (pKVM) addresses all of these challenges. Yes, the hypervisor mode code is very small and able to meet high levels of assurance. Yes, it’s open source. But the key innovation of pKVM is how it was built into Linux without running inside or above the Linux kernel (as was the case with the original Linux KVM architecture). Now, high assurance and Linux may seem like an oxymoron. But the brilliance of pKVM lies in its architectural approach, which relies on the Linux kernel for its rigorous open source development process and for the facilities used to configure and manage hypervisor-based systems. During the boot process, however, the “host” Linux kernel de-privileges itself, removing its ability to access hypervisor or guest memory. This leaves the hypervisor, consisting of only several thousand lines of carefully crafted code (compared to the millions of lines of code in Linux proper), in charge of the secure isolation properties upon which the entire system depends at run time. Because of the Linux integration, security features like verified boot, binary transparency, and OTA security updates are supported naturally as part of the Android security architecture. Updates matter even for a small hypervisor: even if hypervisor bugs are extremely rare, patching needs may arise from CPU errata or novel attack mitigations (e.g. for side channels). It’s also worth noting that while pKVM is the reference hypervisor for AVF, pKVM, as an upstream Linux capability, can be used in non-Android based connected products as well.
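The de-privileging idea can be pictured with a toy model of page ownership. To be clear, this is a conceptual sketch and not pKVM code: the class, method names, and page numbers are all hypothetical, and the real hypervisor enforces isolation through CPU memory-translation hardware rather than a Python dictionary.

```python
# Conceptual toy model, NOT pKVM code: every name here is hypothetical.
from enum import Enum


class Owner(Enum):
    HOST = "host"
    HYP = "hypervisor"
    GUEST = "guest"


class AccessDenied(Exception):
    pass


class ToyHypervisor:
    def __init__(self, num_pages, hyp_pages):
        # Until the boot-time switch, the host kernel can touch every page.
        self.deprivileged = False
        self.owner = {
            page: (Owner.HYP if page in hyp_pages else Owner.HOST)
            for page in range(num_pages)
        }

    def deprivilege_host(self):
        """One-way switch made during boot; this model has no way to undo it."""
        self.deprivileged = True

    def donate_to_guest(self, page):
        """The host gives up a page it owns so a protected guest can use it."""
        if self.owner[page] is not Owner.HOST:
            raise AccessDenied(f"page {page} is not the host's to donate")
        self.owner[page] = Owner.GUEST

    def host_access(self, page):
        """Return True if the host may touch the page, raise AccessDenied otherwise."""
        if self.deprivileged and self.owner[page] is not Owner.HOST:
            raise AccessDenied(
                f"host may not access {self.owner[page].value} page {page}"
            )
        return True


if __name__ == "__main__":
    hyp = ToyHypervisor(num_pages=8, hyp_pages={0, 1})
    hyp.deprivilege_host()
    hyp.donate_to_guest(5)
    print(hyp.host_access(4))  # the host still owns page 4
    try:
        hyp.host_access(5)     # page 5 now belongs to the guest
    except AccessDenied as err:
        print("denied:", err)
```

The point of the sketch is the asymmetry: the host keeps managing the system and handing out pages, but once it has de-privileged itself it can no longer read what it has given away, nor the hypervisor's own memory.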
Going forward, Google is working hard, together with its partner ecosystem, to migrate legacy TA workloads from TrustZone to AVF, while also enabling emerging use cases, such as personalized, private AI. The Isolated Compilation TA has already been delivered to market, and over time Google will develop additional reference implementations that provide efficient, trustworthy firmware that consistently meets high assurance levels while reducing manufacturer and developer cost across the Android ecosystem.