Fast context switching with VMFUNC

VMFUNC is a new Intel primitive that lets a guest switch the EPT page table underneath a VT-x VM without exiting into the hypervisor. Effectively, it is a page table switch done in hardware, so it can be used to build a fast context switch.

How it works

Each VT-x virtual machine is configured with a Virtual Machine Control Structure (VMCS). This is a page of memory in which the VMM writes configuration data for things like how interrupts are handled, initial control register values during guest entry, and a whole bunch of other things.

One of those other things is a pointer to a page of candidate EPT pointers (the EPTP list). These are pointers to different EPT page table hierarchies, each one giving a possibly different physical -> machine mapping. The VMM sets up this page of EPT pointers and must also turn on a couple of other VMCS settings to fully enable EPT switching via VMFUNC: the "enable VM functions" secondary execution control and the EPTP-switching bit in the VM-function controls.

In non-root operation (inside a VM), code running at any privilege level can switch EPT hierarchies through the following steps:

  • Storing 0 in %rax (EPT switching is VMFUNC 0)
  • Storing the index into the candidate EPT table in %rcx
  • Invoking the VMFUNC instruction

The processor will switch EPTs. A successful VMFUNC does not cause a VM exit; an out-of-range index or an invalid EPT pointer in the selected slot does.

All of this is detailed in the Intel SDM Volume 3, 25.5.5.3 "EPT Switching".

It is worth noting that this will not change or save values in control registers (e.g. %cr3), general purpose registers, and so on. It's up to the guest code and the VMM to set things up so it all works gracefully.
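The steps above can be sketched as a toy model. This is plain Python standing in for the hardware, not hypervisor code: the class and method names (VCpu, vmfunc, VMExit) and the EPT root values are made up for illustration, and a zero entry is assumed to stand for an unused slot in the EPTP list.

```python
# Toy model of VMFUNC leaf 0 (EPTP switching). Not real hypervisor code.
EPTP_LIST_ENTRIES = 512  # one 4 KiB page of 8-byte entries


class VMExit(Exception):
    """The modelled VMFUNC traps to the hypervisor instead of switching."""


class VCpu:
    def __init__(self, candidate_eptps):
        # The VMM fills the list; 0 stands for an unused (invalid) slot here.
        padding = [0] * (EPTP_LIST_ENTRIES - len(candidate_eptps))
        self.eptp_list = list(candidate_eptps) + padding
        self.eptp = self.eptp_list[0]
        self.cr3 = 0x1000  # example guest state; vmfunc() never touches it

    def vmfunc(self, leaf, index):
        if leaf != 0:
            raise VMExit("only leaf 0 is modelled")
        if index >= EPTP_LIST_ENTRIES or self.eptp_list[index] == 0:
            raise VMExit("invalid EPTP index")
        self.eptp = self.eptp_list[index]  # switch without a VM exit


cpu = VCpu([0xA000, 0xB000])  # two made-up EPT roots
cpu.vmfunc(0, 1)              # leaf 0, candidate slot 1
print(hex(cpu.eptp), hex(cpu.cr3))
```

The point of the model is that only the active EPT pointer changes; %cr3 and everything else the guest sees stays as it was, which is why the code on both sides of the switch has to be laid out carefully.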

Fast context switch

How do we build a fast context switch? This paper is a good introduction:

  Thwarting Memory Disclosure with Efficient Hypervisor-enforced Intra-domain Isolation

But also, what is the fastest context switch currently? For that we have to look at Gernot's papers, like this one:

  For a Microkernel, a Big Lock Is Fine

These are just the numbers... we need to dig into the seL4 implementation (I can do that later, maybe we'll spend a separate seminar on that).

Discussion

Are TLBs flushed?

No, as long as VPIDs are in use. Cached mappings are invalidated only when VPIDs are disabled (which I, Charlie, would say is rare). See Intel SDM Volume 3, 25.5.5.3 "EPT Switching" and 28.3.3.1 "Operations that Invalidate Cached Mappings".

Is it actually faster than a normal page-table switch with CR3 reload?

It can't be.

Does swapping EPTP (via VMFUNC 0) swap VMCS?

It seems like the active VMCS probably doesn't change. We need to check the Intel docs. If the VMCS isn't changed, then the set of allowed EPTPs doesn't change when moving between a source and a target EPTP. This might make it hard to stop ROP attacks on the source, and it might make the 'springboard' approach to stopping ROP attacks that I (Ryan) described impossible.

Attack #1: VMFUNC faking attack through guest's virtual to physical memory remapping

Guests control their virtual (GVA) to physical (GPA) mappings (through traditional page tables that are fully virtualized in non-root VT-x context). A guest (the Non-secret Compartment in the figure below) can remap its page that contains a vmfunc instruction (light green page) to a virtual address that is mapped as executable in another compartment (light blue page in the Secret Compartment). If the guest then jumps to the vmfunc instruction, it triggers an EPT switch. Since the next instruction after VMFUNC is mapped as executable in the callee compartment, execution continues there without an exception. But since the caller is free to pick any VMFUNC location in its own page, it can effectively pass control anywhere in the callee compartment, thus potentially overwriting the callee's data, or even reading it by moving it into a shared memory region.


GVA, GPA, and machine memory mappings illustrating this attack (Attach:vmfunc-faking-attack.svg | .svg)

Anton: I don't understand how SeCage protects against this attack, can anyone comment?

Charlie: I've gone back and forth, but I think they are covered. The paper doesn't say anything about the virtual->physical mappings changing after a switch. I'm guessing what they do is the virtual->physical mappings are always there (they're the same for all compartments), but the switch will switch EPTs, and certain physical addresses will suddenly become backed by real machine memory. If this is what they do, I think it's a subtle but important point that they left out.

What this means is that if you modify virtual->physical mappings while running under EPT-N, the same mappings will apply under EPT-S (see the figure below). The mapping you set up under EPT-N with the manufactured vmfunc will still be live after the switch, and the secret compartment will keep going through that guest physical address. In terms of your picture, the light blue R/E block will be mapped to a physical address way over on the left. Most likely, when the secret compartment tries to fetch the next instruction, it will trigger an EPT violation (the green block is not mapped in EPT-S's guest physical address space); the case analysis falls under their discussion in 6.2 under "VMFUNC faking attack".


GVA, GPA, and machine memory mappings illustrating why attack fails if guest page tables are shared (Attach:vmfunc-faking-attack-same-page-tables.svg | .svg)
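Charlie's argument can be checked with a toy two-stage translation. This is only a sketch: the page numbers, the frame names (M_green, M_blue) and the fetch helper are invented for illustration, and the guest page table is modelled as a single dictionary shared by both compartments, which is the assumption the explanation rests on.

```python
# Toy two-stage translation: GVA -> GPA (guest page table) -> machine (EPT).
# Page numbers and names are made up for illustration.


class EPTViolation(Exception):
    """The active EPT does not allow the access."""


def fetch(gva, guest_pt, ept):
    """Return the machine page an instruction fetch at `gva` would hit."""
    gpa = guest_pt[gva]       # guest-controlled mapping
    entry = ept.get(gpa)      # VMM-controlled mapping: (machine page, perms)
    if entry is None or "x" not in entry[1]:
        raise EPTViolation(f"GPA {gpa:#x} is not executable in this EPT")
    return entry[0]


# EPT-N backs the attacker's page (GPA 0x1); EPT-S backs secret code (GPA 0x9).
EPT_N = {0x1: ("M_green", "rwx")}
EPT_S = {0x9: ("M_blue", "rx")}

# Shared guest page table after the attacker's remap: GVA 0x40, where the
# secret compartment expects its code, now points at the attacker's GPA 0x1.
guest_pt = {0x40: 0x1}

print(fetch(0x40, guest_pt, EPT_N))   # the forged vmfunc itself is fetchable
try:
    fetch(0x40, guest_pt, EPT_S)      # next fetch, after the EPT switch
except EPTViolation as err:
    print("EPT violation:", err)
```

Under this assumption the forged vmfunc executes, but the very next fetch goes through the same GVA -> GPA entry and lands on a guest physical page that EPT-S does not back.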

Since I spent 30 minutes writing it all out, I'll keep the case analysis I did. Consider where you might write the vmfunc:

  • Case 1: the main compartment writes vmfunc in the heap. When the main compartment invokes vmfunc, there are three possible outcomes:
    • The main compartment would trigger an EPT violation (if the heap is mapped NX in the main compartment); not clear if this is the case. My guess is no.
    • The secret compartment would trigger an EPT violation. The secret compartment uses its own heap that is not shared with the main compartment, correct? (Anton: Not clear, they should be able to share data (like the data path, right? we can assume they set up a special shared region NX for that)). Not clear if the secret compartment ever has access to the main compartment's heap. Since the secret compartments can't do I/O, maybe this is true. (The main compartment also contains the guest OS, right?)
    • It would generate an interrupt in the secret compartment (int3). This is if the main compartment writes the vmfunc at the end of the page, and somehow after the switch, the next instruction happens to be on a secret code page.
  • Case 2: the main compartment writes vmfunc into the data section. I'm guessing this means "data section of the application". The guest OS memory is out of the picture (and just like the heap, is not even mapped in the EPT-S anyway). This is similar to Case 1, but the secret compartment triggers an EPT violation because the data is mapped NX (in Case 1, the memory wasn't mapped at all).
  • Case 3: the main compartment writes into the code section. It can't write into the secret functions' code section since this isn't even mapped in its physical address space. And the EPT-S doesn't have the main compartment's code mapped either, so it shouldn't ever use it. So, this case is similar to Cases 1 and 2.

It's worth noting that an EPT switch can trigger a virtual->physical mapping switch as well: if the new EPT backs the physical address stored in %cr3 with different machine pages, the guest suddenly sees a different set of virtual->physical page tables with entirely different mappings.
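A minimal sketch of that effect, with invented frame names and page numbers: %cr3 holds a guest physical address, and each EPT decides which machine frame (and therefore which page-table contents) sits behind it.

```python
# Toy model: %cr3 holds a guest-physical address; each EPT decides which
# machine frame (and hence which page-table contents) sits behind it.
CR3_GPA = 0x5  # hypothetical guest-physical page holding the root table

# Machine frames holding two different guest page tables (GVA -> GPA).
machine_memory = {
    "frame_pt_n": {0x40: 0x1},
    "frame_pt_s": {0x40: 0x9},
}

# Both EPTs map the same guest-physical page, but to different frames.
EPT_N = {CR3_GPA: "frame_pt_n"}
EPT_S = {CR3_GPA: "frame_pt_s"}


def gva_to_gpa(gva, cr3, ept):
    """Walk the guest page table that the active EPT places behind cr3."""
    page_table = machine_memory[ept[cr3]]
    return page_table[gva]


# Same %cr3 value, same GVA, different result depending on the active EPT.
print(hex(gva_to_gpa(0x40, CR3_GPA, EPT_N)))
print(hex(gva_to_gpa(0x40, CR3_GPA, EPT_S)))
```

The register value never changes; only the machine memory it ends up naming does.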

Anton: I buy Charlie's explanation, but I'm still confused about the last point. What if the non-secret compartment sets up a new page table in a shared heap? Supposedly this can be reduced to the above analysis, right? The guest can control its own page tables anyway. As long as the physical addresses of secret and non-secret code are non-overlapping, the attack fails, right?

Attack #2: Remapping the other side's virtual memory

From Attack #1 above we can conclude that the physical addresses allocated for the secret and non-secret compartments should not overlap. We illustrate this with the figure below, in which the non-secret compartment is allowed to grow on the left side of the two shared pages (trampoline and page table directory), and the secret compartment stays on the right. This isolation invariant ensures that the above virtual page remapping attack is impossible.


Isolation of physical and virtual addresses (Attach:vmfunc-physical-address-space-isolation.svg | .svg)
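The invariant can be stated as a small check a VMM might run over its two EPTs. This is a sketch only: the helper name, the page numbers and the permission strings are hypothetical, and each EPT is reduced to a dictionary keyed by guest physical page.

```python
def private_pages_overlap(ept_n, ept_s, shared):
    """Return guest-physical pages mapped by both EPTs that are not shared."""
    return (set(ept_n) & set(ept_s)) - set(shared)


# Hypothetical layout: non-secret pages on the left, the two shared pages
# (trampoline and page table directory) in the middle, secret on the right.
SHARED = {0x10, 0x11}
EPT_N = {0x0E: "rw", 0x0F: "rx", 0x10: "rx", 0x11: "r"}
EPT_S = {0x10: "rx", 0x11: "r", 0x12: "rx", 0x13: "rw"}

print(private_pages_overlap(EPT_N, EPT_S, SHARED))   # empty set: invariant holds

# If the VMM also mapped secret page 0x12 into EPT-N, the check reports it.
leaky_ept_n = dict(EPT_N)
leaky_ept_n[0x12] = "rx"
print(private_pages_overlap(leaky_ept_n, EPT_S, SHARED))
```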

Charlie, however, points out that without additional measures the following attack is still possible: one compartment can change the GVA to GPA mappings of pages that belong to the other compartment. For example, the non-secret compartment can update the GVA mappings for the right side of memory (allocated to the secret compartment) and trigger unintended code execution inside the secret compartment after the VMFUNC, potentially leaking secrets or constructing arbitrary code execution.


Memory mappings illustrating Attack #2 (Attach:vmfunc-page-confusion-attack.svg | .svg)

Defense: Charlie and I suggest the following: the page table directory (level 1) and each half of the level 2 page table pages should be mapped as R/O in the corresponding compartments. If only half of the level 2 page tables is writable by the non-secret compartment (the half that describes the lower (left) part of physical memory), then the non-secret compartment cannot remap pages in the other half.
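As a rough sketch of the proposed permissions (the names and the "lower"/"upper" labels are invented, and the symmetric rule for the secret compartment is our reading of "corresponding compartments"; the text only spells out the non-secret side):

```python
# Which compartment may edit which level-2 page tables (hypothetical names).
WRITER_OF_HALF = {"lower": "non-secret", "upper": "secret"}


def ept_permission(page, compartment):
    """EPT permission of a guest page-table page in `compartment`'s EPT."""
    if page == "directory":   # shared level-1 directory: R/O for everyone
        return "r"
    return "rw" if WRITER_OF_HALF[page] == compartment else "r"


def can_remap(compartment, half):
    """Can `compartment` rewrite GVA->GPA entries describing `half`?"""
    return "w" in ept_permission(half, compartment)


print(can_remap("non-secret", "lower"))   # its own half stays editable
print(can_remap("non-secret", "upper"))   # the secret half is read-only
```

With these permissions, a write by the non-secret compartment to a page table page covering the other half would fault into the VMM (an EPT violation) instead of silently changing the secret compartment's mappings.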