#### Lecture 14: Virtualization

Anton Burtsev, November 2021

#### Traditional operating system





#### Virtual machines



#### A bit of history

- Virtual machines were popular in the 1960s-70s
  - Share resources of mainframe computers [Goldberg 1974]
  - Run multiple single-user operating systems
- Interest was lost by the 1980s-90s
  - Development of multi-user OS
  - Rapid drop in hardware cost
- Hardware support for virtualization was lost





#### Virtual machine

Efficient duplicate of a real machine

- Compatibility
- Performance
- Isolation





#### What needs to be emulated?

- CPU and memory
  - Register state
  - Memory state
- Memory management unit
  - Page tables, segments
- Platform
  - Interrupt controller, timer, buses
- BIOS
- Peripheral devices
  - Disk, network interface, serial line

#### x86 is not virtualizable

- Some sensitive instructions read or update the state of the virtual machine but don't trap, because they are not privileged
  - 17 sensitive, non-privileged instructions [Robin et al 2000]

#### x86 is not virtualizable (II)

| Group                                | Instructions                             |
|--------------------------------------|------------------------------------------|
| Access to interrupt flag             | pushf, popf, iret                        |
| Visibility into segment descriptors  | lar, verr, verw, lsl                     |
| Segment manipulation instructions    | `pop <seg>`, `push <seg>`, `mov <seg>`   |
| Read-only access to privileged state | sgdt, sldt, sidt, smsw                   |
| Interrupt and gate instructions      | fcall, longjump, retfar, str, `int <n>`  |

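#### Examples

- Executed without sufficient privilege, popf silently leaves the interrupt flag (IF) unchanged instead of trapping
  - Impossible to detect when the guest disables interrupts
- push %cs lets the guest read the code segment selector (%cs) and learn its CPL
  - The guest kernel sees that it is not running in Ring 0 and gets confused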
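
The popf problem can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyCPU`, `Trap`, and the two methods are hypothetical names, and only the IF bit of EFLAGS is modeled. It contrasts a privileged instruction (`cli`, which traps and can therefore be emulated) with a sensitive but non-privileged one (`popf`, which silently does nothing to IF).

```python
IF_MASK = 1 << 9  # the interrupt flag is bit 9 of EFLAGS


class Trap(Exception):
    """Raised when an instruction faults and control goes to the monitor."""


class ToyCPU:
    """Hypothetical model: only CPL, IOPL, and the IF bit are represented."""

    def __init__(self, cpl, iopl=0):
        self.cpl = cpl
        self.iopl = iopl
        self.eflags = IF_MASK  # interrupts start enabled

    def cli(self):
        # Privileged behaviour: traps when CPL > IOPL, so a monitor can emulate it.
        if self.cpl > self.iopl:
            raise Trap("cli")
        self.eflags &= ~IF_MASK

    def popf(self, value):
        # Sensitive but not privileged: when CPL > IOPL the IF bit is
        # silently preserved and no trap is raised.
        if self.cpl > self.iopl:
            value = (value & ~IF_MASK) | (self.eflags & IF_MASK)
        self.eflags = value


guest_kernel = ToyCPU(cpl=3)   # a deprivileged guest kernel
guest_kernel.popf(0)           # the guest believes interrupts are now disabled
print(bool(guest_kernel.eflags & IF_MASK))
```

The monitor never gets a chance to record that the guest wanted interrupts off, which is exactly the case trap-and-emulate cannot handle.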

#### Solution space

- Parse the instruction stream and detect all sensitive instructions dynamically
  - Interpretation (BOCHS, JSLinux)
  - Binary translation (VMware, QEMU)
- Change the operating system
  - Paravirtualization (Xen, L4, Denali, Hyper-V)
- Make all sensitive instructions privileged!
  - Hardware-supported virtualization (Xen, KVM, VMware)
    - Intel VT-x, AMD SVM

#### Basic blocks of a virtual machine monitor: QEMU example



#### Interpreted execution: BOCHS, JSLinux



#### What does it mean to run a guest?

- Bochs internal emulation loop
- Similar to a non-pipelined CPU such as the 8086
- How many cycles per instruction?
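
An emulation loop of this kind can be sketched as follows. This is not Bochs code: the tiny instruction set (`mov`, `add`, `dec`, `jnz`, `hlt`), the tuple encoding, and the `run` function are assumptions made for illustration. The point is that every guest instruction costs one full fetch, decode, and dispatch round in the host.

```python
def run(program, regs, max_steps=1000):
    """Interpret a toy program; return the number of guest instructions executed."""
    pc = 0
    steps = 0
    while pc < len(program) and steps < max_steps:
        opcode, *operands = program[pc]      # fetch and decode
        pc += 1
        steps += 1
        if opcode == "mov":                  # mov dst, immediate
            dst, imm = operands
            regs[dst] = imm
        elif opcode == "add":                # add dst, src
            dst, src = operands
            regs[dst] = (regs[dst] + regs[src]) & 0xFFFFFFFF
        elif opcode == "dec":                # dec dst
            (dst,) = operands
            regs[dst] = (regs[dst] - 1) & 0xFFFFFFFF
        elif opcode == "jnz":                # jump to target if reg != 0
            reg, target = operands
            if regs[reg] != 0:
                pc = target
        elif opcode == "hlt":
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return steps


# Sum 5 + 4 + 3 + 2 + 1 in eax.
program = [
    ("mov", "eax", 0),
    ("mov", "ecx", 5),
    ("add", "eax", "ecx"),
    ("dec", "ecx"),
    ("jnz", "ecx", 2),
    ("hlt",),
]
regs = {}
executed = run(program, regs)
print(regs["eax"], executed)
```

Each pass through the `if`/`elif` chain executes many host instructions for one guest instruction, which is where the cycles-per-instruction question comes from.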

#### Binary translation: VMware/QEMU

```
int isPrime(int a) {
  for (int i = 2; i < a; i++) {
    if (a % i == 0) return 0;
  }
  return 1;
}

; Compiled guest code (input to the translator)
isPrime:   mov   %ecx, %edi      ; %ecx = %edi (a)
           mov   %esi, $2        ; i = 2
           cmp   %esi, %ecx      ; is i >= a?
           jge   prime           ; jump if yes
nexti:     mov   %eax, %ecx      ; set %eax = a
           cdq                   ; sign-extend
           idiv  %esi            ; a % i
           test  %edx, %edx      ; is remainder zero?
           jz    notPrime        ; jump if yes
           inc   %esi            ; i++
           cmp   %esi, %ecx      ; is i >= a?
           jl    nexti           ; jump if no
prime:     mov   %eax, $1        ; return value in %eax
           ret
notPrime:  xor   %eax, %eax      ; %eax = 0
           ret

; Translation of the first block
isPrime':  *mov  %ecx, %edi      ; IDENT
           mov   %esi, $2
           cmp   %esi, %ecx
           jge   [takenAddr]     ; JCC
           jmp   [fallthrAddr]

; Translated code after further blocks are translated
isPrime':  *mov  %ecx, %edi      ; IDENT
           mov   %esi, $2
           cmp   %esi, %ecx
           jge   [takenAddr]     ; JCC
                                 ; fall-thru into next CCF
nexti':    *mov  %eax, %ecx      ; IDENT
           cdq
           idiv  %esi
           test  %edx, %edx
           jz    notPrime'       ; JCC
                                 ; fall-thru into next CCF
           *inc  %esi            ; IDENT
           cmp   %esi, %ecx
           jl    nexti'          ; JCC
           jmp   [fallthrAddr3]
notPrime': *xor  %eax, %eax      ; IDENT
           pop   %r11            ; RET
           mov   %gs:0xff39eb8(%rip), %rcx ; spill %rcx
           movzx %ecx, %r11b
           jmp   %gs:0xfc7dde0(8*%rcx)
```

#### Interpreted execution revisited: Bochs



#### Instruction trace cache

- How to make this loop faster?



#### Instruction trace cache

- 50% of execution time is spent in the main loop
  - Fetch, decode, dispatch
- Trace cache (Bochs v2.3.6)
  - Hardware idea (Pentium 4)
  - Trace of up to 16 instructions (32K entries)
- 20% speedup
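
The idea of caching decoded traces can be sketched in a few lines. This is an illustrative model, not the Bochs implementation: `TraceCache`, `decode`, the string instruction format, and the rule that a trace ends at a branch are all assumptions, and the sketch ignores the fixed 32K-entry capacity and invalidation on self-modifying code.

```python
MAX_TRACE_LEN = 16  # trace length limit quoted above


def decode(raw):
    """Stand-in for the expensive decode step: 'add eax 1' -> ('add', 'eax', '1')."""
    return tuple(raw.split())


class TraceCache:
    def __init__(self, memory):
        self.memory = memory   # guest code as a list of raw instruction strings
        self.traces = {}       # start address -> list of decoded instructions
        self.decoded = 0       # number of decode() calls, kept for illustration

    def lookup(self, pc):
        trace = self.traces.get(pc)
        if trace is None:      # miss: decode a new trace starting at pc
            trace = []
            addr = pc
            while addr < len(self.memory) and len(trace) < MAX_TRACE_LEN:
                insn = decode(self.memory[addr])
                self.decoded += 1
                trace.append(insn)
                addr += 1
                if insn[0] in ("jmp", "jnz", "ret"):  # assumed: a branch ends the trace
                    break
            self.traces[pc] = trace
        return trace


code = ["add eax 1", "dec ecx", "jnz ecx 0", "ret"]
cache = TraceCache(code)
for _ in range(100):           # a hot loop re-enters address 0 again and again
    trace = cache.lookup(0)
print(len(trace), cache.decoded)
```

After the first iteration the loop body is never decoded again, so fetch and decode costs are paid once per trace rather than once per executed instruction.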

#### Improve branch prediction

```
void BX_CPU_C::SUB_EdGd(bxInstruction_c *i)
{
  Bit32u op2_32, op1_32, diff_32;

  op2_32 = BX_READ_32BIT_REG(i->nnn());

  // mispredicted branch: 20-cycle penalty on Core 2 Duo
  if (i->modC0()) {   // reg/reg format
    op1_32 = BX_READ_32BIT_REG(i->rm());
    diff_32 = op1_32 - op2_32;
    BX_WRITE_32BIT_REGZ(i->rm(), diff_32);
  }
  else {              // mem/reg format
    read_RMW_virtual_dword(i->seg(), RMAddr(i), &op1_32);
    diff_32 = op1_32 - op2_32;
    write_RMW_virtual_dword(diff_32);
  }

  SET_LAZY_FLAGS_SUB32(op1_32, op2_32, diff_32);
}
```

#### Improve branch prediction

- Split handlers to avoid conditional logic
  - Decide the handler at decode time (15% speedup)
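
Splitting a handler might look like the sketch below. The names (`Insn`, `sub_reg_reg`, `sub_mem_reg`, `decode_sub`, `execute`) and the dictionary-based CPU state are hypothetical. The operand-format test runs once in the decoder, and the execution path calls the chosen handler without testing the format again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Insn:
    handler: Callable   # chosen once, at decode time
    dst: object         # register name (str) or memory address (int)
    src: str            # register name


def sub_reg_reg(cpu, insn):
    regs = cpu["regs"]
    regs[insn.dst] = (regs[insn.dst] - regs[insn.src]) & 0xFFFFFFFF


def sub_mem_reg(cpu, insn):
    mem, regs = cpu["mem"], cpu["regs"]
    mem[insn.dst] = (mem[insn.dst] - regs[insn.src]) & 0xFFFFFFFF


def decode_sub(dst, src):
    """The reg/mem test happens here, once per decoded instruction."""
    handler = sub_reg_reg if isinstance(dst, str) else sub_mem_reg
    return Insn(handler, dst, src)


def execute(cpu, insn):
    insn.handler(cpu, insn)   # no operand-format branch on the hot path


cpu = {"regs": {"eax": 10, "ebx": 3}, "mem": {0x1000: 7}}
execute(cpu, decode_sub("eax", "ebx"))     # register destination
execute(cpu, decode_sub(0x1000, "ebx"))    # memory destination
print(cpu["regs"]["eax"], cpu["mem"][0x1000])
```

Combined with a cache of decoded instructions, the decode-time decision is amortized over every later execution of the same instruction.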

#### Resolve memory references without misprediction

- Bochs v2.3.5 has 30 possible branch targets for the effective address computation
  - Effective Addr = (Base + Index × Scale + Displacement) mod 2<sup>AddrSize</sup>
  - Special cases, e.g. Effective Addr = Base or Effective Addr = Displacement
  - 100% chance of misprediction
- Two techniques to improve prediction:
  - Reduce the number of targets: leave only 2 forms
  - Replicate indirect branch point
- 40% speedup
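
The general form of the effective address formula can be written as one function. This is a minimal sketch: the function name and keyword arguments are assumptions, and treating absent components as zero is just one way to picture how many special-case forms collapse into fewer targets.

```python
def effective_address(base=0, index=0, scale=1, displacement=0, addr_size=32):
    """(Base + Index * Scale + Displacement) mod 2**AddrSize."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) % (1 << addr_size)


# The special cases are the general form with some components left at zero.
print(hex(effective_address(base=0x1000)))
print(hex(effective_address(displacement=0x20)))
print(hex(effective_address(base=0x1000, index=4, scale=8, displacement=0x10)))
```

The modulo models address wrap-around, so a negative displacement or an overflowing sum still yields a value within the address size.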

#### Time to boot Windows

|             | Pentium III (1000 MHz) | Pentium 4 (2533 MHz) | Core 2 Duo (2666 MHz) |
|-------------|------------------------|----------------------|-----------------------|
| Bochs 2.3.5 | 882                    | 595                  | 180                   |
| Bochs 2.3.6 | 609                    | 533                  | 157                   |
| Bochs 2.3.7 | 457                    | 236                  | 81                    |

#### Cycle costs

|                                             | Bochs 2.3.5 | Bochs 2.3.7 | QEMU 0.9.0 |
|---------------------------------------------|-------------|-------------|------------|
| Register move (MOV, MOVSX)                  | 43          | 15          | 6          |
| Register arithmetic (ADD, SBB)              | 64          | 25          | 6          |
| Floating point multiply                     | 1054        | 351         | 27         |
| Memory store of constant                    | 99          | 59          | 5          |
| Pairs of memory load and store operations   | 193         | 98          | 14         |
| Non-atomic read-modify-write                | 112         | 75          | 10         |
| Indirect call through guest EAX register    | 190         | 109         | 197        |
| VirtualProtect system call                  | 126952      | 63476       | 22593      |
| Page fault and handler                      | 888666      | 380857      | 156823     |
| Best case peak guest execution rate in MIPS | 62          | 177         | 444        |

#### Paravirtualization: Xen

#### Full virtualization

- Complete illusion of physical hardware
  - Trap <u>all</u> sensitive instructions
  - Example: page table update




#### Performance problems

- Traps are slow
- Binary translation is faster
  - For some events



#### Paravirtualization

- No illusion of hardware
- Instead: paravirtualized interface
  - Explicit hypervisor calls to update sensitive state
    - Page tables, interrupt flag
- But the guest OS needs porting
  - Applications run natively in Ring 3
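
An explicit hypervisor call for a page table update could be modeled as below. This is not the Xen interface: `Hypervisor`, `grant_frame`, and `hypercall_update_pte` are hypothetical names, and ownership tracking is reduced to one dictionary. The sketch shows the key property that the guest asks for the update and the hypervisor validates it before applying it.

```python
class Hypervisor:
    """Toy model: tracks frame ownership and applies validated mappings."""

    def __init__(self):
        self.owner = {}         # machine frame number -> owning guest id
        self.page_tables = {}   # (guest id, virtual page) -> machine frame number

    def grant_frame(self, guest_id, frame):
        self.owner[frame] = guest_id

    def hypercall_update_pte(self, guest_id, vpage, frame):
        """Explicit call used instead of a trapped write to the page table."""
        if self.owner.get(frame) != guest_id:
            return -1           # refuse to map a frame the guest does not own
        self.page_tables[(guest_id, vpage)] = frame
        return 0


hv = Hypervisor()
hv.grant_frame(guest_id=1, frame=42)
print(hv.hypercall_update_pte(1, 0x10, 42))   # own frame: accepted
print(hv.hypercall_update_pte(1, 0x11, 43))   # frame not owned: rejected
```

Because the call is explicit, the guest can also batch several updates into one transition, which a trap on every page table write cannot do.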

#### Paravirtualization





#### Hardware support for virtualization: KVM



#### New mode of operation: VMX root

- VMX root operation
  - 4 privilege levels
- VMX non-root operation
  - 4 privilege levels as well, but unable to invoke VMX root instructions
  - The guest runs until an event (e.g. an exception) causes a VM exit
  - Rich set of exit events
  - Guest state and exit reason are stored in the VMCS

#### Virtual machine control structure (VMCS)

- Guest State
  - Loaded on entries
  - Saved on exits
- Host State
  - Saved on entries
  - Loaded on exits
- Control fields
  - VM-execution, VM-exit, and VM-entry controls
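
How a monitor uses the guest-state area and the exit reason can be sketched as a run loop. Everything here is a simulation under stated assumptions: the VMCS is a plain dictionary, `toy_guest` stands in for native guest execution, the exit reasons are strings, and every toy instruction is assumed to be 2 bytes long. No real VMX instructions or KVM interfaces are used.

```python
def toy_guest(state):
    """Hypothetical guest: runs until the next instruction that forces a VM exit."""
    program = {0: "cpuid", 2: "io", 4: "hlt"}   # guest rip -> exiting instruction
    return program[state["rip"]]


def handle_cpuid(vmcs):
    vmcs["guest"]["rax"] = 0x1234   # emulate the instruction on the guest's behalf
    vmcs["guest"]["rip"] += 2       # skip the emulated instruction


def handle_io(vmcs):
    vmcs["guest"]["rip"] += 2       # pretend the device access was emulated


def run_vm(vmcs, guest, handlers, max_exits=100):
    exits = []
    for _ in range(max_exits):
        reason = guest(vmcs["guest"])   # VM entry: guest runs on its saved state
        vmcs["exit_reason"] = reason    # VM exit: the reason is recorded in the VMCS
        exits.append(reason)
        if reason == "hlt":
            break
        handlers[reason](vmcs)          # monitor emulates, then re-enters the guest
    return exits


vmcs = {"guest": {"rip": 0, "rax": 0}, "exit_reason": None}
print(run_vm(vmcs, toy_guest, {"cpuid": handle_cpuid, "io": handle_io}))
```

The handlers touch only the guest-state area, so the next entry resumes the guest with the emulated result already in its registers.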

#### Guest state

- Register state
- Non-register state
  - Activity state:
    - active
    - inactive (HLT, shutdown, wait for startup IPI, i.e. an interprocessor interrupt)
  - Interruptibility state

#### Host state

- Only register state
  - ALU registers
- Also:
  - Base page table address (CR3)
  - Segment selectors
  - Global descriptor table
  - Interrupt descriptor table

#### VM-execution controls

(asynchronous events control)



#### VM-execution controls

(synchronous events control, not all reasons are shown)



#### Exception bitmap

(one bit for each of the 32 IA-32 exceptions)

- IA-32 defines 32 exception vectors (interrupts 0-31)
- Each of them can be configured to cause a VM exit or not


#### KVM



#### Nested page tables



#### Page table lookup



- 4-level page table

#### Nested page table lookup
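
The cost of a nested lookup can be estimated by counting memory references. This sketch assumes a worst case with no TLB hits and no page-walk caches, and the function name is hypothetical. Every guest page table pointer is a guest-physical address, so it needs its own walk of the host (nested) table before the guest entry can be read.

```python
def nested_walk_references(guest_levels=4, host_levels=4):
    """Count memory references for one guest-virtual to host-physical translation."""
    refs = 0
    for _ in range(guest_levels):
        refs += host_levels   # translate the guest-physical address of this table
        refs += 1             # read the guest page table entry itself
    refs += host_levels       # translate the final guest-physical data address
    return refs


print(nested_walk_references(4, 4))
```

The count equals (guest_levels + 1) * (host_levels + 1) - 1, compared with 4 references for a native 4-level walk.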



#### Efficient I/O

#### Where is the bottleneck

- What is the bottleneck in case of virtualization?
  - CPU?
    - CPU bound workloads execute natively on the real CPU
    - Sometimes JIT compilation (binary translation) makes them even faster [Dynamo]
  - Everything that runs inside the VM is fast!
- What is the most frequent operation disturbing the execution of a VM?
- Device I/O!
  - Disk, Network, Graphics

#### Xen









#### How to make the I/O fast?

- Take into account the specifics of device-driver communication
  - Bulk
    - Large packets (512 B to 4 KB)
  - Session oriented
    - Connection is established once (during boot)
    - No short IPCs, like function calls
    - Costs of establishing an IPC channel are irrelevant
  - Throughput oriented
    - Devices have high delays anyway
  - Asynchronous
    - Again, no function calls, devices are already asynchronous

#### Shared rings and events
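
A shared ring can be sketched as a single-producer, single-consumer circular buffer. This is an illustration, not Xen's ring implementation: `SharedRing`, `push`, and `pop_all` are hypothetical names, both sides live in one process, and the event notification is left out.

```python
class SharedRing:
    """Single-producer, single-consumer ring of request slots."""

    def __init__(self, size=8):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * size
        self.size = size
        self.prod = 0   # advanced only by the producer (e.g. a guest front-end)
        self.cons = 0   # advanced only by the consumer (e.g. a back-end driver)

    def push(self, request):
        if self.prod - self.cons == self.size:
            return False                      # ring is full
        self.slots[self.prod % self.size] = request
        self.prod += 1
        return True

    def pop_all(self):
        """Drain everything queued so far: one event can cover many requests."""
        batch = []
        while self.cons != self.prod:
            batch.append(self.slots[self.cons % self.size])
            self.cons += 1
        return batch


ring = SharedRing(size=4)
for block in range(3):
    ring.push(("read", block))   # producer queues requests without waiting
print(ring.pop_all())            # consumer handles the whole batch at once
```

Each index is written by only one side, so producer and consumer never update the same field, and a single notification can be amortized over a batch of requests.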













#### Where is the performance bottleneck here?



#### Eliminate cache thrashing



#### GPUs

- Sending frames from the framebuffer
  - No hardware acceleration
  - Too slow
- OpenGL/DirectX level virtualization
  - Send high-level OpenGL commands over rings
  - OpenGL operations will be executed on the real GPU

#### Devices supporting virtualization



#### References

- A Comparison of Software and Hardware Techniques for x86 Virtualization. Keith Adams, Ole Agesen. ASPLOS'06.
- Bringing Virtualization to the x86 Architecture with the Original VMware Workstation. Edouard Bugnion, Scott Devine, Mendel Rosenblum, Jeremy Sugerman, Edward Y. Wang. ACM TOCS'12.
- Virtualization Without Direct Execution or Jitting: Designing a Portable Virtual Machine Infrastructure. Darek Mihocka, Stanislav Shwartsman. ISCA-35.