Replay interposition infrastructure

Replay interposition layer

Replay infrastructure: XenTT replay infrastructure, consists of four main logging components, and a high-bandwidth communication channel across them (the figure above):

  • Event interposition: The event interposition layer implements logging and replay of the low-level virtual machine interface exported by the Xen hypervisor. We design interposition primitives with a goal to introduce a minimal overhead on the critical execution path of the system. A lightweight logging operation requires to read the hardware state of the system and put a logging record in a lock-free, shared-memory buffer for asynchronous processing by the user-level logging daemon.
  • Logging and replay daemons: User-level logging and replay daemons process the log of recorded events committing it to a stable storage.
  • Device daemon: To log and replay communication of virtual devices we rely on the fact that all Xen devices use a uniform shared memory interface for connecting guest and host device drivers. The device daemon implements a general abstraction for interposing on communication of the shared memory producer-consumer buffers.
  • Replay coordination: Finally, replay coordination mechanisms ensure controlled execution of the guest system between a pair of nondeterministic events. These mechanisms include branch counting logic, single-step execution, replay of synchronous and asynchronous events, and CPU branch tracing.

Logging, replay, and device daemons are implemented as a single ttd-deviced application, but they are logically separate.

Source tree

 |  |->common/
 |       |->branch_counting.c           - branch counting logic (setup, reset, accounting)
 |       |->hw_branch_trace.c           - Support for hardware branch tracing (see Section 18.5 of Intel Development Manual)
 |       |->loggging-devd.c             - Xen-level support for logging devices from the user-level ttd-deviced
 |       |->loggging.c                  - logging primitives, communication with the logging daemon
 |       |->replay.c                    - replay primitives, controlled execution, communication with the replay daemon
 |       |->ring-channel.c              - fast ring-based communication channel for relaying recorded events to user-level daemons
 |       |->vmi.c                       - support for VMI probes -- determinism during replay
 |  |->/drivers/xen/ttd/
 |                   |->ttd-logging.c   - logging primitives for the dom0 kernel device payload
 |                   |->ttd-relay.c     - fast logging of device payload without leaving the dom0 kernel (straight into a file)
 |                   |->ttd-replay.c    - replay primitives for the dom0 kernel device payload 
 |  |->/drivers/xen/netback/*.c         - patches to netback: connect to ttd-deviced for the TT domains, payload logging
 |  |->/drivers/xen/blkback/*.c         - patches to blkback: connect to ttd-deviced for the TT domains, payload logging
 |  |->ttd/
 |      |->deviced/*                    - logging, replay, and device interposition
 |      |->ttd-read-log.c               - a tool for reading raw log files, and compare them
 |      |->tth-resolve-bts-dump.c       - a tool for reading raw BTS traces (Xen, and dom0 debugging)
 |  |->python/xen/*                     - patches for creating TT domains
 |  |->libxc/*                          - minor patches for some missing functions
 |  |->xenstore/*                       - patches to Xenstore (connect to ttd-deviced for the TT domains, transaction determinism)  
 |  |->console/*                        - patches to consoled (connect to ttd-deviced for the TT domains)

Implementing efficient replay

Efficient implementation of a deterministic replay relies on a key observation that execution of a system is largely deterministic, and is only occasionally altered by external non-deterministic events. A system can be viewed as a chain of deterministic executions connected by non-deterministic transitions. This is a powerful idea. Starting from the initial state, one can force the system to repeat its original execution by just replaying external events.

Two conditions should hold for the above approach to work. First, determinism of execution has to be preserved between non-deterministic events. Second, all nondeterministic input should be available for interposition during logging and replay. For a complex system these two conditions are complementary. If it is impossible to provide determinism of execution between non-deterministic events, a violation of determinism can be captured as an external non-deterministic event. Similar, if it is possible to change a replay boundary including or excluding parts of a system without breaking determinism, one can reduce the amount of non-determinism which has to be logged.

Identifying sources of nondeterminism: The implementation of any replay system requires identifying all sources of nondeterminism. A many-hour analysis is required to design a replay interface for a system with a highly optimized low-level interface that involves communication via shared memory, asynchronous interrupts, virtual memory management primitives, virtualiziation of time, and sensitive machine instructions.

We aid the manual analysis of nondeterminism with a run-time mechanism automatically informing us when an undetected nondeterministic event is delivered to a guest system. The main insight behind our approach is to treat the guest virtual machine as a black box of memory pages. All nondeterministic events are viewed as updates to this memory. During implementation, we rely on hardware memory protection mechanisms to detect updates to this memory. Our interposition primitives remove memory protection at the moment they log and deliver an event to the system.

Main sources of nondeterminism are...

Logging and replay infrastructure: A logging infrastructure is aimed to provide an efficient way of recording a large number of nondeterministic events to a permanent replay store. The performance of the logging mechanisms defines performance of the whole replay infrastructure. Two main challenges in implementing event logging are (1) overheads of accessing the system state, and (2) the need to relay large amounts of event data to a permanent store.

Every external event is described by a pair of event-specific data and a timestamp. Event data varies from a byte to several kilobytes. An event timestamp uniquely identifies the position in the execution of a program at which the event occurs. To implement the timestamp, we rely on the hardware branch-counting mechanisms provided by modern x86 CPUs. A number of taken branches and instruction pointer uniquely define the number of instructions executed by the system. We use this tuple as a timestamp.

Recording of the timestamp and saving an event to a permanent store should minimally disturb execution of a system. This is challenging. To read the timestamp, a logging mechanism must at least preempt execution of the guest system to save its instruction pointer. Events like interrupts or event channel notifications do not incur this preemption cost since a guest system is already preempted due to the virtualization of these events. However, for asynchronous updates of shared memory, the logging mechanisms should force preemption to record the state of a guest system. In case a shared memory update happens on a different CPU, recording incurs additional costs of an inter-processor interrupt.

During replay we rely on a combination of hardware branch counters and single-step execution to inject nondeterministic events into the execution of the system at the exact places they occurred during the original run. We configure branch counters to overflow and raise an interrupt several branches before the original event took place. This is done to address a hardware delay in receiving the interrupt. After that we navigate to the original place by configuring a guest system to run in a single instruction step mode.