Chapter 9
Implementation of HLS in a General-Purpose OS

This chapter describes the implementation of the hierarchical scheduler infrastructure in the Windows 2000 kernel. Also, the implementation of four loadable schedulers is described: a time-sharing scheduler that can also act as a fixed-priority scheduler, a proportional share scheduler, a reservation scheduler, and a join scheduler.

9.1 Background and Implementation-Level Design Decisions

9.1.1 Windows 2000 Background

It is necessary to present some background about the structure of Windows 2000 before going into detail about the execution environment for loadable schedulers.1 Windows 2000, like its predecessor Windows NT, is a preemptible multiprocessor operating system. This means that multiple application threads can be executing in the kernel concurrently. Unlike traditional Unix-like operating systems, threads executing in the kernel can be preempted by other threads. Therefore, thread dispatch latency--the time between when a thread becomes ready and when it first begins to execute--is usually very low for high-priority threads. In contrast, in a non-preemptible operating system a high-priority thread may awaken while a low-priority thread is executing a slow system call. This leads to a priority inversion (a low-priority task preventing a high-priority task from running) until the system call finishes.

Each processor in Windows 2000 has an associated interrupt request level (IRQL) that determines what events may preempt the activity running on that processor. Normal user- and kernel-level threads (of any priority) run at passive IRQL; they may be preempted by any hardware or software interrupt handler. Each hardware interrupt handler executes at a specific IRQL. To protect against being preempted by an interrupt, it is sufficient for a block of code to raise the processor IRQL to the IRQL of the corresponding interrupt. This is because an interrupt is signaled on a processor only when the processor’s IRQL is lower than the IRQL of the interrupt.

The Windows 2000 scheduler executes at dispatch IRQL, in the context of a software interrupt called the dispatch interrupt. All interrupt service routines for hardware devices have higher IRQLs than the dispatch IRQL, which is higher than the passive IRQL. Typical use of the dispatch interrupt would be as follows: a keyboard interrupt handler awakens a thread that was blocked waiting for keyboard input. As part of the process of awakening the thread, the kernel requests a dispatch interrupt. Because the IRQL of the keyboard interrupt handler is higher than the dispatch IRQL, the dispatch interrupt is deferred until the keyboard interrupt handler exits, restoring the processor IRQL to its previous value. As soon as the IRQL is below the dispatch IRQL, the dispatch interrupt is signaled and the Windows 2000 scheduler potentially dispatches the newly awakened thread.

In addition to running scheduler code, the dispatch interrupt handler has another function: running deferred procedure calls (DPCs). DPCs were designed to give device drivers access to high-priority asynchronous processing outside of the context of hardware interrupt handlers. Because code running at dispatch IRQL or higher cannot be preempted by the scheduler, it is essential that code spend as little time as possible running with elevated IRQL.

Mutual exclusion in Windows 2000 is provided by both blocking locks and spinlocks. Blocking locks are suitable for protecting resources that may be busy for a long period of time. Code that executes at dispatch IRQL or higher may not block: this leads to deadlock since the scheduler, which needs to select the next thread, cannot be entered until the IRQL falls below the dispatch IRQL. Spinlocks always run at dispatch IRQL or higher.2

All Windows 2000 scheduler data structures are protected by a spinlock called the dispatcher database lock. Therefore, access to the scheduler is serialized across all processors. Contention for the dispatcher database lock is a potential cause of scalability problems on large multiprocessors that run scheduler-intensive workloads--other multiprocessor operating systems such as IRIX and Solaris have per-processor scheduler data structures, requiring less synchronization between processors.

9.1.2 The Loadable Scheduler Execution Environment

Hierarchical schedulers are implemented as loadable device drivers. In Windows 2000, a subset of the internal kernel APIs are available to loadable drivers, but by default this subset is not powerful enough to allow drivers to act as schedulers. The hierarchical scheduler infrastructure exports the additional kernel entry points that are necessary for loadable modules to act as schedulers.

9.1.2.1 Serialization

The scheduling hierarchy, like the native Windows 2000 scheduler, is protected by the dispatcher database lock. Thus, the scalability of HLS is only worse than the scalability of the native Windows 2000 scheduler to the extent that hierarchical schedulers are less efficient. Removing the dispatcher database lock in order to provide per-processor scheduling data structures in Windows 2000 would be a major undertaking, and is beyond the scope of this work.

9.1.2.2 Blocking

Since it executes at dispatch IRQL, loadable scheduler code must not incur a page fault or block for any other reason. In general, it does not make sense for a scheduler to block, since schedulers do not encapsulate a control flow in the same way that threads do. Rather, schedulers are coroutine-based entities that execute in a restricted environment. When a user-level thread sends a message to a scheduler, the scheduler infrastructure helps schedulers avoid blocking by copying data from the process address space (potentially incurring a page fault) into non-pageable memory while still executing in the context of the user thread. Then, the dispatcher database lock is acquired and schedulers may access both their own data structures and data copied from the requesting thread without risk of blocking.

9.1.2.3 Memory Allocation

To avoid blocking, all memory that is reachable from a loadable scheduler must be non-pageable (that is, pinned to physical memory pages). Unfortunately, Windows 2000 makes it impossible to dynamically allocate non-pageable memory while the kernel is executing at dispatch IRQL because the memory allocation routine itself may block when there is a shortage of free memory pages. HLS works around this difficulty by allocating a block of memory at boot time and sub-allocating it using its own versions of malloc() and free(). Schedulers are assumed to not require more dynamic memory than HLS reserved at boot time.

9.2 Implementation of the Hierarchical Scheduler Infrastructure

At a high level, to ensure proper operation of loadable schedulers it is necessary for the HSI (1) to be notified of each relevant OS event such as thread blocking, unblocking, creation, and deletion, (2) to be able to make scheduling decisions (i.e., to dispatch threads), and (3) to defeat the native Windows 2000 scheduler, preventing it from ever making a scheduling decision.

9.2.1 Simplifying Thread States

There are seven possible thread states in Windows 2000, in addition to an anomalous condition where a thread is in the ready state but is not able to run. These states can be divided into two groups: states that concern the CPU scheduler, and states that serve some other purpose. HLS only allows hierarchical schedulers to see the states that could affect a thread scheduler. Table 9.1shows the mapping from Windows 2000 states to HLS states. Transitions between Windows 2000 states that HLS treats as equivalent do not result in notifications being sent to the scheduler infrastructure. Transitions between states that are not equivalent must result in notifications being sent.




Windows 2000 state HLS state


Initialized
Terminated no equivalent


Waiting
Transition Waiting
Ready (on process ready queue)


Ready (not on process ready queue) Ready


Standby
Running Running


Table 9.1: Mapping of Windows 2000 thread states to HLS virtual processor states

Initialized threads have some of their data structures allocated but are not yet ready to run. Similarly, terminated threads will execute no more instructions, but have not yet been deallocated. The HSI is designed such that hierarchical schedulers never see threads in either of these two states.

The waiting state has the same meaning in Windows 2000 as it does in HLS: threads in that state cannot execute until some external condition is satisfied. When Windows 2000 is under memory pressure, it can reclaim memory in a number of ways. One of them is to swap out the kernel stacks of blocked threads, and another is to swap out entire processes; this can be done when all threads belonging to a process are blocked. Threads whose kernel stacks have been swapped out are put into the transition state while they wait for the stack to be paged back into memory. This special case of the waiting state is irrelevant to HLS and it considers threads to be waiting while they are in transition.

Similarly, when a thread is awakened and its process is swapped out, the thread moves into the ready state and it is marked as being on the process ready queue. This condition is misleading as the thread in not actually ready to execute: it must wait for its process to be swapped back into memory before it is taken off of the process ready queue and made runnable. In other words, a ready thread that is on the process ready queue is, for all practical purposes, waiting--it is treated as such by HLS.

The running state has the same meaning in Windows 2000 and HLS: it means that a thread is executing on a physical processor. The Windows 2000 state standby denotes a thread that is “on deck,” or about to execute. The standby state is used by Windows 2000 as follows. Assume that a high-priority thread running on processor 0 exits a critical section, allowing a low-priority thread to wake up and enter the protected region. The low-priority thread cannot preempt the high priority thread, but it can be dispatched on a different processor. In this case, Windows 2000 marks the low-priority thread as being on standby on another processor and sends the processor an interprocessor interrupt. When the IPI is received, the other processor enters low-level dispatching code that sees the standby thread and dispatches it. HLS insulates schedulers from low-level details like this, and considers a thread to be running as soon as it enters the Windows 2000 standby state.

9.2.2 Suppressing the Native Windows 2000 Scheduler

Although schedulers in the hierarchical scheduler architecture subsume the functionality of the native Windows 2000 time-sharing scheduler, the Windows 2000 scheduling code still executes in an HLS system--it was not eliminated since the prototype HLS implementation attempts to be as unintrusive as possible. Since the Windows 2000 scheduler is never permitted to make a scheduling decision it is mostly vestigial, but it still performs a few useful functions. Windows 2000 contains a number of specialized scheduling heuristics such as raising the priority of threads after they unblock, boosting the priorities of threads that have been runnable but not scheduled for several seconds, and extending the length of scheduling quanta of threads in the foreground process. The latter heuristic is not carried over from the native Windows 2000 scheduler to HLS (although it could be implemented separately in a hierarchical scheduler). The first two heuristics, however, use an internal kernel interface to raise the priority of a thread. Their benefits are preserved in HLS by having bottom schedulers convert a request to set thread priority into an HLS message and sending the message to its parent. If the parent scheduler is not capable of acting on this message (for example, if it is a reservation scheduler), the message is ignored. Otherwise, the priority boost and subsequent decay happen in HLS just as they did under the native Windows 2000 scheduler. The end result of this is that the default HLS scheduler, a priority and round-robin scheduler described in Section 9.3.1, behaves very much like the native Windows 2000 scheduler.

9.2.3 Allowing the HSI to Schedule Threads

When a hierarchical scheduler grants a processor to a bottom scheduler, the thread that the bottom scheduler represents can usually be immediately dispatched. To do this, the scheduler sets the thread’s state to standby on the processor that was granted and then requests a dispatch interrupt for that processor. The dispatch interrupt handler then moves the thread from standby to running and switches to that thread’s context.

This simple method for scheduling threads does not work when the thread to be scheduled is already running on a processor other than its intended target. In this case, the HSI requests dispatch interrupts on both the current and target processors. Since the dispatch interrupt handler acquires the dispatcher database lock, the handlers are serialized. If the processor that is currently running the thread runs first, then it can deschedule the thread, allowing the other processor to schedule the thread normally. However, there is a complication if the target processor enters the dispatch interrupt handler first--it cannot schedule the thread until the other processor has preempted it and saved its state. To deal with this case, the HSI forces the target processor to idle until the thread’s current processor moves the thread from the running to the ready state, at which point the target processor can put the thread back into the running state and dispatch it.

The reality is considerably more complicated than this two-case analysis, since in between a processor idling and picking up the thread, a scheduler could decide to schedule that thread on a completely different processor, the thread could exit, it could be requested to be moved to a different scheduler, etc. The HSI hides this complexity from loadable schedulers.

9.2.4 Interposing on Windows 2000 Scheduling Events

Since Windows 2000 was not designed to support loadable schedulers, suppressing its scheduling decisions and implementing new scheduling behavior required inspecting several dozen kernel source files that have to do with thread management, and changing some of them. Table 9.2 shows the number of non-trivial modifications to Windows 2000 that were required to notify the HSI of all events of interest, allow it to make scheduling decisions, and to suppress the native scheduler.




Functionality Modified code sites


Initialize HSI 1
Notify HSI of blocking thread 7
Notify HSI of unblocking thread 8
Notify HSI of thread create 3
Notify HSI of thread exit 1
Notify HSI of change in thread priority 4
Suppress native Windows 2000 scheduler 11
Other 2


Table 9.2: Summary of changes to Windows 2000 to support hierarchical scheduling

Early in the loadable scheduler implementation, a passive version of these changes was implemented, whose purpose was to “shadow” the state changes of Windows 2000 threads. In other words, the HSI never made a scheduling decision, but rather, it kept an up-to-date version of the state of each thread in the system, and signaled an error if this state ever disagreed with the state that Windows 2000 believed the thread to be in. After thoroughly inspecting the native Windows 2000 scheduling code and subjecting a system containing a passive version of the HSI to memory pressure and high load without observing any disagreements in thread states, it was assumed that these changes correctly notify the HSI of all events of interest.

The two modifications listed as “other” in Table 9.2 refer to extensive modifications to the scheduler routine KiReadyThread and to the idle loop.

9.2.5 Time and Timers for Loadable Schedulers

Schedulers need to be able to accurately determine the current time, and also to arrange to regain control of the CPU at a specific time in the future.

Reading the current time is cheap and accurate on Pentium-class x86 processors; the rdtsc instruction returns the number of cycles since the machine was turned on.3 One division and one addition are required to turn this into a Windows 2000 filesystem time (64-bit integers in 100 ns units representing either a time relative to the present or relative to Jan 1, 1601).

The HSI uses Windows kernel timers to get control of a CPU at a particular time. Kernel timers call a user-supplied routine in the context of a DPC. DPCs, like other scheduler operations, run at dispatch IRQL. When the HSI receives a timer callback, it acquires the dispatcher database lock before calling the TimerCallback function of the scheduler whose timer expired. Instead of using a single kernel timer and dispatching expirations to the correct scheduler, the HSI simply allocates a separate kernel timer for each loadable scheduler instance.

The precision of kernel timers is limited by the granularity of the clock interrupts on a particular system. By default, clock interrupts arrive every 10-15 ms, depending on which hardware abstraction layer (HAL) is being used. This granularity is too coarse for some multimedia applications. On some HALs, Windows 2000 allows the timer interrupt frequency to be changed at run-time. The HSI arranges at boot time for the timer frequency to be 1024 Hz, the highest frequency supported by the released Windows 2000 distribution. By modifying the HAL used for multiprocessor PCs, HLS is able to drive the real-time clock (RTC)--hardware device that this HAL uses to generate clock interrupts--at up to 8192 Hz. For the work reported in Chapter 11 it was run at 4096 Hz in order to make timer expirations precise to within about 0.25 ms. This is a fair approximation of a precisely settable timer--a feature that would be found in any OS specifically designed to support dynamic real-time scheduling.

The presence of a periodic clock interrupt instead of a precisely settable timer in Windows 2000 is typical of time-sharing operating systems. Periodic clocks are used in these OSs because (1) GPOSs were not designed to support real-time applications requiring precise timing facilities, and (2) limiting the clock granularity leads to the batching up of timed events, which tends to increase system throughput by limiting the number of times the running application is preempted and has the cache polluted.

9.3 Schedulers Implemented

Schedulers implemented for HLS include a time-sharing scheduler, a fixed-priority scheduler, a proportional share scheduler, and a real-time scheduler that provides basic, hard CPU reservations. Each of these schedulers was chosen because it is representative of a broad class of schedulers that have been used to schedule multimedia applications. Using these schedulers as basic building blocks, complex scheduling behaviors can be built, as Section 6.2 and Chapter 7 showed.

9.3.1 Time Sharing

The HLS time-sharing scheduler is a multiprocessor priority and round-robin based scheduler that is very much like the native Windows 2000 scheduler. It has several capabilities that the native scheduler does not, however. First, it is a hierarchical scheduler, meaning that it correctly handles processor revocation. Second, it can be configured to schedule any number of processors, not just the number of processors that are physically present on the system. And third, in addition to being a round-robin scheduler it can be configured to act as a non-preemptive fixed-priority scheduler.

There are also several features not supported by the HLS time-sharing scheduler that are supported by the native Windows 2000 scheduler. First, the HLS time-sharing scheduler does not support variable sized scheduling quanta, except at compile time. Second, it makes no attempt implement processor affinity, to keep threads running on the processor that they had previously been running on. Third, it does not support affinity masks that restrict a thread, or all threads in a process, to run on a subset of the full set of processors. Support for affinity masks could be added to the HLS time-sharing scheduler in a straightforward way, but we did not do this because standard time-sharing and multimedia applications do not appear to make use of them.

A final difference between the two schedulers is that the HLS time-sharing scheduler maintains the invariant that a thread will never be in the ready state while a lower priority thread is in the running state. On a multiprocessor machine the native Windows 2000 scheduler does not make this guarantee: to a certain extent it elevates processor affinity above the requirement to always run the set of tasks that have the highest priorities.

9.3.2 Rez

Rez is a scheduler that provides basic, hard CPU reservations. The algorithm it uses is similar to a number of other reservation schedulers that have been described in the literature such as the constant utilization server developed by Deng et al.  [18], the constant bandwidth server that Abeni and Buttazzo developed  [2], and the Atropos scheduler developed for the Nemesis OS  [49].

Rez assigns a budget to each thread that has a CPU reservation. Budgets are decremented in proportion to the CPU time allocated to the associated thread, and are replenished at the beginning of each period. Rez always schedules the thread that has the earliest deadline among all threads that are runnable and have a positive budget. The deadline for each thread is always taken to be the end of its current period.

Rez is a uniprocessor scheduler in the sense that it can only schedule one CPU worth of reservations, but it can be run on multiprocessors by instantiating an instance for each processor.

9.3.3 Proportional Share

The hierarchical proportional share (PS) scheduler implements the start time fair queuing (SFQ) algorithm, with a warp extension similar to the one implemented in BVT  [20]. However, the PS scheduler does not provide support for the warp time limit and unwarp time requirement that BVT supported. Each thread is assigned a warp value (defaulting to zero) that allows threads to borrow against their future processor allocation, providing a mechanism for low dispatch latency for time-sensitive threads.

The PS scheduler can also be compiled without support for warp, making it amenable to the throughput and latency bounds for SFQ that were described by Goyal et al.  [28]. Each SFQ thread has an associated virtual time that increases in proportion to the time the thread runs and in inverse proportion to the thread’s share. Threads that block, upon awakening, are forced to “catch up” with the virtual time of the currently running thread, meaning that blocked threads do not build up credit. The PS scheduler always dispatches the thread with the smallest virtual time among all runnable threads.

The proportional share scheduler currently interprets thread priorities as proportions. In other words, an application running at the default priority (8) would receive eight times the amount of processor time received by a thread running at the idle priority (1). This was an expedient way to make use of the existing infrastructure for manipulating priorities, and to provide applications that are not aware of the PS scheduler with scheduling behavior that approximates what they would have received under the native Windows 2000 scheduler.

9.3.4 Join

The function of most schedulers is to multiplex a physical processor, or part of a physical processor, among many virtual processors. Join schedulers have the opposite purpose: to schedule a single virtual processor whenever any one of its parents schedules it.

On a multiprocessor, there is the possibility for multiple parents to grant physical processors to a join scheduler at the same time. To handle this case, the join scheduler must have a policy for deciding which processor to use. A simple priority scheme is likely to be effective in most cases, and is the policy implemented in the HLS join scheduler. To implement the priority scheme, the join scheduler numbers its parents and always accepts scheduling from the highest-numbered parent that will grant a processor to it.

9.3.5 Top

The top scheduler allows only as many virtual processors to register for scheduling as there are physical processors in the system. It always immediately grants each request made to it, and furthermore, always grants the same physical processor to each virtual processor (unless the VP unregisters and re-registers).

9.3.6 Bottom

The bottom scheduler is responsible for actually dispatching threads and for converting thread events into HLS notifications. When a thread event occurs, it is desirable to quickly find the associated bottom scheduler instance. To this end a pointer to the bottom scheduler instance was added to the ETHREAD structure (the “executive thread block” that Solomon and Russinovich  [77, pp. 317-327] discuss). Adding fields to kernel internal data structures in Windows 2000 is risky because some structures are referenced from assembly language routines that use hard-coded offsets to locate particular structure elements. However, adding a pointer to the end of the executive thread block did not appear to cause any problems. The executive thread block is the only Windows 2000 kernel data structure that was modified while implementing HLS.

9.4 HLS Implementation Experience

9.4.1 A Simulation Environment for HLS

The hierarchical schedulers and infrastructure can be compiled to run either in the Windows 2000 kernel or in a user-level application as part of an event-driven simulation. The simulation environment models just enough of the Windows 2000 kernel environment for the schedulers to run. It models threads as programmable state machines, allowing them to perform arbitrary scheduling actions while the simulation is running. For a number of reasons, the simulation was an invaluable tool during the development of HLS.

The disadvantage of the simulation environment was the time it took to implement it. However, this was far outweighed by the advantages. In particular, randomized testing appeared to be very good at finding corner-cases that were difficult to think of while implementing the code--the amount of state in a hierarchy containing several schedulers plus the machine state and HSI state made the behavior of the entire system too complicated to easily reason about. In general, either randomized testing found a bug within a few minutes, or failed to find any bugs at all (which meant not that HLS was bug free, but that it happened to work under a particular simulated workload).

9.4.2 Code Size Results

In addition to the modifications to Windows 2000 listed in Table 9.2 , implementing the HSI required adding about 3100 lines of code to the Windows 2000 kernel. The line counts for the schedulers are as follows: 1176 for the time-sharing scheduler, 1058 for Rez, 765 for the proportional share scheduler, 477 for the join scheduler, 218 for the top scheduler, and 392 for the bottom scheduler. These line counts include a substantial amount of comments and debugging code. For example, comment lines, blank lines, and debugging code accounted for 413 lines of the proportional share scheduler; after removing them only 352 lines remained.

9.4.3 Developer Effort Results

This section presents some anecdotal evidence about the effort required to implement various loadable schedulers. The top and bottom schedulers are part of the scheduler infrastructure, and the time-sharing scheduler was debugged concurrently with the HSI, so the time taken to develop these is not relevant. It took about a day of coding to write a version of Rez that provided one CPU reservation, and another day to extend it with EDF scheduling to handle an arbitrary number of reservations. Given Rez as a working model of a time-based, root-only uniprocessor scheduler, implementing a root-only version of the proportional share scheduler took only a few hours; a few more hours were required to extend it to correctly handle processor revocation. Writing the join scheduler, in addition to some glue code that provides soft CPU reservations using the join scheduler in conjunction with Rez, took about five hours. Finally, the simulation environment was nearly always sufficient to debug loadable schedulers--additional problems were only rarely uncovered while running them on a real machine.

9.4.4 Lessons Learned

The feasibility of implementing the HSI in operating systems other than Windows 2000 has not been investigated in detail. However, given the similarities among thread abstractions and time-sharing schedulers in general-purpose operating systems, it seems likely that the hierarchical scheduler architecture could have been developed in a multiprocessor operating system such as Linux or FreeBSD with about the same effort as it took to implement it in Windows 2000. Furthermore, since the HSI and loadable schedulers are now debugged and known to work, the HSI could most likely be ported to one of these operating systems relatively easily. Porting the HSI to an OS such as Solaris or IRIX that supports per-processor scheduling queues may be considerably more difficult, since the scheduling model in these OSs does not closely correspond to the HLS model.

The scheduling structure of a typical general-purpose OS is a cross-cutting aspect--a feature of the code that does not happen to be cleanly separated from unrelated code because it was not designed to be changed  [43]. Implementing HLS can be viewed as untangling the scheduling aspect from other low-level kernel code.

9.5 Conclusion

This chapter has described the implementation of the hierarchical scheduler infrastructure and several hierarchical schedulers in Windows 2000. The important property of the HSI is that it makes it easier to implement new schedulers by exposing useful scheduling primitives while hiding irrelevant details about the operating system. The schedulers that have been implemented for HLS--a fixed priority scheduler that also serves as a time-sharing scheduler, a proportional share scheduler, a reservation scheduler, and a join scheduler--are representative of the kinds of schedulers typically used to schedule time-sharing and soft real-time applications on general-purpose operating systems.