OSDI '94 Summary Report from ;login


by Dave Mitchell
dwm@orca.com

The first OSDI Symposium was held at the Monterey Conference Center in Monterey, CA USA on November 14-17, 1994. This three-in-one symposium was sponsored jointly by USENIX, the ACM, and the IEEE Computer Society, replacing three earlier conferences on operating systems: the Mach Symposium, Microkernels and Other Architectures, and SEDMS.

The symposium was well attended, with 370 people registered. Monday consisted of six half-day tutorials, covering the architecture and internals of Spring, the GNU Hurd, and Chorus, as well as the use of Isis and Horus for distributed computing, the x-kernel for networking, and distributed shared memory for parallel computing. Out of 178 submissions, 21 papers were chosen to fill the bulk of the next three days of technical presentations. A workshop on Mach and Chorus was held the last afternoon, and a variety of BOF and works-in-progress sessions filled every spare moment in between. The conference was well-received and fast-paced, driven by the energy and enthusiasm of Jay Lepreau, the program chair from the U. of Utah.

Tuesday, November 15

David Patterson (UC Berkeley) gave a keynote address that was full of good advice and hilarious in its sarcastic, backward approach: "How to Have a Bad Career in Research/Academia". He ran through a list of bad career moves. The first advocated being THE leading expert in a field by inventing a new one, whose payoff is at least 20 years away so you can use it to support most of your career. This allows you to avoid working with others so you won't have to share the credit - a prima donna personality is helpful. For an example, Patterson offered up the insight that computer security is increasing in importance, and capability based systems got the focus nearly right; a useful new idea would be to create a list of all the things a process CANNOT do.

The key initial research question of disability-based systems would of course be whether you should store disabilities with each user, or with the objects that they can't access. Encrypted disabilities and disability-based addressing were further fertile topics. Another area ripe for investigation in the OS arena was "Omni-Femtokernels" -- omnipresent in everything from VCRs to automobiles, and femto, as in a micro-kernel only more so. Its key contribution would be to provide employment to OS researchers when everybody is running DOS.

Patterson's next bad tip was to let complexity be your guide. This will confuse your competitors, make it easier for you to claim credit for subsequent good ideas, and easier to publish the N+1st incremental change, since nobody else will understand your points well enough to criticize them. Further advice for a bad career included several ways to avoid being proven wrong: uphold the OS tradition of promising "it will be working soon" to avoid actually implementing, avoid benchmarks and quantitative experiments in favor of hand-waving intuitive positions, and choose projects with payoffs of more than 20 years, since that will give you 19 safe years.

This advice continued on, noting that avoiding feedback is a great way to keep from being distracted by others, and stating that publishing papers is technology transfer. Since publishing is the key to success, legally changing your name to Aaaanderson will keep you at the top of everybody's reference list, and the LPU (least publishable unit) is your friend: one good idea spawns four journal papers, which beget 16 extended abstracts, and then 64 tech reports. The first half of the keynote finished by going over some tips for bad writing, giving bad talks, and making bad slides (this last section was cause for more than a little giggling in the audience during later presentations).

For the second part of the keynote, Patterson outlined what he recommended as a positive approach. His primary goal is to aim to have an impact, and having many short projects gives more chances for impact, as does choosing to focus on "real stuff", or problems that somebody cares about. Getting feedback is key; seeking critics provides a valuable reality check. Technology transfer requires convincing others, and the missionary work required extends far beyond merely publishing, since industry is reluctant to change and you need one bold company to both take a chance on your idea \fIand\fP be successful. This might seem like a large portion of motherhood and apple pie, but his plug for quantitative scientific methods instead of varying everything at once got a definite and positive response from the audience.

Carl A. Waldspurger gave the first talk, "Lottery Scheduling: Flexible Proportional-Share Resource Management," which was awarded Best Paper by the program committee, and also prompted a large amount of hallway conversation. This work was motivated by the desire to provide flexible, responsive control over service rates in multithreaded systems, so that quality of service guarantees could be provided to applications ranging from long-running background computations through interactive applications to networked multimedia displays requiring timely response. Most schedulers in use today rely on the notion of priority, either usage-decayed or fixed. The assignment of these priorities and adjustment policies are poorly understood and largely ad hoc.

Existing fair share schedulers are typically coarse-grained and attempt to mediate resource usage among groups of processes. They fail to provide the dynamic, rapid control of scheduling decisions, at a time scale of fractions of a second, that is needed to manage a range of applications with conflicting requirements. Lottery scheduling is a randomized mechanism that implements proportional share resource management, where the consumption rates of active processes are proportional to the relative number of shares they are allocated. At the time of each scheduling decision, a random number within the range of allocated tickets is chosen, and the holder of the corresponding ticket is run. Thus the probability that a given task will be scheduled is proportional to the number of tickets it holds, and equal to that number divided by the total number of active tickets. Any task with a non-zero number of tickets will eventually win a lottery, so starvation isn't a problem.

Since the probabilistic aspects of random lotteries are well understood, it is easy to reason about the behaviors of this method of scheduling. The paper discusses the details of implementation, the room remaining for optimization, and briefly discusses applying this technique to non-CPU resources such as mutex locking. These lottery tickets can also be temporarily transferred during various blocking operations, allowing servers to provide good response without having any tickets of their own. This method also prevents priority inversion problems by increasing the probability of running a piece of code holding a mutex with pending waiters. Somebody with a real-time background raised the issue of the lack of a guaranteed bound on scheduling variance in the face of hard real-time requirements; this was countered with a response considered heretical by hard-real-time folks, "CPUs are getting faster, we can decrease the scheduling quantum".

The second talk, by Brent Welch (Xerox PARC) was "Scheduling for Reduced CPU Energy". He discussed ways to decrease the amount of energy used by battery-operated computers which go beyond simply spinning the disk down during idle periods. That method is necessarily coarse grained, given the spin-up latencies of disks. The energy consumption of a chip is proportional to the square of the voltage used. Chips can often be made to run at lower voltages, but this increases settling time, requiring slower clocking. The potential benefits come from noticing that, with a non-zero amount of idle time and perfect knowledge of the future, energy could be saved by turning the voltage and clock down so that the computation was stretched out to fill the time which would otherwise have been idle, running the same number of useful cycles, but at a lower total energy cost due to the non-linear relationship between CPU speed and power consumption. A number of algorithms were examined and compared to the best savings possible given perfect knowledge of the future.

Fred Douglis gave the last talk of the morning, "Storage Alternatives for Mobile Computers". His group used hardware measurements and trace-driven simulation to survey the tradeoffs between magnetic disk, flash memory cards, and flash disk (flash memory with a disk-like interface). Disk consumes an order of magnitude more energy even with active power management, while flash memory (with either interface) offers low energy consumption, good read performance, and acceptable write performance. The surprise in this paper was learning just how much worse flash memory performs at higher levels of utilization, where the need for erasure drastically impacts both write throughput and energy consumption.

After lunch, the first three talks concerned file systems. Liuba Shrira (MIT) started off the session by presenting "Opportunistic Log: Efficient Installation Reads in a Reliable Storage Server". In a distributed store, client caches can out-perform page-based caches when organized as caches of objects smaller than a page in size. The problem with this approach is that object servers must then perform additional disk reads to install modified objects into their containing pages. The opportunistic log is essentially a large object cache that reduces disk traffic by removing the installation reads from the commit path, and scheduling bulk updates more efficiently (further object fetches hit in the modified object cache). Opportunistic logging allows an object server to outperform a page server while using a much smaller cache than would otherwise be necessary.

Gregory R. Ganger (U. of Michigan) presented "Metadata Update Performance in File Systems". This focus of this paper was on solving the problems caused by the typical Unix practice of maintaining file system integrity by performing metadata updates synchronously. Conversion to a log-based file system is one option, while other systems use asynchronous writes and pass the write ordering requirements down to the disk scheduler. This latter group can outperform the more conventional approach by a third, but are still suboptimal since they can't safely use delayed writes when ordering is required. The new approach outlined is called soft updating, and performs nearly as well as a memory-based file system. This scheme tries to always use delayed writes, allowing multiple writes to be buffered and coalesced. This, however, requires that information on the inter-operation dependencies be kept; since there may be multiple dependencies within a single disk block, dependency information is kept around for each metadata update, not just for each block in the disk buffer. This information makes it possible to write a block back to disk at any time, by undoing (rolling back) any updates within that block that still have pending dependencies. This makes the block written to disk consistent with respect to the current on-disk data. When the write is completed, any undone updates are re-established, and the in-memory block can once again be accessed. This paper and the next received Honorable Mentions from the Program Committee.

The last file system talk was by David Kotz (Dartmouth College), who observes that even multiprocessor systems with large numbers of I/O processors often fail to provide the available hardware bandwidth to the application, which sits on top of the usual set of read/write/seek primitives organized around a coordinated file pointer. Parallel scientific applications are particularly problematic, since they tend towards complex strided access to discontiguous pieces of files, a very poor match for distributed buffer caches. Kotz's solution is to provide higher-level interfaces to the file system, passing the entire collective request down to the I/O processors, each of which can then optimize the handling of its portion of the operation. This scheme of "Disk-directed I/O for MIMD Multiprocessors" optimizes disk layout and scheduling, avoids problems of buffer management, and removes the guessing from prefetching and write-behind. This solution scales with the number of disks (modulo bus bandwidth) and is independent of the number of CPUs.

Tuesday's last set of talks centered on distributed shared memory (DSM). Robert J. Fowler (U. of Copenhagen) presented "Message-Driven Relaxed Consistency in a Software DSM". They have built a software DSM machine, but with relaxed, "aggressively lazy" consistency. Memory state is made coherent only when messages containing explicit annotations are exchanged. This method combines data transfer and synchronization, and helps compensate for any mismatch between the hardware-imposed granule for access detection and the optimal size from the point of view of the abstractions of the application. Further benefit accrues from lazy evaluation, since only operations that explicitly require consistency incur the overhead to maintain it.

"Software Write Detection for DSM" was presented by Matthew J. Zekauskas (Carnegie Mellon U.). Most implementations of DSM trap accesses and extend the virtual memory software to maintain consistency between competing application threads. The access faults used by these traditional methods are quite expensive, requiring many accesses to amortize the overhead. False sharing, when many objects fit into a single, large-grained page, also reduces efficiency. The approach taken here is to use the compiler and run-time support to detect and collect writes to shared data, and to trigger the consistency mechanisms. The benefits are keeping the operating system uninvolved in consistency operations, eliminating false sharing while directly supporting variable sized objects, and providing a detailed update history, thus minimizing the amount of data that needs to be transferred to maintain a consistent view of memory.

"The Design and Evaluation of a Shared Object System for Distributed Memory Machines", presented by Daniel J. Scales (Stanford U.), outlines another software implementation of DSM built upon message passing. This method, like that of the previous talk, lets synchronization piggy-back on data messages; in addition, the user has to explicitly tag shared data as one of two types. Accumulators allow fine grained updates with built in mutual exclusion; single-assignment variables cause potentially many readers to block until the single writer has supplied the value, supporting producer/consumer relationships. This portable run-time system is claimed to improve communication efficiency in computations with complex access patterns, while providing a simple and easily-understood data sharing model.

WEDNESDAY, NOVEMBER 16

The first talk on Wednesday was "PATHFINDER: A Pattern-Based Packet Classifier", delivered by Michael A. Pagels (U. of Arizona). Packet filters are code fragments that select packets of interest to a particular application, based on the contents of selected fields in the packet header. This project built a pattern-based mechanism which, given the pattern description, results in a decision tree able to classify incoming packets and tag them, i.e., to identify the path along which the packet will be passed. Patterns are described by tuples consisting of an offset (within the packet header), the length of the packet datum, a mask to be applied to the datum, and the value needed to produce a match. This mechanism was built in both software and hardware. The software version proved to be roughly twice as fast as the fastest (Mach) packet filter measured. The first FPGA hardware prototype, despite limited pattern store, can sustain 622Mbps speeds and can be expected to sustain gigabit speeds using existing technology.

Erich M. Nahum (U. of Massachusetts) talked about "Performance Issues in Parallelized Network Protocols", using a parallel version of the x-kernel running in user space to examine the effects of choice of locking strategy on performance. His group found that a coarse-grained locking approach got better performance than finer-grained locking. Further, since packet ordering significantly impacts performance, where lock contention perturbed the ordering a simple FIFO queueing lock improved throughput substantially. They found that single-connection TCP parallelism was limited by the need to lock the connection state; while proposing the use of multiple connections to gain back the lost parallelism, they noted that this implied punting the responsibility for managing ordering among connections up to the application. Support for atomic increment (in lieu of lock-increment-unlock sequences) was shown to gain 10-20%. Finally, they got some speedup from keeping a small cache of processor-local data structures that are frequently handled by the x-kernel.

Ronald C. Unrau (U. of Toronto) described his group's "Experiences with Locking in a NUMA Multiprocessor Operating System Kernel". They have arrived at a hybrid locking scheme, using both fine and coarse grained locking in an attempt to get the highly concurrent behavior of the first with the low latency and space overhead of the latter. This approach uses a coarse lock to protect several data structures, which can only be held for short periods of time. Finer grained locks protect individual structures, and are held for longer periods, but taken under the coarse lock. This approach complicates the locking protocols, but reduces second order effects such as memory and bus contention, while exponential backoff mitigates so-called thundering herd effects. A technique called hierarchical clustering replicates data that is primarily read-shared to bound the contention on shared structures by constraining the number of CPUs that can attempt access.

Chao-Hsien Lee (National Chiao Tung U.) presented "HiPEC: High Performance External Virtual Memory Caching", talking about the work his group has done to allow each application customized control over the page replacement policy underlying its virtual memory support, in order to reduce thrashing. An application programs its own custom policy in the HiPEC command set, keeping the resulting table in user space. When a page fault occurs, the kernel interprets the application's policy and resolves the fault in accordance with that policy. Applications using this mechanism also have a private page frame list, to avoid interference between concurrent applications using differing policies. The actual policy information is interpreted from within the kernel to avoid the added domain crossing overhead which would have resulted from passing control out to the application on each fault. Their results showed little added overhead and the possibility to significantly improve performance for memory-intensive applications with well-understood access patterns.

Pei Cao (Princeton U.) spoke about "Implementation and Performance of Application-Controlled File Caching", which, like the previous talk, allowed user-space applications to gain control over the replacement policy used in maintaining their (process-private) file buffer cache. Their approach applies the notion of working set to file caching. A two-level mechanism first associates a priority each file, and then associates a policy with each priority. The kernel allocates cache blocks to processes; within a process, the lowest priority set of blocks is chosen for replacement, and the policy chooses blocks within that set. There are also mechanisms to temporarily change the priority of a range of blocks within a file, and to ensure that foolish or malicious processes can not hurt other processes. As an example, this scheme allows a database to set its index files to be a higher priority than the data files, keeping them around for future re-use, while selecting whatever policy is appropriate for each set.

David R. Cheriton (Stanford U.) presented "A Caching Model of Operating System Kernel Functionality", which is the supervisor-mode component of the V++ operating system. This effort seems to be another attempt to put the "micro" back in micro-kernel. The approach taken is to build a minimal layer which caches threads and address spaces the way that hardware caches the contents of memory. User-mode application kernels (read: servers) handle the loading and storing of these objects, and implement application-specific management policies and mechanisms (read: OS personalities). As with previous micro-kernel projects, the result is promised to be faster, smaller, more scalable, modular and robust than all prior efforts. Good use is made of "address-valued signals", similar to Unix signals but with a pointer-sized value, which provides asynchronous messaging built on top of shared memory.

The last slice of Wednesday afternoon was saved for a panel discussion, whose topic was "Radical OS Structures for Extensibility". Five panelists spent a few minutes advocating their own approach to building operating systems. Brian Bershad (U. of Washington) talked about Spin, Larry Peterson (U. of Arizona) touted Scout, and Frans Kaashoek (MIT) peddled Aegis. David Cheriton advocated the caching kernel approach of V++, while Steve Lucco (CMU) plugged software fault isolation. The jokes were first-rate, and it seemed that the panelists put as much effort into bashing the competition as advocating their own work. The panel was quite enjoyable, and if DOS finally comes to rule the world, any of these guys could find work in standup comedy, at least for audiences of OS developers. Unfortunately, the technical information content was practically nil, with the usual set of buzzwords (smaller, faster, more robust!) used to justify all of the efforts, and kernels shirking ever more responsibility off to applications. The conference proceedings do, however, contain single-page summaries from each panelist, with enough content to be worth reading.

THURSDAY, NOVEMBER 17

Thursday began with another set of papers on distributed shared memory. David K. Lowenthal (U. of Arizona) talked about "Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations". Filaments are a light-weight abstraction, conceptually just a stateless thread that is given some arguments and a function to execute. Filaments come in three flavors, each appropriate for a different sort of computation. Run-to-completion filaments execute once and then terminate, useful in applications such as matrix multiplication. Iterative filaments execute repeatedly, performing a barrier synchronization after each execution of all the filaments; a Jacobi iteration could use these. Fork/join filaments recursively fork off new filaments and wait for them to complete, useful in divide-and-conquer solutions such as adaptive quadrature. Filaments have no private stack (they are executed by a server thread that does have one) and are blocked along with the underlying server thread during fault processing. Multiple server threads are created per processor, so that another filament may be run while the block thread awaits the page. This package is portable, and implements multi-threaded DSM on a variety of hardware, including clusters of workstations and at least one multicomputer.

"Integrating Coherency and Recovery in Distributed Systems", presented by Michael J. Feeley (U. of Washington), is a new twist on the usual DSM, meant to support collaborative design efforts rather than parallel programming. The result is a log-based coherency mechanism layered over a persistent store, where the logs, as in the database world, provide both recoverability and coherency. This method separates the granularity of synchronization from that of updates and invalidations, performing particularly well when the size of update is smaller than the underlying virtual memory page. The logs might be viewed as a collection of distributed caches, implementing DSM for the applications.

Paulo Ferreira (INRIA and Universite Pierre et Marie Curie), presented Garbage Collection and DSM Consistency, a design which minimizes cross-node communication to provide efficient garbage collection over DSM. This project is based on the observation that, in a weakly consistent DSM system, the memory consistency requirements of the garbage collector are less strict than those of the applications. Thus the garbage collector reclaims objects independently of other copies of the same objects, without interfering with the DSM consistency protocol. Distributed cycles of dead objects are scavenged, and reliable communication support is not assumed.

The final technical session had three talks on memory management. The first, "Software Prefetching and Caching for Translation Lookaside Buffers", was by Kavita Bala (MIT). The number and cost of software TLB refills was reduced by keeping a cache of TLB entries, and by prefetching the entries needed by a process involved in a messaging RPC: those for the PC, stack pointer, and message buffer(s). The cache of TLB entries provided a hot path that handled over 90% of all TLB misses, and the prefetching prevented multiple traps resulting from a single attempted system call. Cascading TLB misses were also eliminated, though this result is less general, since it's an artifact of chips which both handle TLB misses in software and use virtually mapped page tables.

"Dynamic Page Mapping Policies for Cache Conflict Resolution on Standard Hardware", presented by Theodore H. Romer (U. of Washington), tackles the problem of cache conflict misses, both with hardware and with a software approximation of that hardware. On hardware with large, physically-indexed direct-mapped caches, a conflict miss occurs when two memory locations compete for the same cache line, even though the cache may be large enough to hold the current working set. Most approaches to solving cache conflicts involve static mapping policies, which assign a page frame to a virtual page at page-in time, on the basis of spatial locality (page coloring) or temporal locality (bin hopping). The problem inherent in these approaches is that access patterns may vary over time, making any static solution sub-optimal. A variety of mapping policies were simulated, showing that a dynamic mapping policy that adjusts to the ongoing behavior of the program outperforms any static policy for some workloads. Further, the best performance can be had by a simple hardware enhancement called the Cache Miss Lookaside Buffer, which tallies the misses of various physical page frames, allowing pages mapping to the same cache line to be moved if the number of misses becomes excessive.

"Cooperative Caching: Using Remote Client Memory to Improve File System Performance", presented by Michael D. Dahlin (U. of CA - Berkeley), examined the benefits of workstations and shared servers working together to provide a larger effective file cache. The simulations, based on traces from various workloads, showed that relatively simple cooperative algorithms can half the number of disk accesses and improve read response time by over 70% in typical workstation cluster configurations. These results become even stronger as we follow current trends toward faster networks and CPUs (relative to disks).

The final afternoon session was devoted to a Mach/Chorus workshop, where a half dozen informal presentations were made. Papers corresponding to these are not included in the conference proceedings, but hard copies (and pointers to online copies) were provided by the authors.

Frederic Ruget (Chorus Systemes and Universite Joseph Fourier) talked about Micro-Kernel Support for Trace-Replay, in which Chorus was instrumented to support tracing of a variety of events, including system calls, traps and asynchronous interrupts. This is intended to provide support for statistics gathering, profiling and tuning, visualization and understanding of system behaviors, as well as allowing for testing and fault injection.

Dejan Milojicic (OSF Research Institute) spoke about his work on "Concurrent Remote Task Creation in Mach". He found a variety of problems that caused remote task creation to scale up poorly, ranging from slow remote IPC, through pointless paging directed at a single node, to inefficient VM operations resulting in ever-increasing shadow chain length. Copy-on-write, a mainstay of Mach VM, is sub-optimal except in the local case; copy-on-reference results in much less network traffic in distributed task creation. Further work will use a tree-structured mechanism to reduce the bottleneck of sequential creation requests when large numbers of nodes are involved.

Michael Condict (OSF Research Institute) spoke about the success of his project in attaining "Microkernel Modularity with Integrated Kernel Performance". The thrust of this work was to load servers into supervisor mode once they're sufficiently stable to be trusted, and to make RPC take advantage of this "collocation" by automatically using the most efficient mechanisms permitted by location of sender and receiver, since crossing protection domains is expensive. A variety of lesser performance problems were also corrected, to produce a micro-kernel system that performs on a par with its monolithic equivalent.

"Operating System Support for Coexistence of Real-Time and Conventional Scheduling" was presented by David B. Golub (Carnegie Mellon U.). The usual approach to this problem is to have many priorities, but divided into segments, with one set treated specially to provide some semblance of real-time response. He noted that both rate-monotonic and earliest deadline first policies have some problems, so more flexibility is needed. His approach is to associate with each priority a scheduling policy, and to provide per-policy scheduling parameters. When a scheduling decision is to be made, all threads in a higher priority beat those in a lower priority, and the policy selects threads within the priority, using the scheduling parameters. The processor set and priority interfaces were extended to accommodate these changes.

"A Memory Management Mechanism and An External Resource Manager Interface for Continuous Media Objects" was presented by Satoshi Moriai (Nippon Telephone and Telegraph Corporation). This project addressed the needs of distributed multimedia to allow dynamic control of the quality of service guaranteed to an application handling continuous multimedia requests. The focus was on supplying predictable VM performance for two sorts of real-time behaviors. In temporal paging, the active region of VM migrates over time, while in real-time projection, the contents of a fixed region change continually over time. Several techniques were discussed, which included wiring memory, sharing memory between devices and applications, as well as a virtual ring-buffer with timestamp-based pageout and the use of page swapping to guarantee page availability as old pages are discarded.

Clifford W. Mercer (Carnegie Mellon U.) spoke "On Predictable Operating System Protocol Processing". He talked about the need for timely resource management, and the need to distinguish time-constrained packets. The general approach taken seemed to be resource reservation, including network bandwidth, CPU time for both the protocol and the application, as well as reserving the "attention" of the server (which shouldn't be single-threaded).