StudyWeb Award

University of Utah
Department of Computer Science


Avalanche Scalable Parallel Processor Project


Publications

ASCOMA: An Adaptive Hybrid Shared Memory Architecture

(ICPP98 -- Aug 1998)

Abstract

Scalable shared memory multiprocessors traditionally use either a cache coherent non-uniform memory access (CC-NUMA) or simple cache-only memory architecture (S-COMA) memory architecture. Recently, hybrid architectures that combine aspects of both CC-NUMA and S-COMA have emerged. In this paper, we present two improvements over other hybrid architectures. The first improvement is a page allocation algorithm that prefers S-COMA pages at low memory pressures. Once the local free page pool is drained, additional pages are mapped in CC-NUMA mode until they suffer sufficient remote misses to warrant upgrading to S-COMA mode. The second improvement is a page replacement algorithm that dynamically backs off the rate of page remappings from CC-NUMA to S-COMA mode at high memory pressure. This design dramatically reduces the amount of kernel overhead and the number of induced cold misses caused by needless thrashing of the page cache. The resulting hybrid architecture is called adaptive S-COMA (AS-COMA). AS-COMA exploits the best of S-COMA and CC-NUMA, performing like an S-COMA machine at low memory pressure and like a CC-NUMA machine at high memory pressure. AS-COMA outperforms CC-NUMA under almost all conditions, and outperforms other hybrid architectures by up to 17% at low memory pressure and up to 90% at high memory pressure.

Shared Memory as a Basis for Conservative Distributed Architectural Simulation

(April 1997)

Abstract

This paper describes experience in parallelizing an execution-driven architectural simulation system used in the development and evaluation of the Avalanche distributed architecture. It reports on a specific application of conservative distributed simulation on a shared memory platform. Various communication-intensive synchronization algorithms are described and evaluated. Performance results on a bus-based shared memory platform are reported, and extension and scalability of the implementation to larger distributed shared memory configurations are discussed. Also addressed are specific characteristics of architectural simulations that contribute to decisions relating to the conservatism of the approach and to the achievable performance.

Efficient Communication Mechanisms for Cluster Based Parallel Computing

(December 1996)

Abstract

The key to crafting an effective scalable parallel computing system lies in minimizing the delays imposed by the system. Of particular importance are communications delays, since parallel algorithms must communicate frequently. The communication delay is a system-imposed latency. The existence of relatively inexpensive high performance workstations and emerging high performance interconnect options provide compelling economic motivation to investigate NOW/COW (network/cluster of workstation) architectures. However, these commercial components have been designed for generality. Cluster nodes are connected by longer physical wire paths than found in special-purpose supercomputer systems. Both effects tend to impose intractable latencies on communication. Even larger system-imposed delays result from the overhead of sending and receiving messages. This overhead can come in several forms, including CPU occupancy by protocol and device code as well as interference with CPU access to various levels of the memory hierarchy. Access contention becomes even more onerous when the nodes in the system are themselves symmetric multiprocessors. Additional delays are incurred if the communication mechanism requires processes to run concurrently in order to communicate with acceptable efficiency. This paper presents the approach taken by the Utah Avalanche project which spans user level code, operating system support, and network interface hardware. The result minimizes the constraining effects of latency, overhead, and loosely coupled scheduling that are common characteristics in NOW-based architectures.

Support for Distributed Shared Memory in Avalanche

(ASPLOS'96 Workshop -- October 1996)
(Ghostview incompatible, yet printable postscript slides available here.)

Abstract

This talk concentrated on the unique design features present in Avalanche for supporting scalable shared memory, including support for both the Simple COMA and CC-NUMA architectures, support for multiple coherency protocols, and some low-level design optimizations. The talk also discusses a number of challenges that we faced in using off-the-shelf hardware over which we had little to no control. Detailed implementation issues and example operations are included to illustrate the DSM architecture, and preliminary simulation-derived results are presented.

Myrinet Simulation Package (Version 2.0)

(June 1996)

Abstract

Version 2.0 of the Myrinet simulation package was designed to allow a high degree of configurability of the modeled network. Version 1.0 modeled only simple square mesh topologies with 4-port switches, and users could specify only a limited number of switch parameters. As Myricom released larger and faster versions of their Myrinet switches, the V1.0 simulation model became obsolete. Users of the V2.0 package can specify arbitrary network topologies composed of Myrinet switches with different number of ports. For example, 4-port and 32-port switches can be used in a single system. Because the V2.0 model supports arbitrary topologies, simple X-then-Y source routing is no longer sufficient to model the required routing. Thus, users of the V2.0 package must specify the routing table themselves. In addition, to track improvements to the circuit technologies used in the Myrinet switches, the clock rate, latency and bandwidth have been parameterized. Users can change the parameters in order to meet their simulation needs.

Myrinet Simulation Package Source Code (235 KB)


PAINT: PA Instruction Set Interpreter

(March 1996)

Abstract

This document describes Paint, an instruction set simulator based on Mint. Paint interprets the PA-RISC instruction set, and has been extended to support the Avalanche Scalable Computing Project. These extensions include a new process model that allows multiple programs to be run on each processor and the ability to model both kernel and user code on each processor. In addition, a new address space model more accurately detects when a program is accessing an illegal virtual address, allows a program's virtual address space to grow dynamically, and does lazy allocation of physical pages as programs need them.

Note that this document is intended to be an addendum to the original Mint technical report, which the reader should consult for an overview of the Mint simulation environment and terminology. This preliminary version of the document is intended for people who plan to use Paint as a simulation engine, not modify it.

Click here to request the latest version of the PA-Int software package (~8 MB).


An Improvement to Partial Order Reductions

(March 1996)

Abstract

In this paper, we present a new partial order reduction algorithm that can help reduce both space and time requirements of automatic verifiers. The partial order reduction algorithms described in [God95, Hol94] (both incorporated in SPIN [Hol91]) were observed to yield very little savings in many practical cases due to the proviso in them. Our algorithm, called the two-phase algorithm, is different from these algorithms in that it avoids the proviso, and follows a new execution strategy consisting of alternating phases of depth-first-search and partial order reduction. The two-phase algorithm is shown to preserve safety properties. It has been implemented in a new verifier which, like SPIN, takes Promela as input. Comparisons between these algorithms and the two-phase algorithm are provided for a significant number of examples, including directory based protocols of a new multiprocessor where significant performance gains are obtained.

Message Passing Support in the Avalanche Widget

(March 1996)

Abstract

Minimizing communication latency in message passing multiprocessing systems is critical. An emerging problem in these systems is the latency contribution costs caused by the need to percolate the message through the memory hierarchy (at both sending and receiving nodes) and the additional cost of managing consistency within the hierarchy. This paper, considers three important aspects of these costs: cache coherence, message copying, and cache miss rates. The paper then shows via a simulation study how a design called the Widget can be used with existing commercial workstation technology to significantly reduce these costs to support efficient message passing in the Avalanche multiprocessing system.

Low Latency Workstation Cluster Communications Using Sender-Based Protocols

(January 1996)

Abstract

The use of workstations on a local area network to form scalable multicomputers has become quite common. A serious performance bottleneck in such ``carpet clusters'' is the communication protocol that is used to send data between nodes. We report on the design and implementation of a class of communication protocols, known as sender-based, in which the sender specifies the locations at which messages are placed in the receiver's address space. The protocols are shown to deliver near-link latency and near-link bandwidth using Medusa FDDI controllers, within the BSD 4.3 and HP-UX 9.01 operating systems. The protocols are also shown to be flexible and powerful enough to support common distributed programming models, including but not limited to RPC, while maintaining expected standards of system and application security and integrity.

A Comparison of Software and Hardware Synchronization Mechanisms for Distributed Shared Memory Multiprocessors

(November 1995)

Abstract

Efficient synchronization is an essential component of parallel computing. The designers of traditional multiprocessors have included hardware support only for simple operations such as compare-and-swap and load-linked/store-conditional, while high level synchronization primitives such as locks, barriers, and condition variables have been implemented in software. With the advent of directory-based distributed shared memory (DSM) multiprocessors with significant flexibility in their cache controllers, it is worthwhile considering whether this flexibility should be used to support higher level synchronization primitives in hardware. In particular, as part of maintaining data consistency, these architectures maintain lists of processors with a copy of a given cache line, which is most of the hardware needed to implement distributed locks.

We studied two software and four hardware implementations of locks and found that hardware implementation can reduce lock acquire and release times by 25-94% compared to well tuned software locks. In terms of macrobenchmark performance, hardware locks reduce application running times by up to 75% on a synthetic benchmark with heavy lock contention and by 3%-6% on a suite of SPLASH-2 benchmarks. In addition, emerging cache coherence protocols promise to increase the time spent synchronizing relative to the time spent accessing shared data, and our study shows that hardware locks can reduce SPLASH-2 execution times by up to 10-13% if the time spent accessing shared data is small.

Although the overall performance impact of hardware lock mechanisms varies tremendously depending on the application, the added hardware complexity on a flexible architecture like FLASH or Avalanche is negligible, and thus hardware support for high level synchronization operations should be provided.


PPE Interface and Functional Specification

(August 1995)

Abstract

This document describes the interface and functional specification of a Protocol Processing Engine (PPE) for workstation clusters. The PPE is intended to provide the support necessary to implement low latency protocols requiring only low resource (cpu and bus bandwidth) consumption.

Explicit-Enumeration Based Verification Made Memory-Efficient

(February 1995)

Abstract

We investigate techniques for reducing the memory requirements of a model checking tool employing explicit enumeration. Two techniques are studied in depth: (1) exploiting symmetries in the model, and (2) exploiting sequential regions in the model. The first technique resulted in a significant reduction in memory requirements at the expense of an increase in run time. It is capable of finding progress violations at much lower stack depths. In addition, it is more general than two previously published methods to exploit symmetries, namely scalar sets and network invariants. The second technique comes with no time overheads and can effect significant memory usage reductions directly related to the amount of sequentiality in the model. Both techniques have been implemented as part of the SPIN verifier.

Direct Deposit: A Basic User-Level Protocol for Carpet Clusters

(January 1995)

Abstract

This note describes the Direct Deposit Protocol (DDP), a simple protocol for multicomputing on a carpet cluster. This protocol is an example of a user-level protocol to be layered on top of the low-level, sender-based protocols for the Protocol Processing Engine. The protocol will be described in terms of its system call interface and an operational decription.

An Argument for Simple COMA

(January 1995)

Abstract

We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page-granularity - similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at a cache line granularity.

By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.


Case Studies in Symbolic Model Checking

(March 1994)

Abstract

Formal verification of hardware and software systems has long been recognized as an essential step in the development process of a system. It is of importance especially in concurrent systems that are more difficult to debug than sequential systems. Tools that are powerful enough to verify real-life systems have become available recently.Model checking tools have become quite popular because of their ability to carry out proofs with minimal human intervention. In this paper we report our experience with SMV, a symbolic model verifier on practical problems of significant sizes. We present verification of a software system, a distributed shared memory protocol, and a hardware system, the crossbar arbiter. We discuss modeling of these systems in SMV and their verification using temporal logic CTL queries. We also describe the problems encountered in tackling these examples and suggest possible solutions.

Reducing Consistency Traffic and Cache Misses in the Avalanche Multiprocessor

Abstract

For a parallel architecture to scale effectively, communication latency between processors must be avoided. We have found that the source of a large number of avoidable cache misses is the use of hardwired write-invalidate coherency protocols, which often exhibit high cache miss rates due to excessive invalidations and subsequent reloading of shared data. In the Avalanche project at the University of Utah, we are building a 64-node multiprocessor designed to reduce the end-to-end communication latency of both shared memory and message passing programs. As part of our design efforts, we are evaluating the potential performance benefits and implementation complexity of providing hardware support for multiple coherency protocols. Using a detailed architecture simulation of Avalanche, we have found that support for multiple consistency protocols can reduce the time parallel applications spend stalled on memory operations by up to 66% and overall execution time by up to 31%. Most of this reduction in memory stall time is due to a novel release-consistent multiple-writer write-update protocol implemented using a write state buffer.

Avalanche: Cache and DSM Protocol Design

Abstract

This note describes a prototype DSM protocol and directory cache controller design for the architectural simulation of a carpet cluster of nodes with modified PA7100 CPUs and two levels of cache. The design is functionally decomposed into three separate blocks: the Context Sensitive Cache Controller Unit, the Directory Controller, and the Network Controller. Four access protocols are supported by the design: local, migratory, write invalidate, and write shared.

Avalanche: A Communication and Memory
Architecture for Scalable Parallel Computing

This no longer represents the current architecture, but is an early overview of the project goals.

Abstract

As the gap between processor and memory speeds widens, system designers will inevitably incorporate increasingly deep memory hierarchies to maintain the balance between processor and memory system performance. At the same time, most communication subsystems are permitted access only to main memory and not a processor's top level cache. As memory latencies increase, this lack of integration between the memory and communication systems will seriously impede interprocessor communication performance and limit effective scalability. In the Avalanche project we are redesigning the memory architecture of a commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new multi-level context sensitive cache that is tightly coupled to the communication fabric. The primary goal of Avalanche's integrated cache and communication controller is attacking end to end communication latency in all of its forms. This includes cache misses induced by excessive invalidations and reloading of shared data by write-invalidate coherence protocols and cache misses induced by depositing incoming message data in main memory and faulting it into the cache. An execution-driven simulation study of Avalanche's architecture indicates that it can reduce cache stalls by 5-60% and overall execution times by 10-28%.

Evaluating the Potential of Programmable
Microprocessor Cache Controllers

Abstract

The next generation of scalable parallel systems (e.g., machines by KSR, Convex, and others) will have shared memory supported in hardware, unlike most current generation machines (e.g., offerings by Intel, nCube, and Thinking Machines). However, current shared memory architectures are constrained by the fact that their cache controllers are hardwired and inflexible, which limits the range of programs that can achieve scalable performance. This observation has led a number of researchers to propose building programmable multiprocessor cache controllers that can implement a variety of caching protocols, support multiple communication paradigms, or accept guidance from software. To evaluate the potential performance benefits of these designs, we have simulated five SPLASH benchmark programs on a virtual multiprocessor that supports five directory-based caching protocols. When we compared the off-line optimal performance of this design, wherein each cache line was maintained using the protocol that required the least communication, with the performance achieved when using a single protocol for all lines, we found that use of the ``optimal'' protocol reduced consistency traffic by 10-80%, with a mean improvement of 25-35%. Cache miss rates also dropped by up to 25%. Thus, the combination of programmable (or tunable) hardware and software able to exploit this added flexibility, e.g., via user pragmas or compiler analysis, could dramatically improve the performance of future shared memory multiprocessors.
see also UUCS Technical Reports

This work was sponsored by the Space and Naval Warfare Systems Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures for Scalable Parallel Computing, ARPA order #B990 under SPAWAR contract #N00039-95-C-0018.
Feedback to <avalanche@jensen.cs.utah.edu>.
Last modified around December 18, 1996.