University of Utah
Department of Computer Science
Avalanche Scalable Parallel Processor Project
(ICPP98 -- Aug 1998)
Abstract
Scalable shared memory multiprocessors traditionally use either a
cache coherent non-uniform memory access (CC-NUMA) or simple
cache-only memory architecture (S-COMA) memory architecture.
Recently, hybrid architectures that combine aspects of both CC-NUMA and
S-COMA have emerged.
In this paper, we present two improvements over other hybrid architectures.
The first improvement is a page allocation algorithm that prefers S-COMA
pages at low memory pressures.
Once the local free page pool is drained, additional pages are mapped in
CC-NUMA mode until they suffer sufficient remote misses to warrant
upgrading to S-COMA mode.
The second improvement is a page replacement algorithm that dynamically
backs off the rate of page remappings from CC-NUMA to S-COMA mode at high
memory pressure.
This design dramatically reduces the amount of kernel overhead and the
number of induced cold misses caused by needless thrashing of the page
cache.
The resulting hybrid architecture is called adaptive S-COMA
(AS-COMA).
AS-COMA exploits the best of S-COMA and CC-NUMA, performing like an S-COMA
machine at low memory pressure and like a CC-NUMA machine at high memory
pressure.
AS-COMA outperforms CC-NUMA under almost all conditions, and outperforms
other hybrid architectures by up to 17% at low memory pressure and up to
90% at high memory pressure.
(April 1997)
Abstract
This paper describes experience in parallelizing an execution-driven
architectural simulation system used in the development and evaluation
of the Avalanche distributed architecture. It reports on a specific
application of conservative distributed simulation on a shared memory
platform. Various communication-intensive synchronization algorithms
are described and evaluated. Performance results on a bus-based
shared memory platform are reported, and extension and scalability of
the implementation to larger distributed shared memory configurations
are discussed. Also addressed are specific characteristics of
architectural simulations that contribute to decisions relating to the
conservatism of the approach and to the achievable performance.
(December 1996)
Abstract
The key to crafting an effective scalable parallel computing system lies in
minimizing the delays imposed by the system. Of particular importance are
communications delays, since parallel algorithms must communicate
frequently. The communication delay is a system-imposed latency. The
existence of relatively inexpensive high performance workstations and
emerging high performance interconnect options provide compelling economic
motivation to investigate NOW/COW (network/cluster of workstation)
architectures. However, these commercial components have been designed for
generality. Cluster nodes are connected by longer physical wire paths than
found in special-purpose supercomputer systems. Both effects tend to
impose intractable latencies on communication. Even larger system-imposed
delays result from the overhead of sending and receiving messages. This
overhead can come in several forms, including CPU occupancy by protocol and
device code as well as interference with CPU access to various levels of
the memory hierarchy. Access contention becomes even more onerous when the
nodes in the system are themselves symmetric multiprocessors. Additional
delays are incurred if the communication mechanism requires processes to
run concurrently in order to communicate with acceptable efficiency. This
paper presents the approach taken by the Utah Avalanche project which spans
user level code, operating system support, and network interface hardware.
The result minimizes the constraining effects of latency, overhead, and
loosely coupled scheduling that are common characteristics in NOW-based
architectures.
Support for Distributed Shared Memory in Avalanche
(ASPLOS'96 Workshop -- October 1996)
(Ghostview incompatible, yet printable postscript slides available
here.)
Abstract
This talk concentrated on the unique design features present in
Avalanche for supporting scalable shared memory, including support for
both the Simple COMA and CC-NUMA architectures, support for multiple
coherency protocols, and some low-level design optimizations. The talk
also discusses a number of challenges that we faced in using
off-the-shelf hardware over which we had little to no control. Detailed
implementation issues and example operations are included to illustrate
the DSM architecture, and preliminary simulation-derived results are
presented.
(June 1996)
Abstract
Version 2.0 of the Myrinet simulation package was designed to allow a
high degree of configurability of the modeled network. Version 1.0
modeled only simple square mesh topologies with 4-port switches, and
users could specify only a limited number of switch parameters. As
Myricom released larger and faster versions of their Myrinet switches,
the V1.0 simulation model became obsolete. Users of the V2.0 package
can specify arbitrary network topologies composed of Myrinet switches
with different number of ports. For example, 4-port and 32-port
switches can be used in a single system. Because the V2.0 model
supports arbitrary topologies, simple X-then-Y source routing is no
longer sufficient to model the required routing. Thus, users of the
V2.0 package must specify the routing table themselves. In addition,
to track improvements to the circuit technologies used in the Myrinet
switches, the clock rate, latency and bandwidth have been
parameterized. Users can change the parameters in order to meet their
simulation needs.
Myrinet Simulation Package Source Code (235 KB)
(March 1996)
Abstract
This document describes Paint, an instruction set simulator based on
Mint. Paint interprets the PA-RISC instruction set, and
has been extended to support the Avalanche Scalable Computing Project.
These extensions include a new process model that allows multiple
programs to be run on each processor and the ability to model both
kernel and user code on each processor. In addition, a new address
space model more accurately detects when a program is accessing an
illegal virtual address, allows a program's virtual address space to
grow dynamically, and does lazy allocation of physical pages as
programs need them.
Note that this document is intended to be an addendum to the original Mint
technical report, which the reader should consult for an overview of the
Mint simulation environment and terminology. This preliminary version of
the document is intended for people who plan to use Paint as a simulation
engine, not modify it.
Click here
to request the latest version of the PA-Int software package (~8 MB).
(March 1996)
Abstract
In this paper, we present a new partial order reduction algorithm that
can help reduce both space and time requirements of automatic
verifiers. The partial order reduction algorithms described in
[God95, Hol94] (both incorporated in SPIN [Hol91]) were observed to
yield very little savings in many practical cases due to the proviso
in them. Our algorithm, called the two-phase algorithm, is
different from these algorithms in that it avoids the proviso, and
follows a new execution strategy consisting of alternating phases of
depth-first-search and partial order reduction. The two-phase
algorithm is shown to preserve safety properties. It has been
implemented in a new verifier which, like SPIN, takes Promela as input.
Comparisons between these algorithms and the two-phase algorithm are
provided for a significant number of examples, including directory
based protocols of a new multiprocessor where significant performance
gains are obtained.
(March 1996)
Abstract
Minimizing communication latency in message passing multiprocessing systems
is critical. An emerging problem in these systems is the latency
contribution costs caused by the need to percolate the message through the
memory hierarchy (at both sending and receiving nodes) and the additional
cost of managing consistency within the hierarchy. This paper, considers
three important aspects of these costs: cache coherence, message copying,
and cache miss rates. The paper then shows via a simulation study how a
design called the Widget can be used with existing commercial workstation
technology to significantly reduce these costs to support efficient message
passing in the Avalanche multiprocessing system.
(January 1996)
Abstract
The use of workstations on a local area network to form scalable
multicomputers has become quite common. A serious performance
bottleneck in such ``carpet clusters'' is the communication protocol
that is used to send data between nodes. We report on the design and
implementation of a class of communication protocols, known as
sender-based, in which the sender specifies the locations at which
messages are placed in the receiver's address space. The protocols are
shown to deliver near-link latency and near-link bandwidth using
Medusa FDDI controllers, within the BSD 4.3 and HP-UX 9.01 operating
systems. The protocols are also shown to be flexible and powerful
enough to support common distributed programming models, including but
not limited to RPC, while maintaining expected standards of system and
application security and integrity.
(November 1995)
Abstract
Efficient synchronization is an essential component of parallel
computing. The designers of traditional multiprocessors have included
hardware support only for simple operations such as compare-and-swap
and load-linked/store-conditional, while high level synchronization
primitives such as locks, barriers, and condition variables have been
implemented in software. With the advent of directory-based
distributed shared memory (DSM) multiprocessors with significant
flexibility in their cache controllers, it is worthwhile considering
whether this flexibility should be used to support higher level
synchronization primitives in hardware. In particular, as part of
maintaining data consistency, these architectures maintain lists of
processors with a copy of a given cache line, which is most of the
hardware needed to implement distributed locks.
We studied two software and four hardware implementations of locks and
found that hardware implementation can reduce lock acquire and release
times by 25-94% compared to well tuned software locks. In terms of
macrobenchmark performance, hardware locks reduce application running
times by up to 75% on a synthetic benchmark with heavy lock contention
and by 3%-6% on a suite of SPLASH-2 benchmarks. In addition, emerging
cache coherence protocols promise to increase the time spent
synchronizing relative to the time spent accessing shared data, and
our study shows that hardware locks can reduce SPLASH-2 execution
times by up to 10-13% if the time spent accessing shared data is
small.
Although the overall performance impact of hardware lock mechanisms
varies tremendously depending on the application, the added hardware
complexity on a flexible architecture like FLASH or Avalanche is
negligible, and thus hardware support for high level synchronization
operations should be provided.
(August 1995)
Abstract
This document describes the interface and functional specification of
a Protocol Processing Engine (PPE) for workstation clusters. The PPE
is intended to provide the support necessary to implement low latency
protocols requiring only low resource (cpu and bus bandwidth) consumption.
(February 1995)
Abstract
We investigate techniques for reducing the memory requirements of a
model checking tool employing explicit enumeration. Two techniques
are studied in depth: (1) exploiting symmetries in the model, and
(2) exploiting sequential regions in the model. The first technique
resulted in a significant reduction in memory requirements at the
expense of an increase in run time. It is capable of finding progress
violations at much lower stack depths. In addition, it is more
general than two previously published methods to exploit symmetries,
namely scalar sets and network invariants. The second technique comes
with no time overheads and can effect significant memory usage
reductions directly related to the amount of sequentiality in the
model. Both techniques have been implemented as part of the SPIN verifier.
(January 1995)
Abstract
This note describes the Direct Deposit Protocol (DDP), a simple
protocol for multicomputing on a carpet cluster. This protocol is an
example of a user-level protocol to be layered on top of the
low-level, sender-based protocols for the Protocol Processing Engine.
The protocol will be described in terms of its system call interface
and an operational decription.
(January 1995)
Abstract
We present design details and some initial performance results of
a novel scalable shared memory multiprocessor architecture.
This architecture features the
automatic data migration and replication capabilities of cache-only
memory architecture (COMA) machines, without the accompanying hardware
complexity.
A software layer manages
cache space allocation at a page-granularity - similarly to
distributed virtual shared memory (DVSM) systems, leaving
simpler hardware to maintain shared memory
coherence at a cache line granularity.
By reducing the hardware complexity, the machine cost and development time
are reduced.
We call the resulting hybrid hardware and software multiprocessor
architecture Simple COMA.
Preliminary results indicate that the
performance of Simple COMA is comparable to that of more complex
contemporary all-hardware designs.
(March 1994)
Abstract
Formal verification of hardware and software systems has long been
recognized as an essential step in the development process of a system.
It is of importance especially in concurrent systems that are more
difficult to debug than sequential systems. Tools that are powerful
enough to verify real-life systems have become available recently.Model
checking tools have become quite popular because of their ability to
carry out proofs with minimal human intervention. In this paper we
report our experience with SMV, a symbolic model verifier on practical
problems of significant sizes. We present verification of a software
system, a distributed shared memory protocol, and a hardware system,
the crossbar arbiter. We discuss modeling of these systems in SMV and
their verification using temporal logic CTL queries. We also describe
the problems encountered in tackling these examples and suggest possible
solutions.
Abstract
For a parallel architecture to scale effectively, communication
latency between processors must be avoided.
We have found that the source of a large number of avoidable cache
misses is the use of hardwired write-invalidate coherency protocols,
which often exhibit high cache miss rates due to excessive
invalidations and subsequent reloading of shared data.
In the Avalanche project at the University of Utah, we are building a
64-node multiprocessor designed to reduce the end-to-end communication
latency of both shared memory and message passing programs.
As part of our design efforts, we are evaluating the potential
performance benefits and implementation complexity of providing
hardware support for multiple coherency protocols.
Using a detailed architecture simulation of Avalanche, we have found
that support for multiple consistency protocols can reduce the time
parallel applications spend stalled on memory operations by up to 66%
and overall execution time by up to 31%.
Most of this reduction in memory stall time is due to a novel
release-consistent multiple-writer write-update protocol implemented
using a write state buffer.
Abstract
This note describes a prototype DSM protocol and directory cache controller
design for the architectural simulation of a carpet cluster of nodes
with modified PA7100 CPUs and two levels of cache.
The design is functionally decomposed into three separate blocks:
the Context Sensitive Cache Controller Unit, the Directory Controller, and
the Network Controller.
Four access protocols are supported by the design: local, migratory, write
invalidate, and write shared.
This no longer represents the current architecture, but is an
early overview of the project goals.
Abstract
As the gap between processor and memory speeds widens, system designers
will inevitably incorporate increasingly deep memory hierarchies to
maintain the balance between processor and memory system performance.
At the same time, most communication subsystems are permitted
access only to main memory and not a processor's top level cache.
As memory latencies increase, this lack of integration between the
memory and communication systems will seriously impede interprocessor
communication performance and limit effective scalability.
In the Avalanche project we are redesigning the memory architecture of a
commercial RISC multiprocessor, the HP PA-RISC 7100, to include a new
multi-level context sensitive cache that is tightly coupled to the
communication fabric.
The primary goal of Avalanche's integrated cache and communication
controller is attacking end to end communication latency in all of its
forms.
This includes cache misses induced by excessive invalidations and reloading
of shared data by write-invalidate coherence protocols and cache misses
induced by depositing incoming message data in main memory and faulting it
into the cache.
An execution-driven simulation study of Avalanche's architecture indicates
that it can reduce cache stalls by 5-60% and overall execution times
by 10-28%.
Abstract
The next generation of scalable parallel systems (e.g., machines by KSR,
Convex, and others) will have shared memory supported in hardware, unlike
most current generation machines (e.g., offerings by Intel, nCube, and
Thinking Machines). However, current shared memory architectures are
constrained by the fact that their cache controllers are hardwired and
inflexible, which limits the range of programs that can achieve scalable
performance. This observation has led a number of researchers to propose
building programmable multiprocessor cache controllers that can
implement a variety of caching protocols, support multiple communication
paradigms, or accept guidance from software. To evaluate the potential
performance benefits of these designs, we have simulated five SPLASH
benchmark programs on a virtual multiprocessor that supports five
directory-based caching protocols. When we compared the off-line optimal
performance of this design, wherein each cache line was maintained using
the protocol that required the least communication, with the performance
achieved when using a single protocol for all lines, we found that use of
the ``optimal'' protocol reduced consistency traffic by 10-80%, with a
mean improvement of 25-35%. Cache miss rates also dropped by up to 25%.
Thus, the combination of programmable (or tunable) hardware and software
able to exploit this added flexibility, e.g., via user pragmas or compiler
analysis, could dramatically improve the performance of future shared
memory multiprocessors.
see also
UUCS Technical Reports
This work was sponsored by
the Space and Naval Warfare Systems Command (SPAWAR) and
Advanced Research Projects Agency (ARPA),
Communication and Memory Architectures for
Scalable Parallel Computing,
ARPA order #B990 under SPAWAR contract #N00039-95-C-0018.
Feedback to <avalanche@jensen.cs.utah.edu>.
Last modified around December 18, 1996.