Cache and Communication Control Unit Design
Each Avalanche node will contain a PA-RISC 8000 processor, a Context
Sensitive Cache and Communication Unit (CSCCU),
and local memory, as illustrated in Figure 1.
In our current design,
the CSCCU is decomposed into three functional units:
the cache controller (CC), the directory controller (DC),
and the network controller (NWC).
The CC manages the memory hierarchy and performs the protocol actions
needed to maintain consistency between the various levels of the hierarchy
and between separate nodes in the case of shared memory operation.
The DC maintains the state of the distributed shared memory -- each block
of global physical memory has a "home" node and the DC on this home node
keeps track of state information such as the protocol being used to manage
that block of data and the nodes that have a copy of the data.
The NWC transmits messages to and receives messages from the
Myrinet interconnect,
queuing outgoing messages as necessary and routing incoming messages to
the appropriate functional unit.
The CC, DC, and NWC are connected using separate FIFO queues.
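The directory state described above can be pictured with a small sketch. It is illustrative only: the class and field names, the round-robin assignment of blocks to home nodes, and the default protocol are assumptions made for the example, not details of the Avalanche design. Only the 128-byte block size comes from the text.

```python
from dataclasses import dataclass, field

LINE_SIZE = 128  # bytes per block; the cache-line size given in the text


@dataclass
class DirectoryEntry:
    """State a home node's DC might keep for one block (hypothetical layout)."""
    protocol: str = "write-invalidate"          # protocol managing this block
    sharers: set = field(default_factory=set)   # nodes that hold a copy


class Directory:
    """Directory for the blocks whose home is `node_id` (illustrative only)."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.entries = {}  # block number -> DirectoryEntry

    def home_node(self, addr):
        # Assumption: blocks are interleaved round-robin across nodes.
        return (addr // LINE_SIZE) % self.num_nodes

    def record_copy(self, addr, requester):
        """Record that `requester` now holds the block containing `addr`."""
        if self.home_node(addr) != self.node_id:
            raise ValueError("this node is not the home of that block")
        entry = self.entries.setdefault(addr // LINE_SIZE, DirectoryEntry())
        entry.sharers.add(requester)
        return entry
```

For example, with four nodes, `Directory(node_id=1, num_nodes=4).record_copy(128, 3)` adds node 3 to the sharer set of block 1, whose home under this interleaving is node 1.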
Messages are used to communicate between independent control units, both
within a single node and across nodes.
They are used both to issue commands (e.g., as part of handling a cache miss, the CC
may request that the DC managing the state of the required data modify its
state) and to transmit data (e.g., as part of handling an incoming cache
fill message, the NWC sends a message to the CC requesting that the data be
placed in the appropriate location in the memory hierarchy).
In addition, all three functional units are connected to a transition
buffer (TB), a fast multi-ported SRAM array divided into
cache-line-sized (128-byte) entries.
The TB is normally used to store the data associated with a message, but we
are considering ways to overload its functionality to support victim
caching and prefetching.
Figure 1: Avalanche Node Organization
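The interaction between the functional units, their FIFO queues, and the transition buffer might be modeled roughly as follows. This is a minimal sketch under stated assumptions: the `Message` fields, the command name `CACHE_FILL`, the eight-entry TB, and the policy of freeing a TB entry when its message is consumed are all hypothetical. The unit names and the 128-byte entry size are taken from the text.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

TB_ENTRY_SIZE = 128  # bytes per transition buffer entry, per the text


@dataclass
class Message:
    dest: str                      # "CC", "DC", or "NWC"
    command: str                   # hypothetical command name
    tb_slot: Optional[int] = None  # TB entry holding the data, if any


class Node:
    """One node's CSCCU queues and transition buffer (illustrative only)."""

    def __init__(self, tb_entries=8):  # TB depth is an assumption
        self.queues = {"CC": deque(), "DC": deque(), "NWC": deque()}
        self.tb = [None] * tb_entries

    def alloc_tb(self, data):
        """Store message data in a free TB entry and return its index."""
        if len(data) > TB_ENTRY_SIZE:
            raise ValueError("data does not fit in one TB entry")
        slot = self.tb.index(None)  # raises ValueError when the TB is full
        self.tb[slot] = data
        return slot

    def send(self, msg):
        """Append a message to the destination unit's FIFO."""
        self.queues[msg.dest].append(msg)

    def receive(self, unit):
        """Pop the oldest message for `unit`; free its TB entry if it has one."""
        msg = self.queues[unit].popleft()
        data = None
        if msg.tb_slot is not None:
            data = self.tb[msg.tb_slot]
            self.tb[msg.tb_slot] = None
        return msg.command, data
```

In the cache-fill example from the text, the NWC would call `alloc_tb` with the incoming line and `send(Message("CC", "CACHE_FILL", slot))`; the CC would then call `receive("CC")` to obtain the command and the data to place in the memory hierarchy.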
The results that we have previously reported were based on the simulation
of a simpler CSCCU design that did not accurately model internal interlocks
and pipelining within the CSCCU.
We are in the process of extending our simulation model to include this
decomposed structure of the CSCCU, accurately model the internal
interlocks, and investigate a wide variety of design possibilities.
Among the design space options that we are investigating are:
- the number of levels of cache and its organization at each level
(e.g., direct-mapped vs. set-associative, write-around
vs. write-through, write-allocate vs. no-write-allocate, etc.),
- the value and complexity of supporting multiple DSM consistency
protocols in hardware (e.g., write-invalidate, multi-writer
write-update, and migratory),
- the value of allowing outgoing message data to be read from, and
incoming message data to be written to, any level of the memory hierarchy,
- the effectiveness of using a release state buffer and
per-word dirty/valid bits to support a delayed write-update protocol
and efficient handling of small (under 128-byte) messages,
- the effectiveness of using part of the Transition Buffer to support
a software-controllable prefetch unit,
- the effectiveness of using part of the Transition Buffer to support
victim caching,
- the appropriate degree of integration of DSM and message passing,
- the appropriate granularity of pipelining within the CSCCU,
- what form of built-in synchronization, if any, should
be supported in hardware,
- the effectiveness and complexity of providing low-level
broadcast/multicast support for updates and barrier synchronization,
and
- the potential benefits of and hardware requirements for supporting
compiler-driven prefetching, protocol selection, and synchronization
optimizations.
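To make the per-word dirty bit option in the list above concrete, the following sketch tracks which words of a 128-byte line were modified, so that only those words need to be sent when updates are propagated at a release. It is an assumption-laden illustration: the 4-byte word size, the bitmask representation, and the `release` interface are invented for the example and do not describe the release state buffer itself.

```python
LINE_SIZE = 128                          # bytes per cache line, per the text
WORD_SIZE = 4                            # assumption: 4-byte words
WORDS_PER_LINE = LINE_SIZE // WORD_SIZE  # 32 words under that assumption


class LineState:
    """One cache line with per-word dirty bits (illustrative only)."""

    def __init__(self):
        self.data = bytearray(LINE_SIZE)
        self.dirty = 0  # bit i set means word i was written since the last release

    def write_word(self, index, value):
        """Store an unsigned word and mark it dirty."""
        if not 0 <= index < WORDS_PER_LINE:
            raise IndexError("word index out of range")
        offset = index * WORD_SIZE
        self.data[offset:offset + WORD_SIZE] = value.to_bytes(WORD_SIZE, "big")
        self.dirty |= 1 << index

    def release(self):
        """Return (index, bytes) for each dirty word, then clear the dirty bits."""
        updates = [
            (i, bytes(self.data[i * WORD_SIZE:(i + 1) * WORD_SIZE]))
            for i in range(WORDS_PER_LINE)
            if (self.dirty >> i) & 1
        ]
        self.dirty = 0
        return updates
```

Under these assumptions, a line in which two words were written produces an update carrying 8 bytes of data rather than the full 128-byte line, which is the kind of saving the delayed write-update and small-message options aim for.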
As reported earlier, we have found that support for multiple consistency
protocols, and in particular support for a novel multi-writer write-update
protocol, reduced the cache stall time of a suite of shared memory parallel
programs by 5% to 60% and reduced their running time by 10% to 28%
compared to conventional designs.
This significant reduction in memory overhead was largely the result of
matching the consistency protocol used to manage the data with the way it
is used; this reduced the amount of communication required to maintain
consistency and also reduced the number of unnecessary and very expensive
cache misses.
These results indicate that support for multiple consistency protocols is
a clear win. We are in the process of simulating a wide variety of
other options to determine which provide a worthwhile performance versus
complexity tradeoff.
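The intuition that matching the protocol to the access pattern reduces communication can be seen in a toy message-counting model. This is not the simulation behind the figures quoted above: the function names and the cost model are assumptions, and the model ignores acknowledgements, directory traffic, and message sizes. It assumes one writer and a fixed set of consumers that all read the block after each round of writes.

```python
def invalidate_messages(rounds, consumers, writes_per_round):
    """Messages under a simplified write-invalidate protocol.

    Per round: the first write invalidates every consumer's copy (1 message
    each), later writes in the round hit locally, and each consumer then
    misses and refetches the block (1 request + 1 data reply each).
    """
    if writes_per_round < 1:
        return 0
    return rounds * consumers * 3


def update_messages(rounds, consumers, writes_per_round):
    """Messages under a simplified write-update protocol.

    Every write is pushed to every consumer (1 message each); consumer
    reads then hit in their own caches.
    """
    return rounds * consumers * writes_per_round


if __name__ == "__main__":
    for writes in (1, 10):
        inv = invalidate_messages(100, 4, writes)
        upd = update_messages(100, 4, writes)
        print(f"writes/round={writes}: invalidate={inv}, update={upd}")
```

In this model, data that is written once and then read by its consumers favors write-update, while data that is written many times between reads favors write-invalidate, so neither protocol is best for every block.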
Our current design target for the entire CSCCU chip is one million
transistors, which will severely limit the amount of on-chip buffering
available and the size of the FIFOs.
To increase the amount of buffering available and the size of the cache(s),
we are considering using multi-chip module (MCM) technology to incorporate
additional unmounted off-the-shelf SRAM alongside the CSCCU.
This work was sponsored by the
Space and Naval Warfare Systems Command (SPAWAR) and the
Advanced Research Projects Agency (ARPA),
Communication and Memory Architectures for
Scalable Parallel Computing,
ARPA order #B990 under SPAWAR contract #N00039-95-C-0018.
Feedback to <avalanche@jensen.cs.utah.edu>.