Communications and Memory Architectures for Scalable Parallel Computing

Avalanche Project Summary -- Summer 1996


Objective

The objective is the design and implementation of a communications interface for commodity workstations that intelligently integrates message passing into the memory hierarchy and supports distributed shared memory in a manner that is fast, flexible, and adaptive. The goal is to provide these capabilities to production workstations without requiring replacement of any existing subsystems and without compromising workstation performance in any way. The result will be a 64-node multicomputer prototype, built from HP Runway-based workstations, the Myrinet communications fabric, and our custom interface board, that demonstrates extremely low latency user-to-user message passing and highly effective distributed shared memory.

Approach

Avalanche is a communications architecture that recognizes the importance of memory hierarchies in the real end-to-end cost of interprocessor communications. It provides both message passing (MP) and distributed shared memory (DSM) because each approach has certain advantages over the other and both together promise to address a wide range of applications better than either in isolation.

Integrating MP and DSM in the same interface exploits the resources that both need in order to interact closely with the memory hierarchy and coherence bus of the host node. DSM relies on this closeness for maintaining consistency of shared data, while MP needs to address the growing contribution of memory hierarchy costs to end-to-end communications costs. The Avalanche MP approach utilizes a combination of simple, but powerful, message delivery capabilities and a set of safe lightweight protocols to deliver data directly into a receiver's virtual address space at the proper level in the physical memory hierarchy. The Avalanche DSM approach combines a number of consistency protocols with two adaptively selected shared memory models (Simple-COMA and CC-NUMA) to enable the system to match DSM support to application requirements. Its pipelined, highly parallel hardware DSM support will provide efficient shared memory operations while minimizing the occupancy-based latency to which processor-based approaches are susceptible.

While close integration with the memory hierarchy is important to the success of Avalanche, cost and real performance are equally important. Approaches that significantly diminish the normal performance of the constituent workstations are unacceptable. A decrease in single system performance increases the level of multicomputing efficiency that the approach must deliver. Moreover, it raises doubts about the applicability of the approach to future systems where basic workstation performance will undoubtedly be higher. The Avalanche approach is to design an interface that can be plugged into the system bus. It does not rely on replacing (and reinventing) either the cache that is close to the processor or the memory controller. Both subsystems are too tightly tuned to a particular architecture and are therefore too costly to replace without a significant effect on system balance.


Recent Accomplishments

Finalized behavioral design of the message passing, system bus interface, and network interface subcomponents of the Avalanche Widget. Initiated VHDL implementation of the system bus interface as a test chip.

Extended the PAINT simulator to support the operating system features needed to simulate both OS-level code and client applications.

Developed a low-level protocol and network interface to the Myrinet that adjusts packet sizes dynamically, in real time. This minimizes the network fall-through delays of multiple packets while avoiding holding wormhole routes with idle cycles when outgoing message data is delayed in the host memory hierarchy.
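The trade-off in dynamic packet sizing can be illustrated with a small sketch. This is not the Avalanche protocol itself, whose details are not given here; it is a hypothetical policy, written in Python, that keeps a packet open while message data keeps arriving from host memory and closes it as soon as the data stalls, so that a wormhole route is never held across idle cycles. The function name, the per-cycle boolean model of data availability, and the max_packet_words limit are all assumptions made for illustration.

```python
def packetize(data_ready, max_packet_words):
    """Split an outgoing message into packets under a hypothetical policy.

    data_ready: one boolean per cycle; True means a word of message data
        is available from the host memory hierarchy in that cycle
        (an assumed, simplified model).
    max_packet_words: assumed upper bound on packet size, in words.

    Returns the list of packet sizes (in words) that would be sent.
    """
    if max_packet_words < 1:
        raise ValueError("max_packet_words must be at least 1")

    packets = []
    current = 0  # words in the packet currently holding a route
    for ready in data_ready:
        if ready:
            current += 1
            if current == max_packet_words:
                # Packet is full: close it and release the route.
                packets.append(current)
                current = 0
        elif current > 0:
            # Data stalled in the memory hierarchy: end the packet now
            # rather than hold the wormhole route through idle cycles.
            packets.append(current)
            current = 0
    if current > 0:
        packets.append(current)  # flush the final partial packet
    return packets
```

With a steady data stream the sketch produces a few large packets, which keeps the number of per-packet fall-through delays small; with a stall in the middle of the stream it produces two shorter packets and releases the route in between.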

Developed a behavioral design of the DSM subcomponents of the Widget and initiated implementation within PAINT.

Extended the Myrinet simulation software (part of PAINT) to support multiple switch sizes and completely user-specified topologies. This allows detailed simulations of the kinds of Myrinets we expect to see in the real world, composed of switches of varying sizes with irregular connection topologies.
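To make "switches of varying sizes with irregular connection topologies" concrete, the following Python sketch shows one way such a fabric could be described and checked. It does not reflect the input format or internals of the Myrinet simulator, which are not described in this summary; the function names, the dictionary of switch port counts, and the link list are all hypothetical.

```python
from collections import deque


def build_topology(switch_ports, links):
    """Build an adjacency map for a hypothetical irregular fabric.

    switch_ports: dict mapping switch name -> number of ports.
    links: list of (endpoint_a, endpoint_b) pairs; any endpoint that is
        not a key of switch_ports is treated as a host.
    Raises ValueError if a switch has more links than ports.
    """
    adjacency = {}
    used = {name: 0 for name in switch_ports}
    for a, b in links:
        for end in (a, b):
            adjacency.setdefault(end, [])
            if end in used:
                used[end] += 1
                if used[end] > switch_ports[end]:
                    raise ValueError(f"switch {end} has more links than ports")
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def hop_count(adjacency, src, dst):
    """Return the minimum number of links between src and dst, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Example: a 4-port and an 8-port switch joined by one link, three hosts.
fabric = build_topology(
    {"s4": 4, "s8": 8},
    [("h0", "s4"), ("h1", "s4"), ("s4", "s8"), ("h2", "s8")],
)
```

Here hop_count(fabric, "h0", "h2") walks h0, s4, s8, h2 and so crosses three links, while two hosts on the same switch are two links apart.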


1997 Plans

Complete simulation of DSM subcomponents of the Widget, followed by finalization of their behavioral design.

Hardware design of all Widget subcomponents; integration of subcomponents into a complete interface.

Construct a 64-node prototype using HP Runway-based workstations (one to three PA-8000 or PA-7200 processors per node) and the Myrinet interconnect fabric.


Technology Transition Plans

One way to view this effort is that it produces a significantly improved network interface card for the Myrinet. Numerous DOD projects are using the Myrinet for prototyping, and they can all benefit from this work. Our prototype efforts directly target Runway-based HP workstations. This fits nicely with some of the Aegis platform efforts at the Naval Surface Warfare Center that are based on a military version of this technology. It is too early to solidify these partnerships since we do not yet have the critical ASIC working. However, in a year this situation will change, and hence paving the way for this technology to be used is an important agenda item for the coming year.

On a more public front, the PAINT multicomputer simulation environment has been made available for other researchers via the Avalanche web page. PAINT is a PA-RISC binary direct execution simulator which includes the ability to perform multiprogramming on each simulated node and the ability to run a small OS kernel. It has been tested on HP PA-RISC workstations running either HP-UX (version 9.05) or 4.3BSD operating systems. Contact Mark Swanson (swanson@cs.utah.edu) or Leigh Stoller (stoller@cs.utah.edu). Hewlett Packard Laboratories is also making efforts to move this technology in house.

The Myrinet simulator is available for researchers wishing to model that fabric. It features multiple switch sizes and completely user-specifiable connection topologies, as well as several levels of simulation accuracy (from a simple, output-contention only model to a detailed internal fabric contention model). It runs within either MINT or PAINT and has the same system requirements as those simulators. Contact Chen-Chi Kuo (chenchi@cs.utah.edu).




This work was sponsored by the Space and Naval Warfare Systems Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures for Scalable Parallel Computing, ARPA order #B990 under SPAWAR contract #N00039-95-C-0134.
Feedback to <avalanche@jensen.cs.utah.edu>.
Last modified around July 16, 1996.