[Previous] [Up] [Next]
Go backward to Approach
Go up to Top
Go forward to Technology Transition Plans

Recent Accomplishments

The Avalanche Architectural Simulation Environment

To support the evaluation of our designs in both the distributed shared memory and message passing arenas, a multiprocessor architectural simulation environment was required. We began with Mint, an existing system. To increase its accuracy, applicability, and flexibility, we enhanced Mint as follows:

Together, all of these changes enable us now to simulate multiprogrammed multiprocessing with operating system support including dynamic memory allocation for the simulated programs; appropriate overhead costs are accounted for, avoiding the inaccuracies inherent in simulations that ignore OS involvement. These changes were accomplished with virtually no degradation in performance of the simulator; in fact, we have significantly increasing the performance of the simulator over this period.

The simulation environment has also been augmented with a cache and memory hierarchy simulator, which includes facilities for both message passing models: local memory only with various strategies for placing received message data into the memory hierarchy; and for distributed shared memory: local and global memory with cache coherence of global memory maintained with a selection of protocols. As in all simulation efforts, tradeoffs exist between accuracy and speed of simulation, so we have implemented three levels of network simulation, ranging from a contention free to flit-by-flit. All three models implement a Myrinet-like fabric with increasing degrees of authenticity.

The Direct Deposit Message Passing Protocols

To exploit the message passing latency enhancing effects that the Avalanche project is projecting, two things are required of the message passing protocols used:

We have developed the Direct Deposit protocols and associated network interface to accomplish these effects. Direct Deposit is a realization of Sender-Based Protocols. It accomplishes the first goal by taking advantage of fabric characteristics to provide some of the reliability and flow control features normally found in transport protocols and by relying on the user-level code to provide those other transport features that the application requires. Simulation results indicate that a one way transmission of a 4 byte message on a 100 MHz system requires only 9 microseconds user-to-user; this includes two unavoidable cache misses.

Direct Deposit achieves the second goal by supporting movement of message data directly from user memory to the fabric and from the fabric directly to the receiving application's memory. The data copies thus avoided amplify the importance of the proper placement of incoming message data into user memory, as there are no system-imposed cache misses due to message copying.

Context-Sensitive Cache and Communication Controller Design

Over the past year we have made significant changes to the design of the Avalanche machine. In our current design, each Avalanche node is built around a Hewlett-Packard PA-RISC 8000 processor, which is expected to ship in mid-1996. The processor is directly connected to two banks of cache SRAMs and HP's high-speed RUNWAY bus. The RUNWAY bus defines a limited cache coherence protocol, supported by the PA-RISC 8000, that we will build upon in our design of Avalanche. Also connected to the RUNWAY bus is the local memory (DRAM) controller, an I/O adaptor, and the Avalanche Cache and Communication Controller Unit (CCCU).

Figure 1 : Avalanche Node Organization
  This design should allow us to use commercially available HP products as the basis for Avalanche nodes with little modification beyond the addition of a board for the Avalanche CCCU.

Our design and simulation efforts have centered on the CCCU, which is responsible for managing the local memory hierarchy to support both efficient message passing (MP) and efficient distributed shared memory (DSM). In the current design, the CCCU is decomposed into three major functional units: the distributed shared memory (DSM) unit, the message passing (MP) unit, and the Myrinet network controller (NWC).

Figure 2 : Cache and Communication Controller Design
  The DSM unit is itself divided into two components: a cache controller (CC) and a directory controller (DC). The CC manages the local memory hierarchy and performs the protocol actions needed to maintain consistency between nodes sharing memory. The DC maintains the state of the distributed shared memory -- each block of global physical memory has a "home" node and the DC on this home node keeps track of state information such as the protocol being used to manage that block of data and the nodes that have a copy of the data. The MP unit also is divided into two components: a protocol processor (PPE) and intelligent injection unit (IIU). The PPE is responsible for performing the necessary protocol operations (e.g., fragmentation and reassembly) and managing connections. The IIU will coordinate with the cache controller to place incoming message data into the appropriate level of the memory hierarchy. The NWC transmits and receives messages from the Myrinet interconnect, queuing outgoing messages as necessary and routing incoming messages to the appropriate functional unit. In addition, all three functional units are connected to a transition buffer (TB), an SRAM memory array divided into cache-line sized entries. The primary use of the TB is to store state associated with an ongoing memory transaction, initiated by either the DSM or MP unit. However, we are considering ways to overload its functionality to support victim caching and prefetching.

We are extending our existing simulation environment to model this revised Avalanche design so that we can investigate a number of design space options, including:

As reported earlier, we have found that support for multiple consistency protocols, and in particular support for a novel multi-writer write-update protocol, reduced the cache stall time of a suite of shared memory parallel programs by 5% to 60% and reduced their running time by 10% to 28% compared to conventional designs. This significant reduction in memory overhead was largely the result of matching the consistency protocol used to manage the data with the way it is used; this reduced the amount of communication required to maintain consistency and also reduced the number of unnecessary and very expensive cache misses. This indicates that support for multiple consistency protocols is a win, and we are in the process of simulating a variety of other options to determine which provide a worthwhile performance versus complexity tradeoff.

Verification

The Avalanche design involves many concurrent protocols, including high-level protocols used to support distributed shared memory (DSM) and message passing, and low-level protocols used to manage bus transactions, to manage buffers within our context-sensitive ("intelligent") cache controller unit, etc. The correctness of these protocols is going to be crucial to the success of our experiments. It has been documented in literature that conventional simulation methods based on manually selected vectors can often miss "show-stopper" bugs in debugging complex protocols such as ours. Also, it is now widely recognized that formal verification methods can complement simulation methods quite effectively. One faculty associate (Ganesh Gopalakrishnan) and his PhD student (Ratan Nalumasu) are currently involved in applying formal methods to validate the concurrent protocols used in the Avalanche project.

We are developing verification methods that can

We have used the SPIN verifier developed at AT&T to model numerous cache management protocols used in the DSM component of Avalanche. Specific protocols modeled to date include the migratory protocol, the invalidation protocol, and the update protocol. Some aspects of these protocols were validated successfully. A wrong assumption about message receptivity (which can lead to a deadlock) was corrected as a result of our work so far. Also, states in which some of these protocols rely on in-order message delivery are being discovered in the course of our work. Based on these discoveries, we have reconfirmed our belief that formal methods must be used during the early phases of complex projects, and when used so, they facilitate concurrent engineering by allowing the testing phase (one of the most time consuming phases of a project) to begin early.

One of the major challenges faced by model-checking based formal verification tools (the class SPIN belongs to) is that of state explosion. We have identified concrete reasons for state explosion in the class of models we are dealing with, and are working on new methods to attack these problems. Specific details now follow.

We have discovered a simple method for combating state explosion in the context of SPIN, called state normalization. In order to use state normalization, the designer must first identify the symmetries in the system (currently done manually). He/she then writes a normalizer function that maps each system state $s$ to a corresponding normalized state $s^'$. Our results (on an extended version of the SPIN verifier) show that the method introduces a low overall time overhead and effects a dramatic reduction in the number of states generated. This work will be presented in the Computer Hardware Description Languages conference, Chiba, Japan, August, 1995.

Another reason for SPIN's state-explosion is due to its overly aggressive state-caching. A natural solution to this problem is not to cache states that have an in-degree of one; but (prior to our work) there was no obvious method to tell which states are likely to have in-degrees close to (ideally equal to) one. We have discovered through our experiments that the states generated by executing local transitions (transitions that commute with all other transitions) usually have an in-degree of one (or very close to one) and are therefore good candidates not to cache. A modified version of the SPIN verifier that implements our new state caching heuristic confirms that this is, indeed a good heuristic. This work is in preparation as a technical report (soon to be available through our Web pages).

We are currently working on additional methods to prevent state explosion, and thereby reduce the tedium of verifying Avalanche models. Two more reasons for state explosion are now identified. A sketch of our own verifier is then provided.

Another reason for SPIN's state explosion stems from the fact that the interleaving model it uses, by nature, tends not to "reset" some of the "final states" generated (final states are those from which there is only a silent transition back to the "top-level" state of the machine). As a result, these final-states generate many more configurations than necessary. A mode of execution where these final states are eagerly reset would greatly benefit us.

The network invariant method is a family of methods concerned with proving properties about arbitrarily sized networks. In one approach of this type, a network P1 || P2 || ... of processes is represented by a more general description of the form P || Q where Q represents a network of an arbitrary number of P's. If a process such as Q (called the "network invariant") can be found, the task of verification is greatly simplified. The discovery of network invariants is, in general, a manual process. However, it appears that at least for a simple class of protocols, network invariants can be automatically computed. Our "method" has, so far, worked on some non-trivial examples. We are in the process of rigorously defining the class of processes to which our method seems to apply, and following that we will articulate a general algorithm covering the class.

The insights gained to date make us believe that writing our own verifier would be the most productive way to proceed. We are in the process of developing our own verifier, banking on Nalumasu's experience over two summer internships at AT&T Bell Labs and Gopalakrishnan's upcoming sabbatical which will be spent at the verification group at Stanford to get a working prototype up in a month or two. We believe that directly writing verification models in C (with embedded "verification calls") would be a resource-efficient way to proceed in the scope of the Avalanche project. An interesting parallel exists between what we propose to do and what the simulation community has already done: large microprocessors are often described using dedicated C programs which serve as the simulation model for these processors.


This work was sponsored by the Space and Naval Warfare Systems Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures for Scalable Parallel Computing, ARPA order #B990 under SPAWAR contract #N00039-95-C-0018
Back to the Avalanche Project Home Page, or Computer Science Department Home Page.
Feedback to <avalanche@jensen.cs.utah.edu>.

[Previous] [Up] [Next]