Everything.

Overview of ARCHTEST

Presented at the
Tutorial and Workshop
on Formal Specification
and Verification Methods
for Shared Memory Systems

31 October 2000
Austin, Texas, USA

by
William W. Collier
Multiprocessor Diagnostics
collier@acm.org
www.mpdiag.com

The most recent version of these foilscan be found starting at www.mpdiag.com/overview.htm.
Last updated October 27, 2000.


Contents

Next

Copyright (C) 2000 Multiprocessor Diagnostics. All rights reserved.

Table of Contents


The Rules of Computation.


The Order Rules.


The Rules of Cache Coherence.


For further information on material in this talk see:

Info on ARCHTEST: http://www.mpdiag.com/archtest.html. Info on Reasoning about Parallel Archtectures: http://www.mpdiag.com/rapa.html.

---------------------------------------------

ARCHTEST.

ARCHTEST is a program which runs on a shared memory multiprocessor and seeks to determine what rules a machine fails to obey.

ARCHTEST has been used by many companies in the development and testing of new shared memory systems.

In addition, considerable research has been based on ARCHTEST by Prof. G. C. Gopalakrishnan and his students at the University of Utah.

---------------------------------------------

The Limits of Testing with ARCHTEST (I).

ARCHTEST cannot prove that a machine obeys a given architecture.

(You can fish in a lake all day and not catch any fish, but that does not prove that there are no fish in the lake).

ARCHTEST can show only that a machine violates an architecture.

(However, the output from ARCHTEST can also suggest the frequency and severity of violations of particular architectures.)

---------------------------------------------

The Limits of Testing with ARCHTEST (II).

ARCHTEST cannot prove that a machine obeys an arbitrary architecture.

There are only nine unique tests known.

Therefore, we can test only for the architectures associated with those nine tests.

---------------------------------------------

The Limits of Testing with ARCHTEST (III).

There are some rules which every machine is expected to obey all the time, namely CMP and UPO.

There are some tests which involve only, say, rule R and CMP and UPO. If such a test fails, then it is clear the machine violates rule R.

Such a test is called a pure test for R.

For most rules there exists no pure test.

---------------------------------------------

Example: Test T7 from ARCHTEST.

Test T700. Seek a relaxation of A(CMP,UPO,RR,CC1).

     Initially, A = B = U[i] = V[i] = 0, i = 0 to K.

           P1                P2
     L00: A    = 0;    L00: B    = 0;
     L01: -    = A;    L01: -    = B;
     L02: U[0] = B;    L02: V[0] = A;
     L10: A    = 1;    L10: B    = 1;
     L11: -    = A;    L11: -    = B;
     L12: U[1] = B;    L12: V[1] = A;
     L20: A    = 2;    L20: B    = 2;
     L21: -    = A;    L21: -    = B;
     L22: U[2] = B;    L22: V[2] = A;
     L30: A    = 3;    L30: B    = 3;
     L31: -    = A;    L31: -    = B;
     L32: U[3] = B;    L32: V[3] = A; etc.

     Seek 7.1. U[i] < j and V[j] < i.  d1 = V[ U[i]+1 ] - i < 0.
     Seek 7.2. V[i] < j and U[j] < i.  d2 = U[ V[i]+1 ] - i < 0.

       To show:  Not A(CMP,UPO,RR,CC1).
To understand the ideas behind this test, we need to understand executions, event, rules, and architectures. ---------------------------------------------

Executions and Events.

An execution is the act of executing a program P, which operates on some input I to produce a result R.

An execution is composed of one or more processes, each of which is composed of one or more statements.

A statement consists of one or more read events along with one write event for each process in the execution.

---------------------------------------------

One write event per process.

Test T600.

Initially, A = B = U = V = X = Y = 0.
 
  P1      P2      P3      P4
A = 1;  U = A;  X = B;  B = 1;
        V = B;  Y = A;  
        
Initially, A = B = U = X = 1, V = Y= 0.
The events for the first statement in P2 are:

(P2,L1,R,1,A,S2)
(P2,L1,W,1,U,S1)
(P2,L1,W,1,U,S2)
(P2,L1,W,1,U,S3)
(P2,L1,W,1,U,S4)
The write events represent the times at which the new value of A becomes visible to the stores, S1, S2, S3, S4 for processes P1, P2, P3, P4, respectively.

---------------------------------------------

Rules.

Programmers expect that machines will exhibit certain behaviors.

These behaviors called rules.

The computation rules are so obvious that they are usually taken for granted, but it is not possible to reason about the behavior of a machine without defining these rules explicitly.

The uniprocessor order rules describe the ordering rules that both uniprocessors and multiprocessors need to obey.

The order rules are simple. Program order (PO) requires that all events in a process occur in the order defined by the underlying program.

The cache coherence rules describe the patterns in time at which the output from a statement becomes visible to all processes in an execution.

Each rule induces a partial ordering on the events in an execution. (Actually it is more complicated than that.)

---------------------------------------------

The Computation Rules.

The computation rules are so intuitively obvious that they are almost never stated explicitly. However, they must be defined formally and precisely in order for them to be used in reasoning about interactions between processes on shared data.

A failure to obey one of the rules of computation might look like this:

     Initially, A = 0.

         P1
     L0: A = 1;

     Terminally, A = 13.
Operand A is initially zero. Then it is set to one. But at the end of the execution it has the value 13. Clearly, the machine on which the execution occurred did not compute in the way one normally expects a machine to behave.

---------------------------------------------

The Statement-Read-Write Rule (SRW).

Write
     (P1,L0,R,0,B,S1) <srw (P1,L0,W,0,A,S1)
to show that in
         P1
     L0: A = B;
the read event(s) in a statement occur before the write events.

---------------------------------------------

The Compute-Read-Write Rule (CWR).

     Initially, A = X = 0.

         P1             P2
     L0: A = 1;     L0: X = A;
If X = 0 at the end of the execution, then it must be that the read event in process P2 occurred before the write event in process P1. Represent this by
     (P2,L0,R,0,A,S2) <crw (P1,L0,W,1,A,S2)
to show that statement L0 in process P2 read a 0 from operand A in S2 before statement L0 in process P1 wrote a 1 value into operand A in S2.

---------------------------------------------

The Compute-Write-Read Rule (CWR).

     Initially, A = X = 0.

         P1             P2
     L0: A = 1;     L0: X = A;

     Terminally, A = X = 1.
If X = 1 at the end of the execution, then it must be that the write event in process P1 occurred before the read event in process P2. Represent this by
     (P1,L0,W,1,A,S2) <cwr (P2,L0,R,1,A,S2)
to show that statement L0 in process P1 wrote a 1 into operand A in S2 (S2 is P2's view of storage) before statement L0 in process P2 read a 1 value from operand A in S2.

---------------------------------------------

The Compute-Write-Write Rule (CWW).

     Initially, A = 0.

         P1             P2
     L0: A = 1;     L0: A = 2;
If A = 2 (in both S1 and S2, that is, in both P1's and P2's view of storage) after this program has executed, then it must be that the write events in process P1 (for both S1 and S2) occurred before the write events in process P2. Represent this by
     (P1,L0,W,1,A,S1) <cww (P2,L0,W,2,A,S1)
     (P1,L0,W,1,A,S2) <cww (P2,L0,W,2,A,S2)
to show that statement L0 in process P1 wrote a 1 into operand A in S1 and in S2 before statement L0 in process P2 wrote a 2 into operand A in S1 and in S2, respectively.

---------------------------------------------

The Rules of Uniprocessor Order (UPO).

On a uniprocessor the parts of the following three code fragments (1) involving operations on X and (2) involving operations on the store for the process containing the code fragments, would always be executed in order.

     X = 1;     X = 1;     Y = X;
     Y = X;     X = 2;     X = 1;
Thus for this fragment
       Pi
     X = 1;
     X = 2;
we would have
     (Pi,-,W,1,X,Si) <upo (Pi,-,W,2,X,Si)
but UPO would not mean that (Pi,-,W,1,X,Si) would necessarily occur before any of the other write events in the second statement.

The necessity of this behavior is so obvious that UPO is treated like CMP as being part of the rules that all machines, even multiprocessors, automatically obey.

There is one case which is not so clear cut. If two fetches from the same operand are made, will they always occur in order? This seems to me to be a reasonable assumption, but others may not agree.

     Y = X;
     Z = X;
I have chosen to give this rule and the associated relation special names, URR and <urr, respectively.

---------------------------------------------

Order Rules.

Read-Read Order (RR)

If a machine obeys RR, and if a first read operation on shared data logically occurs in a program before a second read operation on shared data, then the first read operation is executed before the second read operation.

Write-Write Order (WW)

If a machine obeys WW, and if a first write operation on shared data logically occurs in a program before a second write operation on shared data, then the first write operation is executed before the second write operation.

Write-Read Order (WR)

If a machine obeys WR, and if a write operation on shared data logically occurs in a program before a read operation on shared data, then the write operation is executed before the read operation.

Many machines allow WR to be disobeyed.

Read-Write Order (RW)

If a machine obeys RW, and if a read operation on shared data logically occurs in a program before a write operation on shared data, then the read operation is executed before the write operation.

Program Order (PO)

A machine obeys PO if and only if it obeys RR, WW, RW, and WR.

---------------------------------------------

Rules of Cache Coherence.

CC1

All processes see each operand change value at exactly the same instant. CC1 is also called write atomicity (CC1).

CC2

All processes see exactly the same sequence of changes of values for all operands. CC2 is also called write synchronization.

CC3

All processes see exactly the same sequence of changes of values for each separate operand.

CC4

Different processes may see an operand assume different sequences of values.

---------------------------------------------

Some shorthand.

Suppose a system obeys CC1, and let w1 and w2 be any two write events in the same statement. If it is known that event x happens before w1, say, then, due to the write atomicity of the system, it must be that x happens before w2.

Symbolically, if x <hb w1, then x <hb w2.

This is inconvenient to write when one is trying to establish the existence of a circuit. A more economical notation is to write x <hb w1 =cc1 w2 to show that x occurs before w1 which occurs at the same time as w2.

---------------------------------------------

Architectures.

An architecture is the set of rules a machine obeys.

If a machine obeys rules R1, R2, ..., Rn, then denote by A(R1,R2,...,Rn) the architecture for the machine.

--------------------------------------------

Indistinguishable Machines.

Let machine M1 obey architecture A1, and M2 obey A2.

Suppose that every program which can calculate any given result on M1 can calculate the same result on M2, and vice versa.

Then M1 and M2 are indistinguishable.

This is written: M1 <=> M2.

Alternatively, A1 and A2 are indistinguishable.

Otherwise, M1 and M2 (A1 and A2) are distinguishable.

---------------------------------------------

Theorem: CC1 <=> CC2.

Let A1 be any architecture containing CC1, and A2 be any architecture containing CC2 (but not CC1).

Then A1 <=> A2. The architecture for Sun machines requires machines to be CC2. The documents cite this theorem as support for the machines being in fact CC1.

---------------------------------------------

How to Deduce a Violation.

How to test machine M to see if it violates the architecture A associated with test T?

  1. Assume that M obeys A.
  2. Run Test T.
  3. Analyze the output data. Use the rules in a to attempt to show that a circuit of events existed in time.
  4. If found, conclude that M does not obey A.
This is a reductio ad absurdum argument:
  1. Make an assumption.
  2. Deduce a falsehood.
  3. Conclude that the assumption was false.
---------------------------------------------

A simple example.


     Initially, A = B = U = V = 0.

           P1                P2
     L00: A    = 1;    L00: B    = 1;
     L01: -    = A;    L01: -    = B;
     L02: U    = B;    L02: V    = A;

     Terminally, A = B = 1, U = V = 0.

     (P1,L02,R,0,B,S1) <crw (P2,L00,W,1,B,S1) (<crw by CC1)
                       =cc1 (P2,L00,W,1,B,S2)
                       <upo (P2,L01,R,1,B,S2)
                       <rr  (P2,L02,R,0,A,S2)
                       <crw (P1,L00,W,1,A,S2)
                       =cc1 (P1,L00,W,1,A,S1)
                       <upo (P1,L01,R,1,A,S1)
                       <rr  (P1,L02,R,0,B,S1)
---------------------------------------------

Explanation of the circuit.

     (P1,L32,R,4,B,S1) <crw (P2,L50,W,5,B,S1)
.

     (P2,L50,W,5,B,S1) =cc1 (P2,L50,W,5,B,S2)
By write atomicity the write into B in S1 occurred at the same instant as the write into B in S2.

     (P2,L50,W,5,B,S2) <upo (P2,L51,R,5,B,S2)
By uniprocessor order the write into B in S2 in statement L50 occurred before the read from B in S2 in statement L51.

     (P2,L51,R,5,B,S2) <rr  (P2,L52,R,2,A,S2)
By read order the read event in statement L51 occurred before the read event in statement L52.

The second half of the circuit repeats the reasoning for the first half.

The existence of the circuit shows that the machine failed to obey A(CMP,UPO,RR,CC1).

---------------------------------------------

Test T700. Seek a relaxation of A(CMP,UPO,RR,CC1).

     Initially, A = B = U[i] = V[i] = 0, i = 0 to K.

           P1                P2
     L00: A    = 0;    L00: B    = 0;
     L01: -    = A;    L01: -    = B;
     L02: U[0] = B;    L02: V[0] = A;
     L10: A    = 1;    L10: B    = 1;
     L11: -    = A;    L11: -    = B;
     L12: U[1] = B;    L12: V[1] = A;
     L20: A    = 2;    L20: B    = 2;
     L21: -    = A;    L21: -    = B;
     L22: U[2] = B;    L22: V[2] = A;
     L30: A    = 3;    L30: B    = 3;
     L31: -    = A;    L31: -    = B;
     L32: U[3] = B;    L32: V[3] = A; etc.

     Seek 7.1. U[i] < j and V[j] < i.  d1 = V[ U[i]+1 ] - i < 0.
     Seek 7.2. V[i] < j and U[j] < i.  d2 = U[ V[i]+1 ] - i < 0.

       To show:  Not A(CMP,UPO,RR,CC1).
If a circuit is found, then d measures the length of the circuit.

Test T7 fails on many machines. The reason is that fetches are made from the store buffer, and the store buffer is not subjected to the same MESI discipline as the cache. ---------------------------------------------

Results of running ARCHTEST.

Of 26 machines tested so far, 6 were sequentially consistent (SC).

Eight machines relaxed the rule of WR, that is, they allowed reads to occur before logically preceding writes Call these machines POK machines.

Twelve machines relaxed both WR and CC1. This has been called processor consistency (Pcon).

More into at: http://www.mpdiag.com/results.html. ============================================

============================================

Extensions to ARCHTEST.

I seek feedback on what might be most useful to others.

---------------------------------------------

Files as sink operands.

Instead of writing into operands in shared memory, write into shared files.

Goal: See the results of subjecting both operating system and hardware to the same tests.

---------------------------------------------

Test real shared memory programs on real SMP systems.

What programs access shared memory?

Ans: Critical section routines.

(Are there others? Probably very, very few.)

Most critical section routines use atomic instructions but there are several in the literature which do not:

  1. Various notes in the CACM about 1965 and after.
  2. Algorithms for Mutual Exclusion by M. Raynal, MIT Press, 1986.
Will put this work on the web with a copyleft statement. Hope others will play with it.

Question: can critsec routines run correctly on a PCon machine?

---------------------------------------------

Generate invalid data (I).

Problem: how to test ARCHTEST, that is, how to tell that it correctly interprets the data generated by a machine?

It is easy to generate data that appears to come from a SC machine. ARCHTEST does this and correctly concludes that the machine is SC.

How to generate data that appears to come for a non-SC machine?

At present ARCHTEST generates valid data and then modifies it slightly at random points. Hand-checking shows that ARCHTEST correctly picks out the errors.

---------------------------------------------

Generate invalid data (II).

Proposal: write a program to generate data for each test in ARCHTEST as if it came from a machine obeying any possible architecture (within the confines of the rules described above). Put it on the web with a copyleft statement.

This program would help ensure that ARCHTEST analyzes data correctly. It could also be of use to other researchers.

---------------------------------------------

Program to prove CC1 <=> CC2.

In RAPA I proved that CC1 <=> CC2, that is, any machine that is CC2 also appears to be CC1.

One important lemma necessary for the proof had 224 subcases. I wrote a program in Pascal to generate and to check each subcase.

I will be rewriting this program in C and making it available with a copyleft statement in order for others to build on it.

---------------------------------------------

Garbage.


Prev Contents Next