Myrinet Synchronized Multichannel API


Motivation

The Virtual Prototyping project requires low-latency communication between an SGI OCTANE workstation running IRIX and a PowerPC 604 single-board computer running the real-time OS VxWorks.

The OCTANE updates a visual display of a virtual world, and the PowerPC handles the interaction of a robotic manipulator with this virtual world. The PowerPC supplies the OCTANE with information about the position and orientation of each object and the manipulator in the virtual world, and the OCTANE supplies the PowerPC with the global closest point, i.e. the point out of all the object surfaces which is closest to the manipulator's end effector. (Other communication does occur, but these are the most latency sensitive.)

Both computers perform their work in cycles, and each needs only the most recent values of the data sent to it. But it is also important that the visual simulation receive a set of positions that belong together. For example, if the manipulator has grasped and is moving an object, it is important for the visual simulation to receive position information that is synchronized: the position information for the manipulator and the object should be derived from one particular cycle of the PowerPC's calculations. If it is not, the visual simulation will show them moving with respect to one another when the PowerPC is actually moving them in unison.

To use a byte-stream protocol such as TCP/IP, we would have to wrap it with a protocol which would extract the data for the most recent complete cycle. Using conventional shared memory (say, using SBS Bit-3's broadcast memory system) would eliminate the network overhead, but would require additional hardware, as well as implementing a 'rate-matching' scheme to synchronize the channels and provide a consistent set to the reader.

The API I describe below is an attempt to implement low overhead, low latency replicated shared memory communication which provides for the synchronization of several channels of data, using myrinet as the communications fabric. I use MSM (Myrinet Synchronized Multichannel API) to label this method. While using myrinet hardware in this way requires no less an investment in special purpose hardware than a hardware shared memory implementation, and though the hardware shared memory approach would involve much less software engineering, this approach uses hardware we already have. Further, other research efforts here involve low latency protocols using myrinet. This work will give me experience badly needed before tackling these other low latency protocols.

The Flavor

The intent of this api is to provide a reader and writer each with a block of memory consisting of N objects of size M bytes (where N is the number of channels and M is the size of each channel) which can be written by the writer and read by the reader. The reader and writer both perform their reads/writes of this memory inside begin_frame/end_frame function calls. The reader is guaranteed to see a snapshot of the writer's memory as of the writer's most recent end_frame call.
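The snapshot guarantee can be illustrated without any Myrinet hardware. The sketch below is a single-process model, not the MSM implementation: the names sim_link, sim_write, sim_end_frame and sim_read are hypothetical, the channel count and size are arbitrary, and the network transfer is replaced by a memcpy. It shows only the property stated above, namely that the reader's view changes at the writer's end of frame and never in the middle of one.

```c
#include <assert.h>
#include <string.h>

#define SIM_CHANNELS 4      /* N: number of channels (illustrative value) */
#define SIM_SIZE     8      /* M: bytes per channel (illustrative value)  */

/* Hypothetical model of one writer-to-reader connection. */
typedef struct {
    unsigned char staging[SIM_CHANNELS][SIM_SIZE];    /* writer's memory      */
    unsigned char published[SIM_CHANNELS][SIM_SIZE];  /* what the reader sees */
} sim_link;

/* Writer side: change one channel inside a frame. Nothing is published yet. */
void sim_write(sim_link *l, int channel, const void *data)
{
    assert(channel >= 0 && channel < SIM_CHANNELS);
    memcpy(l->staging[channel], data, SIM_SIZE);
}

/* Writer side: end of frame. All channels become visible together. */
void sim_end_frame(sim_link *l)
{
    memcpy(l->published, l->staging, sizeof l->published);
}

/* Reader side: copy one channel out of the most recent published snapshot. */
void sim_read(const sim_link *l, int channel, void *out)
{
    assert(channel >= 0 && channel < SIM_CHANNELS);
    memcpy(out, l->published[channel], SIM_SIZE);
}
```

In this model a sim_read issued after sim_write but before sim_end_frame still returns the previous frame's contents, which is the behavior the begin_frame/end_frame bracket is meant to provide.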

The API

void msm_err(char *s, int error);
	
Prints a message describing the error given by the parameter. Prepends the string s to this message. Works like perror does for standard error codes.
int
msm_connect_write(int local_unit,
	          int local_port,
		  int reader_id,
	          int reader_port,
	          int number_of_channels,
                  int size_of_each_channel,
                  int *err);

int 
msm_connect_read(int local_unit,
		 int local_port,
                 int writer_id,
                 int writer_port,
                 int number_of_channels,
                 int size_of_each_channel,
                 int *err);
	
Connects to a remote host as a writer or reader. Since each host can have more than one myrinet card, you have to specify which one to use with the local_unit parameter. Each interface in the host is numbered, and unit 0 shows up as /hw/myri0 in IRIX, unit 1 as /hw/myri1, etc. In addition, each interface on the network has a unique id number. You specify the target partner using this id (writer_id or reader_id). Since many things can go wrong, an error code is returned in err, which can be examined using msm_err(). A non-negative connection id is returned on success, and -1 is returned on failure.

The interface id's of all myrinet equipped hosts will appear in a human-readable routing file placed in some standard location.

int
msm_begin_frame(int cid);

int
msm_end_frame(int cid);
Marks the beginning and ending of a frame. The msm_begin_frame call will block until (for writers) all the new channels have been sent to the partner, or (for readers) the channel memory has been updated to reflect the most recent complete frame received.

Returns 0 on success, -1 on failure. Should fail only if cid is not valid.

Note: msm_begin_frame can block. For readers, it will only block until the DMA operations which bring new data into memory are complete. For writers, it will block until all new data in the last frame has been correctly received and acknowledged by the receiving interface card. Note that this has nothing to do with what the partner process is doing; the receiving interface card receives and acknowledges packets independent of what any user process is doing. If the receiving host has crashed (and thus the receiving interface card cannot acknowledge), this call will block for about 10 seconds while the system tries to resend. After that, the status will go to the unconnected state, and the begin_frame call will return. The connection will be reestablished when the receiving host reboots and the partner process restarts.

int
msm_channel_write(int cid,
                  int channel_number,
                  void *buffer);

int
msm_channel_read(int cid,
                 int channel_number,
                 void *buffer);
	
The write function will copy a number of bytes equal to the size of the channel in bytes from buffer into the appropriate channel. The read function will copy the new data into buffer ONLY if this channel had new data this frame.

The write function returns 0 on success. The read function returns a status integer indicating if the data copied from this channel was new, valid, or neither. Both return -1 on failure, which occurs only if cid is invalid, the channel_number is invalid, or the connection was not the proper type.

void *
msm_buffer_address(int cid);
	
This API function returns a pointer to the memory containing the channels so you can access it directly, perhaps overlaying objects onto it, to support zero-copy operation. Use in conjunction with msm_mark_write() and msm_check_read(), as demonstrated below.

Will return null if the cid was invalid.

int
msm_check_read(int cid,
               int channel_number);
	
(Use on reader connections only.) Returns an integer describing the status of the data in the given channel. If the status ANDED with MSM_VALID is nonzero, the data is valid, i.e. contains data sent from the writer. Otherwise, the channel will contain zeroes. This is an indication that during the life of the connection, the writer has not sent any data on this channel.

If you don't like zeros for the default value for channels which haven't received data, you can grab a pointer to the buffer with msm_buffer_address() and set the channel memory to whatever you want. (Make sure you do this between the begin/end calls.) The data will stay as you set it until the writer sends data for those channels.

If the status ANDED with MSM_NEW is nonzero, the data in this channel is new as of this frame.

Returns status on success, -1 on failure. Will fail if cid or the channel number is bad, or the connection is not a reader.
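The three cases described above (never written, valid but unchanged, new this frame) can be separated with a small decoder. This is a sketch only: classify_status and the channel_state enum are hypothetical helpers, not part of the API, and the numeric values given to MSM_VALID and MSM_NEW here are placeholders so that the fragment compiles on its own. The real values come from msmapi.h and may differ.

```c
#include <assert.h>

/* ASSUMPTION: placeholder values so the sketch compiles by itself.
   The real definitions live in msmapi.h and may be different. */
#ifndef MSM_VALID
#define MSM_VALID 0x1
#endif
#ifndef MSM_NEW
#define MSM_NEW   0x2
#endif

/* Hypothetical classification of a msm_check_read() return value. */
typedef enum {
    CH_ERROR,   /* -1: bad cid, bad channel number, or not a reader       */
    CH_EMPTY,   /* writer has never sent this channel; it holds zeroes    */
    CH_STALE,   /* holds writer data, but nothing new arrived this frame  */
    CH_FRESH    /* holds writer data that is new as of this frame         */
} channel_state;

channel_state classify_status(int status)
{
    /* The decoding only makes sense if the two flags are distinct bits. */
    assert((MSM_VALID & MSM_NEW) == 0);

    if (status < 0)
        return CH_ERROR;
    if (!(status & MSM_VALID))
        return CH_EMPTY;
    return (status & MSM_NEW) ? CH_FRESH : CH_STALE;
}
```

A reader loop could then switch on classify_status(msm_check_read(cid, i)) between the begin and end calls, and skip any work for channels that come back CH_STALE.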

int
msm_mark_write(int cid,
	       int channel_number);
	
(Use on writer connections only.) If you change the data in a channel by writing into the memory area returned by msm_buffer_address, you must inform the API of the change by calling msm_mark_write().

Returns 0 on success, -1 on failure. Fails if cid or the channel number is invalid.

int
msm_disconnect(int cid);  
	
Disconnects the given connection. All resources associated with the connection are released. In particular, the memory returned by a call to msm_buffer_address() will no longer be available after a call to msm_disconnect().

Returns 0 on success, -1 on failure. Fails if cid is invalid.

int msm_status(int cid);
	
Returns an integer describing the status of the connection. If bit one is set, the partner process is connected and receiving or transmitting data. If not, the partner process is not connected, although the rest of the API continues to function normally.

For reader connections only: if bit 2 is set, then at least one of the channels has new data this frame. If bit 2 is not set, none of the channels has new data this frame.

Returns the status on success, -1 on failure. Fails if cid is invalid.

In order for two hosts on a myrinet to communicate using this method, one must call msm_connect_read and the other must call msm_connect_write. The writer targets the reader's host and port, and vice versa. If the reader and writer agree on port, channel size and number of channels, this establishes a one-way transmission of the requested number of channels. To establish two-way communication, set up two one-way connections. You can establish as many connections in either direction as the fixed number of ports per interface (currently 8) and memory will allow. Each connection can have any number and size of channels. (Both the number of channels and the size of each channel in bytes must be nonzero and divisible by four.)
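Because a connect call can fail for many reasons, it can be convenient to check the documented constraints before calling it. The following is a sketch under the rules stated above (ports 0 through 7, channel count and channel size nonzero and divisible by four). The helper names params_look_valid and total_buffer_bytes, and the macro PORTS_PER_INTERFACE, are hypothetical and not part of the API.

```c
#include <assert.h>
#include <stddef.h>

/* "currently 8" according to the text; the macro name is made up here. */
#define PORTS_PER_INTERFACE 8

/* Hypothetical pre-flight check. Returns 1 if the arguments satisfy the
   documented constraints, 0 otherwise. It says nothing about whether the
   port is free or whether the partner agrees on the same values. */
int params_look_valid(int port, int number_of_channels, int size_of_each_channel)
{
    if (port < 0 || port >= PORTS_PER_INTERFACE)
        return 0;
    if (number_of_channels <= 0 || number_of_channels % 4 != 0)
        return 0;
    if (size_of_each_channel <= 0 || size_of_each_channel % 4 != 0)
        return 0;
    return 1;
}

/* Size in bytes of the channel memory: N channels of M bytes each. */
size_t total_buffer_bytes(int number_of_channels, int size_of_each_channel)
{
    assert(number_of_channels > 0 && size_of_each_channel > 0);
    return (size_t)number_of_channels * (size_t)size_of_each_channel;
}
```

For the configuration used in the examples (100 channels of 48 bytes), params_look_valid accepts the arguments and total_buffer_bytes gives 4800 bytes of channel memory per connection.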

This does not follow a client/server model; either reader or writer can be called first. The writer can proceed to write before the reader connects. Once the reader does, he will receive the most recent values written. The reader can also proceed to read before the writer connects. The result will be that none of the channels will get new data until the writer connects. The design also allows for either reader or writer to crash or exit without disturbing his partner. The connection will be reestablished when the partner rejoins.

(Annoying complication: VxWorks is not a secure OS; that is, application crashes often crash the OS, and most of the time an application crash means a reboot. So this implementation has to handle reboots of any host as well.)

Example

For example, the simplest pattern of operation is described below.

I assume that there are two processes running on two separate hosts connected to the Myrinet network. Since each host can have more than one myrinet interface, I assume that each process uses the first unit (unit 0) on its host, and that Host A's unit is given the interface id 0, and Host B's unit is given the interface id 1. The port numbers were chosen to match the interface id, but we could have picked any from 0 to 7.

Writer

{
   Transform xform[100];     /* 100 transform objects, each 48 bytes */
   int cid;                  /* connection id returned by the connect call */
   int err;                  /* error code; examine with msm_err() */

   cid = msm_connect_write(0,       /* unit 0; appears as /hw/myri0 in IRIX */
                           0,       /* port 0 on this unit */
                           1,       /* the remote interface id */
                           1,       /* port 1 on interface 1 */
                           100,     /* 100 channels, 0-99 */
                           48,      /* 48 bytes in a channel */
                           &err);   /* receives the error code on failure */

   while (...) {
      msm_begin_frame(cid);            /* i'm ignoring some return values for now.
                                          they will signal errors.  */

      update(xform);                   /* update all transforms */

      msm_channel_write(cid,
                        0,             /* channel 0 */
                        &xform[0]);    /* write new xform0 to channel 0 */

      msm_channel_write(cid,
                        1,             /* channel 1 */
                        &xform[1]);    /* write new xform1 to channel 1 */

      msm_channel_write(cid,
                        34,            /* channel 34 */
                        &xform[34]);   /* write new xform34 to channel 34 */

      msm_end_frame(cid);              /* only now will the data be seen by
                                          the reader; these three xforms
                                          will be modified simultaneously on
                                          the reader's side.  */
   }

   msm_disconnect(cid);
}

Reader

{
   Transform xform[100];     /* 100 transform objects, each 48 bytes */
   int cid;                  /* connection id returned by the connect call */
   int err;                  /* error code; examine with msm_err() */
   int status;               /* per-channel status from msm_channel_read */
   int i;

   cid = msm_connect_read(0,     /* unit 0 on this host; shows up as /hw/myri0 under IRIX */
                          1,     /* port 1 on this unit */
                          0,     /* expects data from interface id 0 */
                          0,     /* and port 0. */
                          100,   /* 100 channels */
                          48,    /* 48 bytes */
                          &err); /* receives the error code on failure */

   while (...) {

        msm_begin_frame(cid);

        for (i = 0; i < 100; i++) {

            /* if there is new data on a channel,
               copy it into the associated xform. */

            status = msm_channel_read(cid, i, &xform[i]);

            if (status == -1) {
                /* error: return value of -1 means the
                   cid was bad, the channel number was bad,
                   or the connection was not a reader. */
            }
            else if (status & MSM_NEW) {
                printf(" xform %d updated this cycle.\n", i);
            }
        }

        msm_end_frame(cid);

        /* do other work here.
           RESTRICTION: it is important that you do not access the
           channels (using channel_read or check_read) outside of the
           begin/end, because the data is being updated by the myrinet
           interface card.

           Doing other work is important, because the end_ call triggers
           an update of all channels, which may take some time.
           On the next loop, the begin_ call will block until the interface is
           done working.  If you do work here that does not
           involve communication, you maximize the parallelism.
         */
   }

   msm_disconnect(cid);
}

The begin and end functions define the region of code in which it is safe to read or write the channels. If you attempt to read or write channels outside of these 'brackets', you will corrupt your data. They also serve to enforce the boundaries for data consistency: all writer updates between a begin and an end will be seen as one update by the reader.

The channel_read and channel_write functions copy between the area pointed to by their third argument and an internal buffer: channel_write copies from your buffer into the internal buffer, and channel_read copies from the internal buffer into yours. You can reduce the protocol to zero-copy if you use this internal buffer directly. The mark, check, and buffer_address functions facilitate this.

The msm_buffer_address function returns a pointer to this internal buffer. It holds the current values for each channel in order, so it looks like this, where s is the size of one channel in 32-bit words:

                        offset (words)     data

                             0             channel 0, word 0
                             1             channel 0, word 1
                             2             channel 0, word 2
                                .
                                .
                             s-2           channel 0, word s-2
                             s-1           channel 0, word s-1

                             s             channel 1, word 0
                             s+1           channel 1, word 1
                                .
                                .
                             2s-2          channel 1, word s-2
                             2s-1          channel 1, word s-1

                             2s            channel 2, word 0

So, I've just gone overboard explaining something simple. You can write to or read from these locations directly, bypassing the copy.
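The layout reduces to one multiplication. As a minimal sketch, assuming only what is stated above (channels stored back to back, each s words or size_of_each_channel bytes long), the two helpers below compute a word offset and a channel pointer. The names channel_word_offset and channel_ptr are hypothetical; base stands for whatever msm_buffer_address() returned.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in 32-bit words, of word `word` of channel `channel`,
   where words_per_channel is the s of the layout. */
size_t channel_word_offset(size_t channel, size_t word, size_t words_per_channel)
{
    assert(word < words_per_channel);
    return channel * words_per_channel + word;
}

/* Pointer to the first byte of a channel, given the buffer base address
   and the channel size in bytes. */
void *channel_ptr(void *base, int channel, int size_of_each_channel)
{
    assert(channel >= 0 && size_of_each_channel > 0);
    return (unsigned char *)base + (size_t)channel * (size_t)size_of_each_channel;
}
```

With 48-byte channels, s is 12, so channel 1 starts at word 12 and channel 2 at word 24. Casting the base pointer to Transform * and indexing it as an array, as the zero-copy example does, performs the same arithmetic.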

Here is the above example rewritten for zero-copy operation.

Writer

{
   int cid;                  /* connection id returned by the connect call */
   int err;                  /* error code; examine with msm_err() */
   Transform *xform;         /* overlays the internal channel buffer */

   cid = msm_connect_write(0,       /* unit 0 */
                           0,       /* port 0 on this unit */
                           1,       /* the remote interface id */
                           1,       /* port 1 on interface 1 */
                           100,     /* 100 channels, 0-99 */
                           48,      /* 48 bytes in a channel */
                           &err);   /* receives the error code on failure */

   xform = (Transform *)msm_buffer_address(cid);

   while (...) {
      msm_begin_frame(cid);            /* i'm ignoring some return values;
                                          they signal errors.  */

      update(xform);                   /* now the updates occur directly to
                                          the internal buffer */

      /* now inform the reader of changes on xforms 0, 1, and 34 */

      msm_mark_write(cid, 0);          /* mark each channel as written.    */
      msm_mark_write(cid, 1);          /* You have to do this because      */
      msm_mark_write(cid, 34);         /* the api doesn't know you fiddled with the
                                          internal buffer. */

      msm_end_frame(cid);              /* only now will the data be seen by
                                          the reader; all marked xforms
                                          will be modified simultaneously on
                                          the reader's side.  */
   }

   msm_disconnect(cid);
}

Reader

{
   int cid;                  /* connection id returned by the connect call */
   int err;                  /* error code; examine with msm_err() */
   int status;               /* per-channel status from msm_check_read */
   int i;
   Transform *xform;         /* overlays the internal channel buffer */

   cid = msm_connect_read(0,     /* unit 0 */
                          1,     /* port 1 on this unit */
                          0,     /* expects data from interface id 0 */
                          0,     /* and port 0. */
                          100,   /* 100 channels */
                          48,    /* 48 bytes */
                          &err); /* receives the error code on failure */

   xform = (Transform *)msm_buffer_address(cid);

   while (...) {

        msm_begin_frame(cid);

        for (i = 0; i < 100; i++) {

            status = msm_check_read(cid, i);

            if (status == -1) {
                /* error: return value of -1 means the cid
                   was bad, the channel number was bad,
                   or the connection was not a reader. */
            }
            else if (status & MSM_NEW) {
                printf(" xform %d was updated this cycle.\n", i);
            }
        }

        /* do any work involving the buffer (or equivalently
           the xforms) here. */

        msm_end_frame(cid);

        /* do other work here.

           RESTRICTION: it is important that you do not access the
           buffer/xforms outside of the begin/end, because they are
           being updated by the myrinet interface card.
         */
   }

   msm_disconnect(cid);
}

Current Status

Right now only robocop (the IRIX 6.4 Octane) and vxw0 (the VxWorks PPC MV2604) are set up to use the API, but Zan and Jayna (IRIX 6.5 Origin 200) and the O2K could easily be set up to use it.

To use the API under IRIX, include the msmapi.h file and link with the msmapi.o file. (msmapi.c is available as well, but you'd need my whole myrinet distribution to compile it.) All access to the network is memory protected, and all resources are properly reclaimed if a user program crashes. Port usage is granted on a first come, first served basis, as with TCP. Connections persist across a fork, but the API does not serialize calls to any of its functions. A valid connection id can be used by any process in the share group. The user must ensure that only one thread or process at a time calls the begin, end, write, or mark functions. Any number of threads/processes can call the channel_read or check_read functions at once, as long as these calls take place (temporally) between begin and end calls. The port in use will remain in use until the last process using it calls disconnect or exits.

To use it under VxWorks, simply include the msmapi.h file in your code; VxWorks will preload the library after booting. A connection id can be used by any task. Exactly one task must call the disconnect function for resources to be reclaimed properly. The VxWorks implementation cannot provide memory protected access to the network, since we don't have the virtual memory extensions to VxWorks. Also, the VxWorks implementation is not guaranteed to clean up network resources properly if the user code crashes. You should just reboot if your VxWorks code crashes. It is safe to do this even if other hosts are still connected and using the API.

Routing/Hostname Issues

The system supports up to 256 hosts on one myrinet LAN. Each network interface is given an interface id (0 <= id <= 255). The robotics lab has only three interfaces:

    Host/Unit              interface id
    Robocop, unit 0             0
    Robocop, unit 1             1
    vxw0,    unit 0             2

The routing is programmed into the interface cards after the host boots. The routing table is kept in a human-readable file (/home/robotics/msmroutes or /res/robotics/msmroutes).

Performance

The following tests were performed on a two-processor SGI OCTANE equipped with two 512MB LANai 4.1 Myrinet interface cards. The testing process acts as both reader and writer, and runs on one isolated and restricted processor. The values given in the plot below are average latencies for one new channel to become visible to the reader; the minimum latency tends to be ~15% less than this average, and the maximum measured latency was almost always less than three times the average. For comparison, with the same hardware configuration and software test, the Myrinet API achieves a minimum latency of 105 microseconds for a 4 byte packet. The plot compares two versions of the software. The first assumes that the hardware is reliable. The second version ensures reliable delivery. The plot shows that ensuring reliable delivery in software costs about 4 microseconds per packet.

The next test used 128 channels of 48 bytes each. The writer sent new data on a fixed number of channels each frame. The plot shows the latency as a function of the number of new channels available each frame for both versions of the software.

