Hari Sundar


One-sided communications in MPI 2.0 open up a lot of possibilities for simplifying parallel code and potentially improving performance. In this post I shall consider one such case and explain the benefits of using one-sided communications over typical two-sided constructs. As always, the effectiveness and efficiency of these constructs depends on the underlying MPI implementation. I shall use the MPI C interface in this post, and I shall test the performance of one-sided communications against traditional approaches on Stampede and Titan.

The one-sided constructs require the use of abstract objects called windows, which specify regions of a process's memory that have been made available for remote operations by other MPI processes. Windows are created using a collective operation (MPI_Win_create) called by all processes within an MPI_Comm, in which each process specifies a base address and length. The one-sided communication constructs are MPI_Put, MPI_Get and MPI_Accumulate. MPI_Put writes (puts) data into a memory window on a remote process, MPI_Get reads (gets) data from a memory window on a remote process, and MPI_Accumulate combines data, using a user-specified reduction operation, into a memory window on a remote process.
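As a quick illustration, here is a minimal sketch (variable names are my own, and MPI_Init is assumed to have been called already) in which each process exposes a single integer through a window and reads its right neighbor's value with MPI_Get:

int rank, npes;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &npes);

int local = rank;   /* value exposed to remote processes    */
int remote = -1;    /* will hold the right neighbor's value */
MPI_Win win;

/* collective: each process exposes one int; displacement unit is sizeof(int) */
MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_fence(0, win);                                        /* open the access epoch */
MPI_Get(&remote, 1, MPI_INT, (rank + 1) % npes, 0, 1, MPI_INT, win);
MPI_Win_fence(0, win);                                        /* 'remote' is now valid */

MPI_Win_free(&win);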

Case 1: Communicating Send Counts

First let us consider a common scenario in parallel programming: each process needs to communicate with an arbitrary number (k) of other processes. Each process can locally compute which processes it needs to send to and how much (count, datatype), but because of the two-sided matching requirement, we also need to figure out how much each process will receive and from whom. This is usually done using an MPI_Alltoall in which each process sends one integer to every other process, recording how much it will send to that process. This can be followed by either an MPI_Alltoallv or point-to-point communications, depending on the density of the exchange (a sketch of the sparse point-to-point variant follows the code below). If the number of processes that each process sends to, k, is small, one would expect using MPI_Put to be more efficient than MPI_Alltoall at large process counts. The implementation using MPI_Alltoall is very simple; here using random processes and counts,


/* zero the send counts, then pick k random partners with random counts */
for (size_t i = 0; i < npes; ++i) {
  send_counts[i] = 0;
}
for (size_t i = 0; i < k; ++i) {
  send_counts[rand_r(&seed) % npes] = rand_r(&seed);
}

MPI_Alltoall(send_counts, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);
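As mentioned above, once the counts are known the actual exchange can proceed. For the sparse case, a minimal point-to-point sketch might look like the following; send_bufs and recv_bufs are hypothetical per-partner buffers sized according to the counts:

MPI_Request *reqs = (MPI_Request*) malloc(2 * npes * sizeof(MPI_Request));
int nreq = 0;
for (int p = 0; p < npes; ++p) {
  if (recv_counts[p] > 0)  /* we expect recv_counts[p] ints from rank p */
    MPI_Irecv(recv_bufs[p], recv_counts[p], MPI_INT, p, 0, comm, &reqs[nreq++]);
}
for (int p = 0; p < npes; ++p) {
  if (send_counts[p] > 0)  /* we owe send_counts[p] ints to rank p */
    MPI_Isend(send_bufs[p], send_counts[p], MPI_INT, p, 0, comm, &reqs[nreq++]);
}
MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
free(reqs);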

The corresponding count-exchange code using MPI_Put is a bit more involved, due to the need to create an MPI_Win.


/* each process zeroes and exposes its recv_counts array (npes ints) as a window */
for (size_t i = 0; i < npes; ++i) {
  recv_counts[i] = 0;
}
MPI_Win_create(recv_counts, npes * sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);

for (size_t i = 0; i < k; i++) {
  int partner = rand_r(&seed) % npes;
  send_counts[partner] = rand_r(&seed);
  /* write our count into slot 'rank' of the partner's recv_counts; we put from
     send_counts so the origin buffer stays untouched until the closing fence */
  MPI_Put(&send_counts[partner], 1, MPI_INT, partner, rank, 1, MPI_INT, win);
}

MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED, win);
MPI_Win_free(&win);
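A note on the asserts passed to MPI_Win_fence: MPI_MODE_NOPRECEDE tells the implementation that the fence does not complete any locally issued RMA operations (none were issued before it), MPI_MODE_NOSTORE says the local window was not updated by local stores since the last synchronization, and MPI_MODE_NOSUCCEED says no further RMA calls will follow the fence. Passing 0 instead is always correct; the asserts are optional hints that may let the implementation skip unnecessary synchronization.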