Research

USIMM: the Utah Simulated Memory Module

I led the development effort for USIMM, a cycle-accurate DRAM simulator developed at the Utah Arch lab. The simulator has a trace-based design and a simple out-of-order core model, and it can simulate multi-programmed and multi-threaded workloads on multi-core systems. The simulator models all DRAM commands and timing constraints, as well as the different DRAM power modes. USIMM has a clean interface that allows quick development and testing of various memory scheduling policies. The simulator was publicly distributed for the 3rd JILP Workshop on Computer Architecture Competitions (the Memory Scheduling Championship), held with ISCA in Portland in June 2012, and served as the common infrastructure for competing groups from many different universities and from the DRAM industry. The simulator, coupled with SIMICS, has been used by our group for our MICRO-2012 paper and by the computer architecture group at Georgia Tech for a recent HPCA-2013 paper. If you are on the lookout for an easy-to-use DRAM simulator, please check out USIMM.
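
To give a flavor of what a scheduling policy looks like, here is a minimal, self-contained sketch of a baseline read-over-write FCFS policy with a write-queue high watermark. Every name below (request_t, schedule, the watermark constant) is a hypothetical stand-in, not USIMM's actual API; the real plug-in interface ships with the simulator distribution.

    /* Hypothetical mock-up of a scheduling policy; not USIMM's real API. */
    #include <stdio.h>
    #include <stdbool.h>

    #define WQ_HIGH_WATERMARK 40  /* start draining writes past this depth */

    typedef struct { int id; bool is_write; bool issued; } request_t;

    /* Issue the oldest non-issued request of the given kind, FCFS order. */
    static bool issue_oldest(request_t *q, int n, bool writes) {
        for (int i = 0; i < n; i++) {
            if (!q[i].issued && q[i].is_write == writes) {
                q[i].issued = true;
                printf("issuing %s request %d\n", writes ? "write" : "read", q[i].id);
                return true;
            }
        }
        return false;
    }

    /* Called once per memory cycle: prefer reads, but drain writes when
     * the write queue gets too full (a common baseline policy). */
    static void schedule(request_t *q, int n, int write_queue_depth) {
        if (write_queue_depth > WQ_HIGH_WATERMARK && issue_oldest(q, n, true))
            return;
        if (!issue_oldest(q, n, false))   /* no read ready... */
            issue_oldest(q, n, true);     /* ...so fall back to a write */
    }

    int main(void) {
        request_t q[] = {{0, true, false}, {1, false, false}, {2, false, false}};
        for (int cycle = 0; cycle < 3; cycle++)
            schedule(q, 3, /*write_queue_depth=*/1);
        return 0;
    }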

Architecting a Heterogeneous DRAM System

The DRAM main memory system in modern servers is largely homogeneous. In recent years, however, DRAM manufacturers have produced chips with vastly different latency and energy characteristics. This provides the opportunity to build a heterogeneous main memory system where different parts of the address space yield different latencies and energy per access. The limited prior work in this area has explored smart placement of frequently accessed pages. In this work, we propose a novel alternative that exploits DRAM heterogeneity. We observe that the critical word in a cache line can be easily identified in advance and placed in a low-latency region of main memory built with Reduced Latency DRAM (RLDRAM), while the non-critical words of the cache line are placed in a low-energy region built with Low-Power DRAM (LPDDR). We design a low-complexity architecture that can accelerate the transfer of the critical word by tens of cycles. For our benchmark suite, we show an average performance improvement of 12.9% and an accompanying system energy reduction of 6%.
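
As an illustration of the placement, here is a small sketch of how a cache line could be split across the two regions. It assumes 64-byte lines, 8-byte words, word 0 as the critical word, and made-up region base addresses; the actual mapping in the paper is more involved.

    /* Illustrative split placement; bases and layout are assumptions. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES  64
    #define WORD_BYTES  8

    #define RLDRAM_BASE 0x00000000ULL  /* low-latency region (critical words) */
    #define LPDDR_BASE  0x40000000ULL  /* low-energy region (remaining words) */

    /* Physical address of the critical word of a cache line. */
    uint64_t critical_word_addr(uint64_t line_num) {
        return RLDRAM_BASE + line_num * WORD_BYTES;
    }

    /* Physical address of non-critical word w (1..7) of a cache line. */
    uint64_t noncritical_word_addr(uint64_t line_num, int w) {
        return LPDDR_BASE + line_num * (LINE_BYTES - WORD_BYTES)
                          + (uint64_t)(w - 1) * WORD_BYTES;
    }

    int main(void) {
        uint64_t line = 12345;
        printf("critical word -> 0x%llx\n",
               (unsigned long long)critical_word_addr(line));
        printf("word 3        -> 0x%llx\n",
               (unsigned long long)noncritical_word_addr(line, 3));
        return 0;
    }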

Handling DRAM Writes

Given that reads are on the critical path for CPU progress, DRAM reads must be prioritized over DRAM writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single rank in the main memory system offers very little parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads (targeting the same DRAM chip). This is an expensive operation and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched.As a result, no reads can be processed while a memory channel is busy servicing writes. In our work, if some of the banks are busy servicing writes, we start issuing reads to the other idle banks. The results of these reads are stored in a few registers near the memory chip's I/O pads. These results are quickly returned immediately following the bus turnaround. The process is referred to as Staged Read because it decouples a single read operation into two stages, with the first step being performed in parallel with writes. This innovation can also be viewed as a form of prefetch that is internal to a memory chip.
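
The control flow can be sketched as follows. The bank states, register counts, and function names are simplified mock-ups meant only to show the two-stage decoupling, not the paper's exact hardware.

    /* Simplified two-stage control flow of a Staged Read. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BANKS   8
    #define NUM_SR_REGS 4   /* staged-read registers near the I/O pads */

    static bool bank_busy_with_writes[NUM_BANKS];
    static int  sr_reg_count;   /* how many staged results are waiting */

    /* Stage 1: while the channel drains writes, read idle banks into the
     * staged-read registers instead of stalling those reads entirely. */
    void stage1_issue_reads_during_writes(const int *pending_read_bank, int n) {
        for (int i = 0; i < n && sr_reg_count < NUM_SR_REGS; i++) {
            int b = pending_read_bank[i];
            if (!bank_busy_with_writes[b]) {
                sr_reg_count++;   /* data moves bank -> staged-read register */
                printf("staged read from idle bank %d\n", b);
            }
        }
    }

    /* Stage 2: once the bus turns around, drain the registers onto the
     * data bus back-to-back, far sooner than the banks could have
     * serviced these reads from scratch. */
    void stage2_drain_after_turnaround(void) {
        while (sr_reg_count > 0) {
            sr_reg_count--;
            printf("returning staged data over the bus\n");
        }
    }

    int main(void) {
        bank_busy_with_writes[0] = bank_busy_with_writes[1] = true;
        int reads[] = {0, 3, 5};   /* banks targeted by pending reads */
        stage1_issue_reads_during_writes(reads, 3);
        stage2_drain_after_turnaround();
        return 0;
    }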

Managing DRAM Overfetch

Several studies have attributed about 25-40% of the power consumed in large datacenters to the main memory (DRAM) subsystem. Thus, besides the traditional memory latency wall, we now have to contend with a memory power wall. The decreased locality of memory accesses (due to multi-core processors) has turned the large bank-level row-buffers of traditional DRAMs into a major overhead. On each memory request, a large amount of data is read into the row-buffers, but only a small fraction is actually utilized. We try to mitigate this overfetch problem to increase DRAM efficiency.
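
A back-of-the-envelope calculation shows how severe this can be, assuming a typical 8KB row (e.g., eight x8 chips, each contributing a 1KB row) and a single 64B cache line used per activation; the numbers are illustrative.

    /* Row-buffer utilization under the assumptions stated above. */
    #include <stdio.h>

    int main(void) {
        double row_bytes  = 8 * 1024;   /* assumed row size per rank */
        double used_bytes = 64;         /* one cache line actually used */
        printf("row-buffer utilization: %.2f%%\n",
               100.0 * used_bytes / row_bytes);
        /* prints: row-buffer utilization: 0.78% */
        return 0;
    }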

Power-Aware DRAM Design

We look at a fundamental redesign of the DRAM architecture and the data layout scheme. We try to reduce the amount of overfetch on each access (by activating fewer bitlines and fewer memory arrays) and thereby save dynamic power.
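
A first-order model of the potential savings: dynamic activation energy scales roughly with the number of mats (sub-arrays) whose bitlines swing on an ACTIVATE. The constants below are illustrative placeholders, not measured values.

    /* First-order activation-energy model under selective activation. */
    #include <stdio.h>

    int main(void) {
        const int    total_mats   = 16;    /* mats touched by a full-row ACT */
        const double e_per_mat_nj = 0.1;   /* assumed energy per activated mat */

        for (int mats = total_mats; mats >= 1; mats /= 4)
            printf("activate %2d/%d mats -> %.2f nJ per ACT\n",
                   mats, total_mats, mats * e_per_mat_nj);
        return 0;
    }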

Energy- and Performance-Optimized DRAM Design

The problem of overfetch is tied to the traditional method of mapping OS pages to DRAM row-buffers. In a multi-core environment, accesses to a page tend to cluster within a few cache blocks of that page. We try to co-locate such frequently accessed page fragments in a DRAM row-buffer to increase row-buffer reuse, which aids both performance and power efficiency.
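
A sketch of the bookkeeping this requires: count accesses per cache-block-sized fragment and, at each epoch boundary, remap the hottest fragments into a reserved row so that later accesses reuse one open row-buffer. The remap table, threshold, and epoch scheme here are hypothetical simplifications.

    /* Hypothetical epoch-based hot-fragment co-location sketch. */
    #include <stdio.h>

    #define NUM_FRAGS  32
    #define HOT_THRESH 100
    #define HOT_ROW    0      /* reserved row that gathers hot fragments */

    static unsigned access_count[NUM_FRAGS];
    static int      remap_row[NUM_FRAGS];   /* fragment -> DRAM row */

    /* At each epoch boundary, migrate frequently touched fragments into
     * the reserved hot row; leave cold fragments in their home rows. */
    void end_of_epoch_remap(void) {
        for (int f = 0; f < NUM_FRAGS; f++) {
            remap_row[f]    = (access_count[f] > HOT_THRESH) ? HOT_ROW : 1 + f;
            access_count[f] = 0;   /* reset counters for the next epoch */
        }
    }

    int main(void) {
        access_count[3] = 500;     /* pretend fragment 3 was hot this epoch */
        end_of_epoch_remap();
        printf("fragment 3 -> row %d, fragment 4 -> row %d\n",
               remap_row[3], remap_row[4]);
        return 0;
    }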

Hardware Acceleration for MPI Primitives on CMPs

Processors with a large number of cores will be very attractive to the HPC community. In the foreseeable future, a large body of legacy code based on the Message Passing Interface (MPI) will run on these many-core machines. However, these processors are optimized for shared-memory programs, so MPI programs must use shared memory as the underlying mechanism for message passing, which is clearly inefficient. Our work tries to eliminate the characteristics of shared memory that hinder MPI programs and uses the growing transistor budget of future processors to provide architectural support for accelerating MPI communication.
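
The software baseline being attacked can be sketched as below: a send/receive pair typically bounces the payload through a shared buffer, costing two copies plus the coherence traffic those copies generate. This mock-up shows only that baseline; the hardware acceleration itself is not modeled here, and the function names are hypothetical.

    /* Shared-memory double-copy baseline for an MPI-style transfer. */
    #include <string.h>
    #include <stdio.h>

    #define MSG_BYTES 256
    static char shared_bounce_buf[MSG_BYTES];   /* stands in for a shared segment */

    void mpi_send_sketch(const char *src) {
        memcpy(shared_bounce_buf, src, MSG_BYTES);   /* copy #1: sender -> shared */
    }

    void mpi_recv_sketch(char *dst) {
        memcpy(dst, shared_bounce_buf, MSG_BYTES);   /* copy #2: shared -> receiver */
    }

    int main(void) {
        char src[MSG_BYTES] = "hello", dst[MSG_BYTES];
        mpi_send_sketch(src);
        mpi_recv_sketch(dst);
        printf("received: %s\n", dst);
        return 0;
    }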