Abstract
Power consumption and DRAM latencies are serious concerns in modern
chip-multiprocessor (CMP or multi-core) based compute systems. The
management of the DRAM row buffer can significantly impact both power
consumption and latency. Modern DRAM systems read data from cell
arrays and populate a row buffer as large as 8 KB on a memory request.
But only a small fraction of these bits is ever returned to the
CPU. This wastes the energy and time spent reading (and subsequently
writing back) bits that are rarely used. Traditionally, an open-page
policy has been used in uniprocessor systems, and it has worked well
because of spatial and temporal locality in the access stream. In
future multi-core processors, the possibly independent access streams
of each core are interleaved, thus destroying the available locality
and significantly under-utilizing the contents of the row buffer. In
this work, we attempt to improve row-buffer utilization for future
multi-core systems.
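To make the interleaving effect concrete, the following toy simulation
counts row-buffer hits when two sequential streams are interleaved. It
is a sketch under assumed parameters (a single bank, 8 KB rows, 64 B
cache blocks, an open-page controller, and two cores streaming through
disjoint address ranges); none of these specifics are taken from the
paper's evaluated configuration.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical geometry, for illustration only. */
#define ROW_BYTES   8192u
#define BLOCK_BYTES 64u

/* One DRAM bank under an open-page policy: the last-activated row stays
 * in the row buffer, so a request to that row is a hit; anything else
 * forces a precharge + activate (a row-buffer miss). */
typedef struct { int64_t open_row; } bank_t;

static int access_open_page(bank_t *b, uint64_t addr) {
    int64_t row = (int64_t)(addr / ROW_BYTES);
    int hit = (row == b->open_row);
    b->open_row = row;          /* leave the newly activated row open */
    return hit;
}

int main(void) {
    bank_t bank = { .open_row = -1 };
    /* Two cores streaming sequentially through disjoint regions; the
     * memory controller sees their requests interleaved. */
    uint64_t core0 = 0x00000, core1 = 0x40000;
    int hits = 0, total = 0;
    for (int i = 0; i < 64; i++) {
        hits += access_open_page(&bank, core0); core0 += BLOCK_BYTES; total++;
        hits += access_open_page(&bank, core1); core1 += BLOCK_BYTES; total++;
    }
    printf("row-buffer hit rate: %d/%d\n", hits, total);
    return 0;
}
```

Every request lands in a different row than the one left open, so the
hit rate collapses to zero. Running either stream alone would instead
hit on 127 of every 128 block requests to a row, which is the
uniprocessor behavior that makes the open-page policy attractive.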
The schemes presented in this work are motivated by our observations
that a large number of accesses within heavily accessed OS pages are
to small, contiguous "chunks" of cache blocks. Thus, the co-location
of chunks (from different OS pages) in a row buffer will improve the
overall utilization of the row buffer contents, and consequently
reduce memory energy consumption and access time. Such co-location can
be achieved in many ways, notably through a reduction in OS page size
and software- or hardware-assisted migration of data within DRAM.
We explore these mechanisms and discuss the trade-offs involved along
with energy and performance improvements from each scheme. On
average, for applications with room for improvement, our
best-performing scheme increases performance by 9% (max. 18%) and reduces
memory energy consumption by 15% (max. 70%).
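As a rough illustration of the co-location idea, the sketch below
models a hypothetical indirection table that remaps fixed-size chunks
so that hot chunks from different OS pages land in the same DRAM row.
The chunk size, table organization, and choice of which chunks to
migrate are all illustrative assumptions, not the mechanisms evaluated
in the paper.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical sizes: 1 KB chunks carved out of 4 KB OS pages,
 * co-located eight to an 8 KB DRAM row. */
#define CHUNK_BYTES    1024u
#define ROW_BYTES      8192u
#define NUM_CHUNKS     32u   /* size of the toy remap table */

/* Indirection table consulted on every memory request: it maps a
 * chunk's original slot to its current slot, so hot chunks from
 * different OS pages can share one row. */
static uint32_t remap[NUM_CHUNKS];

static uint64_t translate(uint64_t addr) {
    uint64_t chunk  = (addr / CHUNK_BYTES) % NUM_CHUNKS;
    uint64_t offset = addr % CHUNK_BYTES;
    return (uint64_t)remap[chunk] * CHUNK_BYTES + offset;
}

int main(void) {
    for (uint32_t i = 0; i < NUM_CHUNKS; i++) remap[i] = i; /* identity */

    /* Migrate four hot chunks, originally the first chunk of four
     * different 4 KB OS pages (chunks 0, 4, 8, 12), into row 0's first
     * four slots, swapping with the displaced chunks to keep the
     * mapping one-to-one. */
    uint32_t hot[4] = {0, 4, 8, 12};
    for (uint32_t slot = 0; slot < 4; slot++) {
        uint32_t displaced = remap[slot];
        remap[slot]      = remap[hot[slot]];
        remap[hot[slot]] = displaced;
    }

    for (int i = 0; i < 4; i++) {
        uint64_t a = (uint64_t)hot[i] * CHUNK_BYTES;
        printf("chunk %2u -> row %llu\n", hot[i],
               (unsigned long long)(translate(a) / ROW_BYTES));
    }
    return 0;
}
```

After migration, all four hot chunks resolve to row 0, so a burst of
accesses to them exercises a single open row instead of activating
four different rows.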
(To appear in the proceedings of the 15th International Conference on
Architectural Support for Programming Languages and Operating Systems
(ASPLOS-XV), Pittsburgh, March 2010.)