Exploring a Brink-of-Failure Memory Controller to Design an Approximate Memory System

Meysam Taassori  Niladrish Chatterjee†  Ali Shafiee  Rajeev Balasubramonian
University of Utah, † NVIDIA

Abstract

Nearly every synchronous digital circuit today is designed with timing margins. These timing margins allow the circuit to behave correctly in spite of parameter variations, voltage noise, temperature fluctuations, etc. Given that the memory system is a critical bottleneck in several workloads, this paper attempts to safely push memory performance to its limits by dynamically shaving the timing margins inherent in memory devices. This is implemented with an adaptive memory controller that maintains timing parameters for every bank and gradually pushes the memory system towards the brink of failure. Each bank may be handled differently, depending on the extent of parameter variation. Such an approach may lead to occasional run-time errors. Additional support for ECC or chipkill may help the memory system recover from errors that are introduced by an overly aggressive memory controller. This is a much stronger capability than the limited form of memory over-clocking that can be deployed today. We believe that such a brink-of-failure memory controller can form the basis for an approximate memory system. Memory timing parameters can be easily adapted per memory region or per memory operation, enabling easy tuning of the performance-precision trade-off for approximate computing workloads. The preliminary analysis in this paper serves as a limit study to understand the impact of memory timing parameters on application throughput.

1 Introduction

Commercial computer systems are designed to provide reasonably high levels of reliability. The commercial viability of a system that crashes every day is zero. Hence, systems are over-provisioned in many regards so they are not operating on the brink of failure. For example, power supplies on a board or power delivery networks on a chip can handle a little more than the maximum expected power draw. In a similar vein, nearly every synchronous digital circuit on a processor or memory chip today is designed with timing margins. These timing margins allow the circuit to behave correctly in spite of parameter variations, voltage noise, temperature fluctuations, etc.

There is however one niche market segment that operates the system near the brink of failure to eke out very high performance. Gaming enthusiasts frequently resort to processor and memory “over-clocking” [20, 40, 52]. Processor and memory vendors expose a few parameters that can be set at boot-up time in the BIOS to allow a system to operate at frequencies higher than those in the specifications. For example, memory over-clocking today allows a change to the memory bus frequency, the DIMM voltage, and three DRAM timing parameters (tRP, tRCD, tCL) [52]. This is an effective coarse-grained approach to shrink timing margins and boost performance, while trading off some reliability.

In this work, we attempt to bring the brink-of-failure (BOF) approach to mainstream computer systems in a safe and architecturally controlled manner, with a primary focus on the main memory system. We propose an adaptive memory controller that can use more aggressive timing parameters for various regions of memory or for specific memory instructions. The memory system can also be augmented with some form of error detection/correction support to track error rates and recover from errors when possible. A more aggressive memory controller yields higher performance while introducing a few errors. The timing parameters are adjusted based on observed error rates and application requirements. This enables fine-grained control of the memory system and the performance-precision trade-off.

The proposed BOF memory controller has two primary applications. It serves as an important component in an approximate computing system [44]. It also helps extract the highest performance possible from memory systems that suffer from large parameter variations [29].

This project is at a very early stage. As a first step, this paper quantifies the impact of various timing parameters on application performance, thus showing the potential room for improvement. We believe that this is an important area for future work, requiring support from the hardware, operating system, programming models, and applications to fully realize its potential.

2 Background

In modern server systems, a single processor socket has up to four memory controllers. Each memory controller drives a single DDR (double data rate) memory channel that is connected to one or two DIMMs. DRAM chips on a DIMM are organized into ranks; memory controller commands are sent to all DRAM chips in one rank and all the DRAM chips in a rank together provide the requested data.

**DRAM Timing Parameters**

A DRAM chip has little logic or intelligence; for the most part, it simply responds to the commands received from the memory controller. The memory controller has to keep track of DRAM state and issue commands at the right times.

Memory is designed to be an upgradeable commodity. When a DIMM fails or when more memory capacity is required, the user can pull out a DIMM from the motherboard and replace it with another new DIMM. A single memory controller must work correctly with all possible DIMMs. JEDEC is an industry consortium that specifies a number of standard timing parameters (e.g., \(t_{\text{RC}}\), \(t_{\text{RP}}\), \(t_{\text{RFC}}\)) that govern every memory device. These specifications are referred to as the JEDEC standard. Processor companies then design memory controllers that manage each of these timing parameters. When a DIMM is plugged in, the memory controller reads in the values for each timing parameter from the DIMM. These values are then used to appropriately schedule commands to the DIMM. For example, the JEDEC standard specifies that \(t_{\text{RFC}}\) is the time taken to perform one refresh operation. After issuing a refresh operation, a memory controller is designed to leave that memory device alone for an amount of time equaling \(t_{\text{RFC}}\). Some DIMMs have a \(t_{\text{RFC}}\) of 160 ns, some have a \(t_{\text{RFC}}\) of 300 ns; this is determined by reading a register on the DIMM at boot-up time. There are more than 20 such timing parameters.

The JEDEC standard has been defined to facilitate easy adoption by all memory and processor vendors. It therefore errs on the side of being simple, while potentially sacrificing some performance. For the refresh example above, in reality, the memory controller can safely schedule some memory operations as the refresh operation is winding down, i.e., the memory controller can resume limited operation before the end of \(t_{\text{RFC}}\). JEDEC could have specified a number of additional timing parameters to capture this phenomenon and boost performance. But putting this in the JEDEC standard could complicate every memory controller, even the simplest ones on embedded devices.

**DRAM Commands and Microarchitecture**

The memory system offers a high degree of parallelism. Each channel can perform a cache line transfer in parallel. Each DRAM chip is itself partitioned into eight independent banks. Therefore, the collection of DRAM chips that form a rank can concurrently handle eight different operations. To fetch data from a bank, the memory controller first issues an Activate operation – this fetches an entire row of data into the bank’s row buffer (a collection of sense-amps). The memory controller then issues a Column-Read command to fetch a 64-byte cache line from the row buffer. The memory controller can also fetch other cache lines in the row buffer with additional Column-Read commands; these low-latency operations are referred to as row buffer hits. Before accessing a different row of data in the same bank, the memory controller has to first issue a Precharge command that clears the row buffer and readies the bitlines for the next Activate command.
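
The command sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not the controller logic of any real product: the function names are hypothetical, the cycle counts for tRP, tRCD, and tCAS are the baseline values from Table 1, and the latency estimate deliberately ignores data transfer time, queuing, and every other timing constraint.

```python
from typing import List, Optional

# Baseline values in memory cycles, taken from Table 1.
T_RP, T_RCD, T_CAS = 11, 11, 11


def commands_for_read(open_row: Optional[int], target_row: int) -> List[str]:
    """Return the DRAM commands needed to read one cache line from a bank.

    open_row is the row currently held in the row buffer, or None if the
    bank has already been precharged.
    """
    if open_row == target_row:
        # Row buffer hit: the row is already in the sense-amps.
        return ["COLUMN_READ"]
    if open_row is None:
        # Bank is precharged: bring the row in, then read.
        return ["ACTIVATE", "COLUMN_READ"]
    # A different row is open: clear the row buffer first.
    return ["PRECHARGE", "ACTIVATE", "COLUMN_READ"]


def approx_read_latency(open_row: Optional[int], target_row: int) -> int:
    """Simplified latency estimate (assumption: one delay per command)."""
    cost = {"PRECHARGE": T_RP, "ACTIVATE": T_RCD, "COLUMN_READ": T_CAS}
    return sum(cost[c] for c in commands_for_read(open_row, target_row))


print(approx_read_latency(7, 7))     # row buffer hit: 11
print(approx_read_latency(None, 7))  # precharged bank: 22
print(approx_read_latency(3, 7))     # different row open: 33
```

Even under this simplified model, a request to a bank with a different open row takes three times as long as a row buffer hit, which is why schedulers try to maximize row buffer hits.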

**Memory Controllers**

When there’s a miss in the processor’s last level cache, memory transactions are queued at the memory controller. The memory controller maintains separate read and write queues. Reads are given higher priority. Writes are drained when the write queue size exceeds a high water mark. The memory controller has a sophisticated scheduler that re-orders the transactions in the queue to improve response times and throughput. The scheduler attempts to maximize row buffer hits and bank utilizations, hide refreshes and writes, and prioritize high-throughput threads while maintaining fairness. Once a request is selected, it is decomposed into the necessary Precharge, Activate, Column-Read sequence and the commands are issued as long as no timing parameter is violated.
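
The write-drain behavior can be illustrated with a small state machine. The sketch below is an assumption-laden illustration: the class name is hypothetical, the 40/20 water marks are the values listed in Table 1, and stopping the drain once the queue falls to the low water mark is our reading of how a two-mark scheme typically works, since the text above only states when draining starts.

```python
class WriteDrainPolicy:
    """Decide whether the controller should currently be draining writes."""

    def __init__(self, high: int = 40, low: int = 20) -> None:
        self.high = high        # start draining above this queue length
        self.low = low          # assumed: stop draining at or below this length
        self.draining = False

    def update(self, write_queue_len: int) -> bool:
        """Update the drain state for the current write queue length."""
        if write_queue_len > self.high:
            self.draining = True
        elif write_queue_len <= self.low:
            self.draining = False
        # Between the two marks the previous state is kept, so many
        # writes are drained back to back once a drain has started.
        return self.draining


policy = WriteDrainPolicy()
for length in (35, 41, 30, 20, 30):
    print(length, policy.update(length))
```

With the queue lengths shown, the policy prioritizes reads at 35, starts draining at 41, keeps draining at 30, stops at 20, and stays in read mode when the queue grows back to 30.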

**Memory Reliability**

Most server memory systems include support to detect and recover from hard and soft errors in memory. Such servers typically use ECC-DIMMs, where the DIMM is provisioned with additional DRAM chips that store error correction codes. The most common error correction code is SEC-DED (single error correct, double error detect). SEC-DED adds an eight-bit code to every 64-bit data word that enables recovery from a single bit failure in that word. To handle failure of multiple bits in a 64-bit data word, stronger codes are required. One example is chipkill, which allows recovery from complete failure in one DRAM chip. Early chipkill implementations [1, 15, 37] incurred high overheads. More recently, new chipkill algorithms and data layouts have been developed that greatly reduce these overheads [24, 25, 53, 56].
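
The eight-bit figure follows from the standard Hamming bound: a single-error-correcting code over m data bits needs the smallest r with 2^r ≥ m + r + 1, and one extra overall parity bit adds double-error detection. The short sketch below checks this arithmetic; the function name is hypothetical and the code only counts check bits, it does not encode or decode data.

```python
def secded_check_bits(data_bits: int) -> int:
    """Number of check bits for a Hamming-based SEC-DED code."""
    r = 0
    # Smallest r such that 2^r >= data_bits + r + 1 (single error correction).
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One additional overall parity bit gives double error detection.
    return r + 1


bits = secded_check_bits(64)
print(bits)              # 8
print(bits / 64 * 100)   # 12.5 (storage overhead in percent)
```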

**Parameter Variation**

It is well-known that parameter variation grows as device dimensions shrink [39]. Over the last decade, a number of papers have proposed techniques to cope with parameter variations in the processor and SRAM caches (e.g., [4, 7, 13, 17, 32, 34, 51]). Various schemes have been used to detect at run-time that timing margins in processors have shrunk and that voltage/frequency adjustments are required. For example, Razor [17] uses a delayed latch, Lefurgy et al. [32] monitor critical paths, and Bacha and Teodorescu [7] use functional unit ECCs to detect timing violations. Few papers have examined parameter variation in DRAM even though it has been identified as a significant problem [6, 16, 21, 27, 29, 33, 36, 57].

Parameter variations in logic and DRAM processes are typically attributed to changes in the effective gate length that are caused by systematic lithographic aberrations, and changes in threshold voltage that are caused by random doping fluctuations [6, 39, 57]. Most prior works on DRAM parameter variation have focused on how it impacts cell retention time [21, 27, 33, 36]. Correspondingly, techniques have been proposed to refresh different rows at different rates [35, 36]. Wilkerson et al. [55] lower the refresh rate in eDRAM caches and overcome errors in weak cells with strong error correction codes. The work of Zhao et al. [57] develops a parameter variation model for 3D-stacked DRAM chips and quantifies the expected variation in leakage and latency in different banks. They show that parameter variation can cause most banks to have data read latencies that fall in the range of 12-26 cycles. The authors then propose a non-uniform latency 3D-stacked DRAM cache. With the variation profile described above, a commodity DRAM chip today would have to specify a conservative uniform latency of 30 cycles even though several requests can be serviced in half the time. Another study with an older DRAM process shows that latencies in different banks can vary by 18 ns because of parameter variation [30]. A more recent paper from Hynix shows that for 450 DRAM samples, the delay variation for circuits within a single wafer is almost 30% [29]. The delay variation grows as voltages shrink, implying that future technologies will likely see more variations.

The above data points argue for more intelligent memory controllers that can automatically detect and exploit variations in DRAM timing parameters. In addition to the manufacturing variations described above, there are other sources of predictable and unpredictable variations at runtime because of the operating environment. Timing parameters vary predictably as temperature changes. The timing for some operations may vary unpredictably because of voltage supply noise. Some voltage fluctuations are predictable, either caused by static IR-drop [48] or dynamic LdI/dt [26]. The timing for every operation can also vary based on other simultaneous activities on the DRAM chip because some critical resources are shared (most notably, charge pumps and the power delivery network [48]).

**Approximate Computing**

In recent years, multiple papers [8–10, 14, 18, 31, 38, 42–45, 49, 50, 54] have made the argument that applications in certain domains like media/image processing, computer vision, machine learning, etc. can tolerate a small degree of imprecision. Typically, imprecision can be tolerated in some data structures, but not in the control logic [8, 18, 44]. The underlying hardware exploits the energy/precision or performance/precision trade-off, although most prior work has exploited the former. A recent example, Truffle [18], splits the processor into an instruction control plane and a data movement/processing plane. The former has to be precise, while the latter can be selectively imprecise depending on the instruction type. A compiler (e.g., EnerJ [44]) is expected to designate instructions as being approximate or not. The processor operates at a single frequency, but the data movement/processing plane can operate at a high or low voltage. At low voltage, the pipeline consumes significantly lower energy, but may yield occasional errors because the circuits are slower and may not meet the cycle time deadlines.

While prior work has examined the microarchitecture for approximate processing and SRAM caches [18], support for approximate memory is more limited. Flikker [50] relies on the application to identify memory pages that can tolerate imprecision; such pages are refreshed less frequently to save energy. A more recent paper by Sampson et al. [45] assumes an MLC PCM main memory and selectively uses a less precise write process or allocates approximate data to rows that are riddled with hard errors. It thus trades off precision for faster writes and better endurance. Apart from Flikker [50], no prior work has examined an approximate DRAM memory system by varying DRAM timing parameters.

3 Proposal

We next define a few essential pieces that will together form the BOF memory system.

**Organizing a Rank.** Every DRAM chip on a DIMM is expected to have similar worst-case timing parameters. However, a number of banks within each DRAM chip may be able to operate faster. The latency to fetch a cache line from bank-3 is determined by the latency of the slowest bank-3 among the DRAM chips that form a rank. This requires that ranks be created in a manner that is aware of parameter variations on each chip. A chip with a fast bank-3 should be ganged with other DRAM chips with fast bank-3 to form a rank. Building a rank with few DRAM chips helps reduce the negative impact of a slow chip. This argues for using wider-IO DRAM chips or mini-ranks [5, 58]. Run-time re-organization of DRAM chips into ranks may also be possible with a smart buffer chip on the DIMM. The above techniques will help create a memory system that exhibits high non-uniformity – some banks that operate at the typical specified speed and many other banks that can operate at faster speeds.

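
The effect of variation-aware rank formation can be illustrated with a small example. The sketch below is hypothetical throughout: the chip names, the per-bank latencies (in cycles), the two-chip ranks, and the simple sort-and-group heuristic are all assumptions chosen to make the point visible, not a proposed binning algorithm.

```python
from typing import Dict, List


def rank_bank_latency(chips: List[List[int]]) -> List[int]:
    """Per-bank latency of a rank: the slowest chip decides for each bank."""
    return [max(per_bank) for per_bank in zip(*chips)]


def group_by_bank(chips: Dict[str, List[int]], bank: int,
                  chips_per_rank: int) -> List[List[str]]:
    """Group chips with similar latency on one bank (assumed heuristic)."""
    ordered = sorted(chips, key=lambda name: chips[name][bank])
    return [ordered[i:i + chips_per_rank]
            for i in range(0, len(ordered), chips_per_rank)]


# Hypothetical latencies for banks 0..3 of four chips.
chips = {
    "A": [30, 30, 30, 14],
    "B": [30, 28, 30, 26],
    "C": [28, 30, 30, 15],
    "D": [30, 30, 29, 27],
}

naive = [["A", "B"], ["C", "D"]]
aware = group_by_bank(chips, bank=3, chips_per_rank=2)

for name, ranks in (("naive", naive), ("aware", aware)):
    bank3 = [rank_bank_latency([chips[c] for c in rank])[3] for rank in ranks]
    print(name, ranks, bank3)
```

With the naive pairing, bank-3 of both ranks is limited to 26 and 27 cycles by the slower chip in each pair. Grouping A with C instead yields one rank whose bank-3 can run at 15 cycles, while the other rank is no worse than before.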
**BOF Memory Controller.** The next step is to design a memory controller that can track many sets of timing parameters. A memory controller maintains transaction queues for pending read and write operations. A complex scheduling algorithm is used to select operations from these read and write transaction queues (e.g., TCM [28]). The selected transaction is then expanded into smaller commands (e.g., Precharge, Activate, Column-Read) that are placed in per-bank command queues. The command at the head of the command queue is issued when it fulfills a number of timing constraints. The primary change in the BOF memory controller is that it maintains a separate set of timing parameters for each bank. At run time, error rates are tracked for every execution epoch (say, 100 million cycles). If the error rate for a given bank is lower than some threshold, the timing parameters are made more aggressive for the next epoch. If the error rate exceeds the threshold, the timing parameters are made less aggressive. Some hysteresis can also be provided to avoid oscillations.
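
The per-bank adaptation loop can be sketched as follows. This is a minimal illustration under stated assumptions, not a finished design: the class name, the 5% step, the 50% floor, and the use of a dead band below the threshold as the form of hysteresis are all hypothetical choices. Timing is represented as a single percentage of the specified values, whereas a real controller would track each parameter separately.

```python
class BankTimingAdapter:
    """Adjust one bank's timing aggressiveness at the end of each epoch."""

    def __init__(self, threshold: float, dead_band: float = 0.5,
                 step_pct: int = 5, min_pct: int = 50) -> None:
        self.threshold = threshold    # tolerated errors per access
        self.dead_band = dead_band    # assumed hysteresis: hold between
                                      # dead_band * threshold and threshold
        self.step_pct = step_pct      # assumed adjustment granularity
        self.min_pct = min_pct        # assumed limit on aggressiveness
        self.scale_pct = 100          # 100 = specified timing parameters

    def end_of_epoch(self, errors: int, accesses: int) -> int:
        """Return the timing scale (percent of spec) for the next epoch."""
        rate = errors / accesses if accesses else 0.0
        if rate > self.threshold:
            # Too many errors: make the timing less aggressive.
            self.scale_pct = min(100, self.scale_pct + self.step_pct)
        elif rate < self.threshold * self.dead_band:
            # Comfortably below the threshold: shave more margin.
            self.scale_pct = max(self.min_pct, self.scale_pct - self.step_pct)
        # Otherwise the rate is inside the dead band and the setting is held.
        return self.scale_pct


bank = BankTimingAdapter(threshold=1e-6)
print(bank.end_of_epoch(errors=0, accesses=1_000_000))   # 95
print(bank.end_of_epoch(errors=0, accesses=1_000_000))   # 90
print(bank.end_of_epoch(errors=10, accesses=1_000_000))  # 95
print(bank.end_of_epoch(errors=3, accesses=4_000_000))   # 95
```

The last call shows the dead band at work: the observed rate is below the threshold but not far enough below it to justify another step, so the setting is held rather than oscillating.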

**Programming Models.** Different thresholds can be used for each bank, allowing each memory region to provide a different point on the performance-precision trade-off curve. These thresholds are exposed to the OS so that the OS can map pages appropriately. For example, OS pages are mapped to the most precise memory regions, while pages designated as being approximate by the application are mapped to the less precise regions. The application is also given the ability to set thresholds for memory regions used by the application. Further, when an approximate load instruction is being issued by the memory controller, it can choose to be even more aggressive than the parameters listed in its table. The error rates for a bank are exposed to the application so that it can choose to throttle the bank up/down or move a page to a better home. As part of future work, we will explore programming models that can exploit such hardware capabilities and such feedback from the hardware.
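
One possible OS placement policy is sketched below. It is only an illustration of the mapping idea: the function name, the region identifiers, the threshold values, and the rule of picking the most aggressive region whose threshold does not exceed the page's tolerance are all assumptions, since the programming model is left to future work.

```python
from typing import Dict


def place_page(region_thresholds: Dict[int, float], tolerance: float) -> int:
    """Pick a memory region for a page.

    region_thresholds maps a region id to the error-rate threshold that the
    controller enforces for it. tolerance is the highest error rate the
    page's owner accepts (0.0 for precise pages such as OS pages).
    """
    candidates = [r for r, t in region_thresholds.items() if t <= tolerance]
    if not candidates:
        raise ValueError("no region is precise enough for this page")
    # Assumed policy: among acceptable regions, prefer the most aggressive.
    return max(candidates, key=lambda r: region_thresholds[r])


# Hypothetical regions: one precise, two with increasingly relaxed thresholds.
regions = {0: 0.0, 1: 1e-6, 2: 1e-4}
print(place_page(regions, tolerance=0.0))   # 0
print(place_page(regions, tolerance=1e-5))  # 1
print(place_page(regions, tolerance=1e-3))  # 2
```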

**Error Correction Support.** Every DRAM chip is different because of parameter variations. As the memory system is pushed closer to the brink of failure, one DRAM chip in the rank is likely to yield errors before the rest. Therefore, chipkill support can help recover from a large fraction of BOF-induced errors [53]. Udipi et al. [53] show that the LOT-ECC chipkill mechanism introduces relatively low overheads in terms of performance and power (under 5%, relative to a memory system with basic SEC-DED support). Chipkill support is only required if the memory system is trying to provide high performance and high precision. For an approximate memory system or the approximate regions of memory, even basic SEC-DED support is enough to detect most errors and provide feedback to the application or OS. SEC-DED codes pose no performance overhead (since code words are read in parallel with data words), but introduce a storage and power overhead of 12.5% in the common case. In several domains, SEC-DED support is a requirement, so this may not represent a new overhead. When uncorrectable errors are encountered, the approximate application is allowed to proceed even though a subset of data bits are corrupted. Applications must therefore be designed to tolerate high imprecision in a few data values.

4 Methodology

The project is in its early stages. Our evaluation will ultimately require an empirical analysis of timing margins in actual DRAM chips. In this paper, we simply show the potential for improvement as DRAM timing parameters are varied.

For our simulations, we use Windriver Simics [3] interfaced with the USIMM memory simulator [11]. USIMM is configured to model a DDR4 memory system (bank groups, DDR4 timing parameters, etc.). Simics is used to model eight out-of-order processor cores. These eight cores share a single memory channel with two ranks. Simics and USIMM parameters are summarized in Table 1. Our default scheduler prioritizes row buffer hits, closes a row if there aren’t any pending requests to that row, and uses write queue water marks to drain many writes at once [12].

| Parameter | Value |
|---|---|
| **Processor** | |
| ISA | UltraSPARC III ISA |
| Cores and frequency | 8-core, 3.2 GHz |
| Fetch, Dispatch, Execute, and Retire | Maximum 4 per cycle |
| **Cache Hierarchy** | |
| L1 I-cache | 32KB/2-way, private, 1-cycle |
| L1 D-cache | 32KB/2-way, private, 1-cycle |
| L2 Cache | 4MB/64B/8-way, shared, 10-cycle |
| Coherence Protocol | Snooping MESI |
| **DRAM Parameters** | |
| Data rate | 1600 Mbps |
| Channels, ranks, banks | 1 channel, 2 ranks/channel, 16 banks/rank |
| Write queue water marks | 40 (high) and 20 (low), for each channel |
| Read Q Length | 32 per channel |
| DRAM chips | 32 Gb capacity |
| Timing Parameters | tCCD = 11, tRCD = 11, tRAS = 28, tFAW = 20, tWR = 12, tRP = 11, tRRD = 4, tCAS = 11, tRTP = 6, tDATA_TRANS = 4, tCCDL = 4, tCCDS = 4, tWTRL = 6, tWTRS = 2, tRRDL = 5, tRRDS = 4, tREFI = 7.8 µs, tRFC = 640 ns |

Table 1. Simulator and DRAM timing [23] parameters.

We use a collection of multi-programmed workloads from SPEC2k6 (astar, libquantum, lbm, mcf, omnetpp, bzip2, GemsFDTD, leslie3d) and multi-threaded workloads from NAS Parallel Benchmarks (NPB) (cg, ep, mg) and Cloudsuite [19] (cloud9, classification, cassandra). The SPEC workloads are run in rate mode (eight copies of the same program). SPEC programs are fast-forwarded for 50 billion instructions and multi-threaded applications are fast-forwarded until the start of the region of interest, before detailed simulations are started. Statistics from the first 5 million simulated instructions are discarded to account for cache warm-up effects. Simulations are terminated after a million memory reads are encountered.

5 Results

To estimate the effect of DRAM timing parameters on performance, we vary the following 13 DRAM timing parameters: tRCD, tRP, tRC, tRAS, tCAS, tCWD, tWR, tRTP, tFAW, tRRDL, tRRDS, tCCDL, tCCDS. Each of these timing parameters is improved in unison by 10%, 20%, 30%, 40%, and 50% (tREFI is increased and the rest are decreased). Figure 1 shows the impact on normalized execution time for each of these configurations. We see that a 50% improvement in all parameters results in a 23% reduction in average execution time. A 30% improvement in all timing parameters yields a 17% improvement in execution time, on average.
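
The way a configuration is derived from the baseline can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, only a handful of parameters are shown, tREFI is expressed in nanoseconds (7.8 µs) while the others are in cycles, and rounding shortened delays up to a whole cycle is our choice, since the rounding rule is not specified above.

```python
from typing import Dict, Iterable


def improve_timing(params: Dict[str, int], percent: int,
                   increased: Iterable[str] = ("tREFI",)) -> Dict[str, int]:
    """Improve every timing parameter by the given percentage.

    Parameters named in `increased` are lengthened (a longer refresh
    interval is better); all others are shortened.
    """
    grow = set(increased)
    result = {}
    for name, value in params.items():
        if name in grow:
            result[name] = value * (100 + percent) // 100
        else:
            # Assumed rounding: round up so a delay is never cut by more
            # than the requested percentage.
            result[name] = (value * (100 - percent) + 99) // 100
    return result


baseline = {"tRCD": 11, "tRP": 11, "tCAS": 11, "tRAS": 28, "tREFI": 7800}
print(improve_timing(baseline, 30))
# {'tRCD': 8, 'tRP': 8, 'tCAS': 8, 'tRAS': 20, 'tREFI': 10140}
```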

Figure 1. Cumulative effect of lowering different timing parameters on normalized execution time. Results are shown for timing parameters that are 10%,..., 50% better than the baseline.

To understand which timing parameters have the highest impact on performance, we carry out additional experiments that only modify a subset of these timing parameters in each experiment. Each experiment modifies timing parameters that are related to each other. These results are shown in Figures 2-6. We see that tRFC/tREFI (refresh timing parameters) and tCAS/tRAS/tRCD (delays to read data into row buffers) have the highest impact on performance. The effects of tTRANS/tCCDL/tCCDS (delays to move data to DRAM pins) are moderate. Meanwhile, the effects of tRP/tRAS/tRC (bank cycle time) and tFAW/tRRD (parameters to limit power delivery) are minor. We also observe that the effects of these timing parameters tend to be additive. This analysis is useful in designing a complexity-effective adaptive memory controller – to reduce memory controller complexity, it is best to focus on tRFC/tREFI/tCAS/tRAS/tRCD.

6 Conclusions

This paper argues for a brink-of-failure memory system that aggressively tries to shave DRAM timing margins, thus introducing a trade-off between performance and precision. Such a memory system can also exploit the high degree of parameter variation expected in future technologies. We show that a 30% improvement in a set of DRAM timing parameters can yield a 17% improvement in average execution time for a collection of memory-intensive workloads. Much future work remains, including an empirical analysis of variation in DRAM chips, and support from the OS, programming language, and application to exploit an approximate memory system.

References


