Moving CNN Accelerators Closer to Data

A significant fraction of the energy in recent CNN accelerators is dissipated in moving operands between storage and compute units. In this work, we repurpose the CPU's last-level cache (LLC) to perform in-situ dot-product computations, significantly reducing data movement. Because the LLC is composed of many subarrays, many such dot-products can be performed in parallel, boosting throughput as well. The in-situ operation does not require analog circuits; it is performed with a bit-wise AND of two subarray rows, followed by digital aggregation of partial sums. The proposed architecture yields a 2.74x improvement in throughput and a 6.31x improvement in energy, relative to a DaDianNao baseline, primarily because it eliminates a large fraction of the data transfers over the cache's H-Tree interconnects.

We propose incorporating the Logic-in-Memory operation into a processor's LLC. This introduces a small area overhead but has minimal impact on the LLC's ordinary read/write operations. When executing a CNN, the LLC, or portions of it, can operate as an accelerator. To process a layer of the CNN, its input feature maps and weights are loaded from memory into LLC subarrays. In-situ computations are performed and the outputs are retained in the LLC subarrays. To process the next layer, a new set of weights is loaded into the LLC subarrays and the computations continue. Many networks may be able to accommodate all of their weights in the LLC, avoiding frequent memory fetches.
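To make this flow concrete, here is a minimal, purely illustrative sketch of the layer-by-layer execution described above. The class and method names (LLCAccelerator, load_activations, run_layer) are hypothetical stand-ins rather than the paper's interface; the point is that activations stay resident in the LLC between layers, and only each layer's weights are fetched from memory.

```python
# Illustrative model of the execution flow (hypothetical names, not the
# paper's interface): activations remain resident in LLC subarrays between
# layers, and only the next layer's weights are brought in from memory.

class LLCAccelerator:
    def __init__(self):
        self.resident = None                    # activations held in LLC subarrays

    def load_activations(self, ifmaps):
        self.resident = ifmaps                  # one-time fetch of input feature maps

    def run_layer(self, weights, compute):
        # Weights are loaded for this layer; the in-situ computation produces
        # outputs that stay in the subarrays as inputs to the next layer.
        self.resident = compute(self.resident, weights)
        return self.resident

def run_network(llc, ifmaps, layers):
    llc.load_activations(ifmaps)
    for weights, compute in layers:             # only weights move per layer
        llc.run_layer(weights, compute)
    return llc.resident
```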

The product of two operands can be computed by AND-ing each bit of one operand with each bit of the other, and aggregating the resulting partial products with the appropriate shifts. To facilitate the AND operation, we leverage the Logic-in-Memory circuit; to reduce data movement, the aggregation is performed adjacent to the subarray, without any H-Tree traversal.
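The arithmetic behind this shift-and-add aggregation can be illustrated with a short functional sketch. It models only the math, not the subarray circuitry, and the 8-bit operand width is an assumption for illustration.

```python
# Functional sketch of multiplying two unsigned operands with only single-bit
# ANDs and shifted additions (the operand width here is illustrative).

def bitwise_multiply(a: int, b: int, bits: int = 8) -> int:
    result = 0
    for i in range(bits):                              # bit i of operand a
        for j in range(bits):                          # bit j of operand b
            partial = ((a >> i) & 1) & ((b >> j) & 1)  # single-bit AND
            result += partial << (i + j)               # aggregate with the appropriate shift
    return result

assert bitwise_multiply(13, 11) == 13 * 11
```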

The figure above shows the overall block diagram of the architecture and an example mapping of operands to one subarray. It depicts a single LLC bank with 8 subarrays (only 4 are shown) and a shifter unit; each subarray has a RAT unit. The rows in a subarray are split between weights and activations. To utilize the full computational capability of the proposed architecture, SISCA, the activations and kernels are distributed evenly across all 1024 subarrays.
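The even distribution and the in-subarray aggregation can be sketched functionally as below. This is an abstraction under stated assumptions: each subarray's local reduction is modeled with np.dot rather than the AND-and-shift hardware described earlier, and the operand sizes are illustrative. The point is that each subarray reduces its slice locally, so only one partial sum per subarray has to cross the interconnect.

```python
# Sketch of distributing one large dot-product across subarrays: each subarray
# reduces its slice locally, so only a single partial sum per subarray is
# aggregated globally. np.dot stands in for the in-situ AND-and-shift logic.

import numpy as np

NUM_SUBARRAYS = 1024   # subarray count used in the mapping described above

def distributed_dot_product(activations: np.ndarray, weights: np.ndarray) -> float:
    act_slices = np.array_split(activations, NUM_SUBARRAYS)
    wgt_slices = np.array_split(weights, NUM_SUBARRAYS)
    partial_sums = [np.dot(a, w) for a, w in zip(act_slices, wgt_slices)]  # per-subarray reduction
    return float(sum(partial_sums))                                        # lightweight global aggregation

x = np.arange(4096, dtype=np.float32)
w = np.ones(4096, dtype=np.float32)
assert distributed_dot_product(x, w) == float(x.sum())
```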

In this work, we propose that, with some hardware enhancements, the SRAM cache can function as a neural network accelerator. This can be helpful for both server architectures and mobile devices. Our early estimates show that, relative to DaDianNao, SISCA offers higher throughput and energy efficiency, primarily because of the massive data parallelism achieved by activating multiple wordlines across all the subarrays, and the reduced data movement on H-Tree interconnects. While prior works like DaDianNao have leveraged a near-data processing approach, we show that this concept can be exploited further by moving computation even closer to the data, into the storage subarrays themselves.

Full Paper:

S. Gudaparthi, S. Narayanan and R. Balasubramonian, "Moving CNN Accelerator Computations Closer to Data"