





## Understanding Global Memory Accesses

Memory protocol for compute capability 1.2\* (CUDA Manual 5.1.2.1)

- Start with memory request by smallest numbered thread. Find the memory segment that contains the address (32, 64 or 128 byte segment, depending on data type)
- Find other active threads requesting addresses within that segment and coalesce
- Reduce transaction size if possible
- Access memory and mark threads as "inactive"
- Repeat until all threads in half-warp are serviced

\*Includes Tesla and GTX platforms
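The protocol steps above can be modeled on the host to count the transactions a given access pattern would generate. This is a minimal Python sketch, not the hardware's actual implementation. The function names are hypothetical, addresses are assumed to be aligned to the data size, and two details are assumptions drawn from the CUDA programming guide's description of this protocol: the segment-size mapping (32 bytes for 1-byte data, 64 bytes for 2-byte data, 128 bytes otherwise) and the halving rule used for "reduce transaction size".

```python
def segment_size(word_bytes):
    """Assumed segment size in bytes for a given per-thread data size."""
    if word_bytes == 1:
        return 32
    if word_bytes == 2:
        return 64
    return 128


def coalesce_half_warp(addresses, word_bytes):
    """Model the transactions issued for one half-warp.

    addresses: byte address requested by each active thread, indexed by
    thread number (assumed aligned to word_bytes).
    Returns a list of (segment_base, transaction_bytes) tuples.
    """
    active = dict(enumerate(addresses))
    transactions = []
    while active:
        # Step 1: segment containing the lowest-numbered active thread's address.
        size = segment_size(word_bytes)
        base = (active[min(active)] // size) * size
        # Step 2: every active thread whose address falls in that segment.
        served = [t for t, a in active.items() if base <= a < base + size]
        # Step 3: halve the transaction while only one half is used
        # (assumed rule; 32 bytes is taken as the smallest transaction).
        while size > 32:
            half = size // 2
            if all(active[t] + word_bytes <= base + half for t in served):
                size = half
            elif all(active[t] >= base + half for t in served):
                base, size = base + half, half
            else:
                break
        # Step 4: issue the transaction and mark the served threads inactive.
        transactions.append((base, size))
        for t in served:
            del active[t]
    # Step 5 is the loop itself: repeat until every thread is serviced.
    return transactions


unit_stride = [4 * i for i in range(16)]       # thread i reads float i
misaligned = [4 * (i + 1) for i in range(16)]  # same pattern, shifted by one float
scattered = [128 * i for i in range(16)]       # one float per 128-byte segment

print(coalesce_half_warp(unit_stride, 4))     # [(0, 64)]
print(coalesce_half_warp(misaligned, 4))      # [(0, 128)]
print(len(coalesce_half_warp(scattered, 4)))  # 16
```

Under this model, a unit-stride half-warp is served by a single 64-byte transaction, a one-element shift widens it to 128 bytes, and a scattered pattern costs one transaction per thread.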

L8: Memory Hierarchy III

CS6963

UNIVERSITY OF UTAH
























## How addresses map to banks on G80

- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - So bank = word address % 16, i.e. (byte address / 4) % 16
  - Same as the size of a half-warp
    - No bank conflicts between different half-warps, only within a single half-warp
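The mapping above can be checked for a few strides with a small host-side model. This is an illustrative Python sketch with hypothetical helper names. It treats threads that read the identical word as non-conflicting, which is a simplification of the hardware's broadcast behaviour rather than a precise description of it.

```python
from collections import defaultdict

NUM_BANKS = 16   # G80
WORD_BYTES = 4   # each bank is 32 bits wide


def bank_of(byte_address):
    """Bank holding the 32-bit word at this byte address."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Largest number of distinct words one half-warp requests from a
    single bank (1 means conflict-free under this simplified model)."""
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        words_per_bank[bank_of(address)].add(address // WORD_BYTES)
    return max(len(words) for words in words_per_bank.values())


def strided(stride_words):
    """Byte addresses read by the 16 threads of a half-warp at a word stride."""
    return [WORD_BYTES * stride_words * t for t in range(16)]


for stride in (1, 2, 3, 16):
    print(stride, conflict_degree(strided(stride)))
# 1 1
# 2 2
# 3 1
# 16 16
```

Odd strides share no factor with 16, so they spread the half-warp across all banks; a stride of 16 words sends every thread to bank 0.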

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign














- Think about memory access patterns across threads
  - May need a different computation & data partitioning
  - Sometimes "padding" can be used on a dimension to align accesses
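As one illustration of the padding idea, the sketch below rounds a row width up so that every row of a 2D float array starts on a segment boundary. It is a hedged Python example: the helper names are hypothetical, and the 64-byte alignment target (one half-warp of 4-byte words) is an assumption chosen for the example. On the device, the CUDA runtime's `cudaMallocPitch` serves a similar purpose.

```python
SEGMENT_BYTES = 64  # assumed alignment target: 16 threads x 4-byte words
WORD_BYTES = 4


def padded_width(width_words):
    """Round a row width (in words) up to a whole number of segments."""
    per_segment = SEGMENT_BYTES // WORD_BYTES
    return -(-width_words // per_segment) * per_segment  # ceiling division


def row_start(row, pitch_words):
    """Byte offset of the first element of a row for a given pitch."""
    return row * pitch_words * WORD_BYTES


width = 100                  # logical row width in floats
pitch = padded_width(width)  # allocated row width in floats

print(pitch)                                                    # 112
print([row_start(r, width) % SEGMENT_BYTES for r in range(4)])  # [0, 16, 32, 48]
print([row_start(r, pitch) % SEGMENT_BYTES for r in range(4)])  # [0, 0, 0, 0]
```

Without padding, only row 0 starts on a segment boundary; with 12 unused floats per row, every row does, at the cost of 12% extra storage in this example.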




- Maximize Memory Bandwidth!
  - Make each memory access count
- Exploit spatial locality in global memory accesses
- The opposite goal in shared memory
  - Each thread accesses independent memory banks
