

CS6963 L11: Dense Linear Algebra

## Triangular Solve (STRSM)

    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        if (B[j*n+k] != 0.0f) {
          for (i = k+1; i < n; i++)
            B[j*n+i] -= A[k * n + i] * B[j * n + k];
        }

Equivalent to:

    cublasStrsm('l' /* left operator */, 'l' /* lower triangular */,
                'N' /* not transposed */, 'u' /* unit triangular */,
                N, N, alpha, d_A, N, d_B, N);

See: http://www.netlib.org/blas/strsm.f
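To see why the loop nest corresponds to a left-side, lower, non-transposed, unit-diagonal solve, it helps to read both arrays as column-major, so that `B[j*n+i]` is element (i, j). The sketch below is plain C rather than CUDA. It wraps the loop in a hypothetical helper `strsm_unit_lower` and adds a hypothetical `max_residual` that multiplies back by A to check that A·X reproduces the original right-hand side. Both helper names are illustrative assumptions, not part of the lecture or of CUBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A * X = B in place (X overwrites B) by forward substitution.
 * A and B are n x n, column-major: element (i, j) lives at [j*n + i].
 * A is treated as unit lower triangular: its diagonal is assumed to be 1
 * and its strictly upper part is never read. */
void strsm_unit_lower(int n, const float *A, float *B)
{
    assert(n > 0 && A != NULL && B != NULL);
    for (int j = 0; j < n; j++)          /* one right-hand-side column at a time */
        for (int k = 0; k < n; k++)
            if (B[j * n + k] != 0.0f)    /* nothing to eliminate when x_k is zero */
                for (int i = k + 1; i < n; i++)
                    B[j * n + i] -= A[k * n + i] * B[j * n + k];
}

/* Largest |(A * X - B0)(i, j)|, using only the strictly lower part of A
 * plus an implicit unit diagonal. */
float max_residual(int n, const float *A, const float *X, const float *B0)
{
    float worst = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            float sum = X[j * n + i];                /* unit diagonal term */
            for (int k = 0; k < i; k++)
                sum += A[k * n + i] * X[j * n + k];  /* A(i,k) * X(k,j), k < i */
            float err = sum - B0[j * n + i];
            if (err < 0.0f) err = -err;
            if (err > worst) worst = err;
        }
    return worst;
}
```

For example, with A stored column by column as (1,2,3), (0,1,4), (0,0,1) and a right-hand-side column (1,2,3), the solve leaves (1,0,0) in that column, since A times (1,0,0) is the first column of A.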










- Many projects will compare speedup over a sequential CPU implementation
  - OK for this class, but not for a research contribution
- Is your CPU implementation as "smart" as your GPU implementation?
  - Parallel?
  - Manages memory hierarchy?
  - Minimizes synchronization or accesses to global memory?
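As one illustration of what a "smarter" CPU baseline might look like, the sketch below tiles a single-precision matrix multiply so that each block of the operands is reused while it is in cache, and marks the outer loop for OpenMP so it can use all cores. It is a minimal sketch, not a tuned BLAS: `TILE`, `sgemm_naive`, and `sgemm_tiled` are hypothetical names, the tile size of 32 is an untuned assumption, and the OpenMP pragma is compiled in only when OpenMP is enabled.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32   /* assumed tile size; a real baseline would tune this per cache */

/* Reference version: C += A * B, all n x n, row-major. */
void sgemm_naive(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled version of the same update. Each thread owns disjoint row blocks
 * of C, so no synchronization is needed inside the parallel loop. */
void sgemm_tiled(int n, const float *A, const float *B, float *C)
{
    assert(n > 0 && A != NULL && B != NULL && C != NULL);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < min_int(ii + TILE, n); i++)
                    for (int k = kk; k < min_int(kk + TILE, n); k++) {
                        float a = A[i * n + k];   /* reused across the j loop */
                        for (int j = jj; j < min_int(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Timing the GPU kernel against something like `sgemm_tiled` (or a vendor BLAS) rather than `sgemm_naive` gives a speedup figure that is easier to defend.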



|                         | Core i7-960 | GTX280      |
|-------------------------|-------------|-------------|
| Number of PEs           | 4           | 30          |
| Frequency (GHz)         | 3.2         | 1.3         |
| Number of transistors   | 0.7B        | 1.4B        |
| BW (GB/sec)             | 32          | 141         |
| SP SIMD width           | 4           | 8           |
| DP SIMD width           | 2           | 1           |
| Peak SP scalar (GFLOPS) | 25.6        | 116.6       |
| Peak SP SIMD (GFLOPS)   | 102.4       | 311.1/933.1 |
| Peak DP SIMD (GFLOPS)   | 51.2        | 77.8        |
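The Core i7-960 peak rows can be reproduced from the rows above them. The sketch below assumes peak = PEs × frequency × SIMD width × flops per lane per cycle, with 2 flops per lane per cycle (one multiply and one add) for the CPU. That factor is an assumption, not something the table states, and `peak_gflops` is a hypothetical helper. The GTX280 column is left out because its entries depend on flop-counting conventions that the table does not spell out.

```c
#include <assert.h>

/* Peak throughput in GFLOPS under the assumed model
 *   PEs * GHz * SIMD lanes * flops per lane per cycle. */
double peak_gflops(int pes, double ghz, int simd_width, int flops_per_lane_per_cycle)
{
    assert(pes > 0 && ghz > 0.0 && simd_width > 0 && flops_per_lane_per_cycle > 0);
    return pes * ghz * simd_width * flops_per_lane_per_cycle;
}

/* Core i7-960 inputs from the table: 4 PEs at 3.2 GHz.
 * The factor 2 (multiply + add per cycle) is an assumption. */
double i7_peak_sp_scalar(void) { return peak_gflops(4, 3.2, 1, 2); }  /* table: 25.6  */
double i7_peak_sp_simd(void)   { return peak_gflops(4, 3.2, 4, 2); }  /* table: 102.4 */
double i7_peak_dp_simd(void)   { return peak_gflops(4, 3.2, 2, 2); }  /* table: 51.2  */
```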












CS6963 L12: Sparse Linear Algebra

## Summary of Representation and Implementation

| Kernel | Granularity      | Coalescing | Bytes/Flop (32-bit) | Bytes/Flop (64-bit) |
|--------|------------------|------------|---------------------|---------------------|
| DIA    | thread : row     | full       | 4                   | 8                   |
| ELL    | thread : row     | full       | 6                   | 10                  |
| CSR(s) | thread : row     | rare       | 6                   | 10                  |
| CSR(v) | warp : row       | partial    | 6                   | 10                  |
| COO    | thread : nonzero | full       | 8                   | 12                  |
| HYB    | thread : row     | full       | 6                   | 10                  |

From Bell/Garland: Summary of SpMV kernel properties.
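The Bytes/Flop columns for DIA, ELL, CSR, and COO can be reproduced by counting memory traffic per nonzero. The sketch below assumes 4-byte indices, counts only the matrix value, its index or indices, and the gathered `x` entry for each nonzero (ignoring row pointers and writes to `y`), and charges 2 flops (multiply and add) per nonzero. Under those assumptions CSR and ELL read one index per nonzero, COO reads two, and DIA reads none. `spmv_csr` and `bytes_per_flop` are hypothetical names, and the CSR loop is a sequential C stand-in for the one-thread-per-row CSR(s) kernel, with each outer iteration playing the role of one thread.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a CSR matrix with num_rows rows.
 * row_ptr has num_rows + 1 entries; col_idx and val have
 * row_ptr[num_rows] entries each. */
void spmv_csr(int num_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    assert(num_rows >= 0 && row_ptr != NULL && y != NULL);
    for (int row = 0; row < num_rows; row++) {   /* one GPU thread per row in CSR(s) */
        float dot = 0.0f;
        for (int jj = row_ptr[row]; jj < row_ptr[row + 1]; jj++)
            dot += val[jj] * x[col_idx[jj]];     /* 2 flops; reads val, col_idx, x */
        y[row] = dot;
    }
}

/* Bytes read per flop for one nonzero under the counting model above. */
double bytes_per_flop(int indices_per_nonzero, int value_bytes)
{
    const int index_bytes = 4;                       /* assumed 32-bit indices */
    int bytes = indices_per_nonzero * index_bytes    /* row and/or column index */
              + 2 * value_bytes;                     /* matrix value + x entry */
    return bytes / 2.0;                              /* multiply + add */
}
```

With this model, `bytes_per_flop(1, 4)` gives 6 and `bytes_per_flop(2, 8)` gives 12, matching the CSR/ELL 32-bit and COO 64-bit entries in the table.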












