CS6963: Parallel Programming for GPUs (3 units)

CS6963: Parallel Programming for GPUs (3 units)

Schedule: MW 10:45 AM - 12:05 PM

Location: MEB 3147

Instructor: Mary Hall, MEB 3466, mhall@cs.utah.edu

Office Hours: M 12:20-1:00 PM, Th 11:00-11:40 AM

Teaching Assistant: Sriram Aananthakrishnan, MEB 3115, sriram@cs.utah.edu

Office hours: TBA

Course Summary

This course examines an important trend in high-performance computing, the use of special-purpose hardware originally designed for graphics and games to solve general-purpose computing problems. Such graphics processing units (GPUs) have enormous peak performance for arithmetically-intensive computations, and at relatively low cost as compared to their general-purpose counterparts with similar performance levels. Technology trends are driving all microprocessors towards multiple core designs, and therefore, techniques for parallel programming represent a rich area of recent study. Students in the course will learn how to develop scalable parallel programs targeting the unique requirements for obtaining high performance on GPUs. We will compare and contrast parallel programming for GPUs and conventional multi-core microprocessors.

The course will largely consist of small individual programming assignments, and a larger term project to be presented to the class. Several of these projects from last year's course were presented in workshops and poster sessions, and contributed to Masters and PhD research. As this course combines hands-on programming and a discussion of research in the area, it is suitable for Masters students and PhD students who wish to learn how to write parallel applications or are engaged in research in related areas.

Prerequisites

Basic knowledge of: programming in C (CS440 or equivalent); algorithms and data structures (CS4150 or equivalent) and computer architecture or game hardware architecture (CS3810 or equivalent).

Textbooks and Resources

[Recommended] Programming Massively Parallel Processors: A Hands On Approach, available from http: David Kirk and Wen-mei Hwu, February 2010, Morgan Kaufmann Publishers, ISBN 0123814723.

[Recommended] NVidia, CUDA Programmng Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 and Windows, Linux or MAC OS.

[Additional] Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed. (Addison-Wesley, 2003).
Additional Readings, connected with lectures

Grading

Homeworks/mini-projects 30%

Midterm test 15%

Project proposal 10%

Project design review 10%

Project Presentation/demo 15%

Project Final Report 20%

Assignments

Assignment 1: Basic CUDA Program Structure

driver.c

CMakeLists.txt

Due 5PM, Friday, January 21
Submit using handin on CADE machines: “handin cs6963 lab1 ”

Assignment 2: Computation Partitioning and Memory Hierarchy

Description of Assignment README, Sequential code, Cmake, driver
Due 5PM, Tuesday, February 8
Submit using handin on CADE machines: “handin cs6963 lab2 ”

Assignment 3: Load Balancing and Advanced Memory Hierarchy

simpleCUBLAS.c

CMakefile

hw3

Due 5PM, Tuesday, March 15
Submit using handin on CADE machines: “handin cs6963 lab3 ”

Project Proposal Sample proposals graph coloring program analysis

Due 5PM, Monday, March 7
Submit using handin on CADE machines: “handin cs6963 prop ”

Class Topics

Lecture 1: Introduction to GPUs and CUDA (ppt) (pdf)

Technology trends: why do GPUs look the way they do? Role of specialized accelerators.
General-purpose multi-core architectures: what’s different?
What is CUDA?
Computation partitioning constructs
Concept of data partitioning and constructs
Data management and orchestration
CPU version GPU version Work through simple CUDA example
Reading: NVIDIA CUDA 3.2 Programmers’ Guide Ch. 1-3, Kirk and Hwu Ch. 1-2

Lecture 2: Hardware and Execution Model (ppt) (pdf)

SIMD execution on Streaming Processors
MIMD execution across SPs
Multithreading to hide memory latency
Scoreboarding

Reading: NVIDIA CUDA 3.2 Programmers' Guide Ch. 4, Kirk and Hwu Ch. 3
Reference: Grama et al., Ch. 2

Lecture 3: Memory Hierarchy Optimization I, Locality and Data Placement (ppt) (pdf)

Complex memory hierarchy: global memory, shared memory, constant memory and constant cache
Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
Guiding locality optimizations: reuse analysis
Reading: Kirk and Hwu, Ch. 4

Lecture 4: Memory Hierarchy Optimization II, Reuse, Tiling (ppt) (pdf)

Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
Guiding locality optimizations: reuse analysis
GPU matrix multiply (compile using: nvcc -I/Developer/common/inc -L/Developer/CUDA/lib mmul.cu -lcutil)
Reading: Kirk and Hwu, Ch. 4

Lecture 5: Memory Hierarchy Optimization III, Unroll-and-Jam, Bandwidth Optimizations (ppt) (pdf)

Optimizations for managing limited storage: unroll-and-jam, register assignment
Bandwidth optimizations: global memory coalescing, avoiding shared memory bank conflicts
Reading: Kirk and Hwu, Ch. 4 and 5

Lecture 6: Memory Hierarchy Optimization IV, Bandwidth Optimizations and Jacobi Case Study (ppt) (pdf)

2-D Jacobi stencil case study
Bandwidth optimizations: global memory coalescing, avoiding shared memory bank conflicts
Reading: Kirk and Hwu, Ch. 4 and 5

Lecture 7: Writing Correct Programs (Synchronization) (ppt) (pdf)

Error checking mechanisms
Synchronization
Data Dependences and Race Conditions

Reading: NVIDIA CUDA 3.2 Programming Guide, Appendix B.4, B.5 and B.10
Reference: “Optimizing Compilers for Modern Architectures: A Dependence-Based Approach”, Allen and Kennedy, 2002, Ch. 2

Lecture 8: Control Flow (ppt) (pdf)

Divergent Branches, Predicated Execution,
Reading: Kirk and Hwu, Ch. 5, NVIDIA Programming Guide, 5.4.2 and B.11

Lecture 9: Projects and Floating Point (ppt) (pdf)

Projects and Upcoming Proposals
Floating Point Precision vs. Accuracy
Reading: Kirk and Hwu, Ch. 6, NVIDIA Programming Guide, Appendix B

Lecture 10: Dense Linear Algebra (ppt) (pdf)

Projects and Upcoming Proposals
CUBLAS 2.0 and research implementations
Reading: Volkov, V., and Demmel, J. W. 2008. Benchmarking GPUs to tune dense linear algebra, SC08, November 2008. (slides)

Lecture 11: Sparse Linear Algebra (ppt) (pdf)

Sparse matrix representations
Use in stencil computations
Reading:
"Implementing Sparse Matrix-Vector Multiplication on Throughput Oriented Processors,” Bell and Garland (Nvidia), SC09, Nov. 2009.
"Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs”, Choi, Singh, Vuduc, PPoPP 10, Jan. 2010.
“Optimizing sparse matrix-vector multiply on emerging multicore platforms,” Journal of Parallel Computing, 35(3):178-194, March 2009. (Expanded from SC07 paper.)

Lecture 12: Application Case Studies (ppt) (pdf)

gemv and transposed gemv, more on trsm, intro to LU
Reading: Kirk and Hwu, Ch. 7

Guest Lectures

Lecture 13: Application Case Studies II (ppt) (pdf) Molecular Dynamics (Coulomb) from textbook

gemv and transposed gemv, more on trsm, intro to LU
Reading: Kirk and Hwu, Ch. 8

Lecture 14: Dynamic Task Queues and Synchronization (ppt) (pdf)

Lecture 15: Tree-Based Algorithms (ppt) (pdf)

Midterm Review (ppt) (pdf)

2010 midterm 2009 midterm