CS6963: Parallel Programming for GPUs (3 units)

Schedule: MW 10:45 AM - 12:05 PM

Location: MEB 3105

Instructor: Mary Hall, MEB 3466, mhall@cs.utah.edu

Office Hours: MW 12:20-1:20 PM

Teaching Assistant: Sriram Aananthakrishnan, MEB 3157, sriram@cs.utah.edu

Office Hours: Thursdays 2-3 PM

Course Summary

This course examines an important trend in high-performance computing: the use of special-purpose hardware, originally designed for graphics and games, to solve general-purpose computing problems. Such graphics processing units (GPUs) offer enormous peak performance for arithmetic-intensive computations at relatively low cost compared to general-purpose processors with similar performance levels. Technology trends are driving all microprocessors toward multi-core designs, and techniques for parallel programming have therefore become a rich area of recent study. Students in the course will learn how to develop scalable parallel programs that meet the unique requirements for obtaining high performance on GPUs. We will compare and contrast parallel programming for GPUs and conventional multi-core microprocessors.

The course will largely consist of small individual programming assignments and a larger term project to be presented to the class. As this course combines hands-on programming with a discussion of research in the area, it is suitable for Master's and PhD students who wish to learn how to write parallel applications or are engaged in research in related areas.


Prerequisites: basic knowledge of programming in C (CS440 or equivalent), algorithms and data structures (CS4150 or equivalent), and computer architecture or game hardware architecture (CS3810 or equivalent).

Textbooks and Resources

  • [Recommended] NVIDIA, CUDA Programming Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 on Windows, Linux, or Mac OS.
  • [Recommended] M. Pharr (editor), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Addison-Wesley, 2005).
  • [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed. (Addison-Wesley, 2003).
  • Additional Readings, connected with lectures


Homeworks/mini-projects        25%
Midterm test                   15%
Project proposal               10%
Project design review          10%
Project presentation/demo      15%
Project final report           20%
Class participation             5%


Assignment 1: Reductions, Data and Computation Partitioning

  • Due 5PM, Friday, January 30

Assignment 2: Dependences, Parallelization and Locality Optimization

  • Due 5PM, Thursday, February 19

Assignment 3: Tiling and Performance Experiments

Project Proposal

  • Due 5PM, Friday, March 13

Example PowerPoint poster (without logos)

Class Topics

Lecture 1: Introduction (pdf) (ppt) (mp3)

  • Course overview
  • Technology trends: why do GPUs look the way they do? Role of specialized accelerators.
  • General-purpose multi-core architectures: what’s different?
  • Parallel software crisis
  • How to determine if an application can get high performance on a GPU.
  • Reading: GPU Gems 2, Ch. 31


Lecture 2: Introduction to CUDA (pdf) (ppt) (mp3)

  • What is CUDA?
  • Computation partitioning constructs
  • Concept of data partitioning and constructs
  • Data management and orchestration
  • Work through a simple CUDA example: CPU version, GPU version
  • Reading: NVIDIA CUDA 2.0 Programming Guide
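As a rough illustration of computation partitioning, the sketch below emulates a one-dimensional CUDA launch on the CPU in plain C. Each (block, thread) pair computes one global element index, which is the usual way a CUDA kernel assigns work to threads. The function name vec_add_emulated and the block_dim parameter are hypothetical names chosen for this example and are not taken from the course materials.

```c
#include <assert.h>
#include <stddef.h>

/* CPU emulation of a 1D CUDA-style launch (illustrative only).
 * The two loops stand in for the grid of blocks and the threads
 * within a block; on a GPU these iterations would run in parallel. */
void vec_add_emulated(const float *a, const float *b, float *c,
                      size_t n, size_t block_dim)
{
    assert(block_dim > 0);
    /* Number of blocks needed to cover n elements: ceil(n / block_dim). */
    size_t grid_dim = (n + block_dim - 1) / block_dim;

    for (size_t block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (size_t thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            /* Global index, analogous to blockIdx.x * blockDim.x + threadIdx.x. */
            size_t i = block_idx * block_dim + thread_idx;
            if (i < n)  /* guard: the last block may be only partly used */
                c[i] = a[i] + b[i];
        }
    }
}
```

Calling vec_add_emulated(a, b, c, 5, 2) covers five elements with three blocks of two threads; the bounds check keeps the unused thread in the last block from writing past the end.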


Lecture 3: Synchronization, Data & Memory (pdf) (ppt) (mp3)

  • Error checking mechanisms
  • Synchronization
  • More on data partitioning
  • Reading: CUDA Programming Guide
  • Reference: Grama et al., Ch. 3


Lecture 4: Hardware and Execution Model (pdf) (ppt)

  • SIMD execution on Streaming Processors
  • MIMD execution across SPs
  • Multithreading to hide memory latency
  • Scoreboarding
  • Reading: CUDA Programming Guide
  • Reference: Grama et al., Ch. 2


Lecture 5: Writing Correct Parallel Programs (pdf) (ppt) (mp3)

  • Race conditions and data dependences
  • Tools and strategies for detecting race conditions
  • Abstractions to reason about safe parallelization
  • Omega calculator
  • Reading: Omega calculator documentation
  • Reference: “Optimizing Compilers for Modern Architectures: A Dependence-Based Approach”, Allen and Kennedy, 2002, Ch. 2.
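To make the link between data dependences and safe parallelization concrete, here is a minimal sketch in C. It assumes a very restricted loop shape, for (i = lo; i < hi; i++) a[i + w] = f(a[i + r]), and asks whether a write in one iteration touches an element read in a different iteration. The function name and the parameters w and r are hypothetical; tools such as the Omega calculator handle far more general cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* For the loop
 *     for (i = lo; i < hi; i++) a[i + w] = f(a[i + r]);
 * iterations i1 (write) and i2 (read) touch the same element when
 * i1 + w == i2 + r, i.e. when i2 - i1 == w - r. A loop-carried
 * dependence exists if that distance is nonzero and both iterations
 * fit inside the iteration range. */
bool has_loop_carried_dependence(long w, long r, long lo, long hi)
{
    assert(lo <= hi);
    long distance = w - r;
    long trip_count = hi - lo;
    return distance != 0 && labs(distance) < trip_count;
}
```

Under this model, a[i] = a[i-1] + 1 (w = 0, r = -1) carries a dependence and cannot be run in parallel as written, while a[i + 100] = a[i] over 100 iterations reads and writes disjoint elements and can.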


Lecture 6: Memory Hierarchy Optimization I, Data Placement (pdf) (ppt) (mp3)

  • Complex memory hierarchy: global memory, shared memory, constant memory and constant cache
  • Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
  • Guiding locality optimizations: reuse analysis
  • Reading: GPU Gems 2, Ch. 32


Lecture 7: Memory Hierarchy Optimization II, Reuse, Tiling, Unroll-and-Jam (pdf) (ppt) (mp3)

  • Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
  • Guiding locality optimizations: reuse analysis
  • GPU matrix multiply (compile using: nvcc -I/Developer/common/inc -L/Developer/CUDA/lib mmul.cu -lcutil)
  • Reading: GPU Gems 2, Ch. 32
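The tiling idea can be sketched on the CPU in plain C. The version below is an illustrative example, not the course's mmul.cu: it blocks all three loops of a row-major matrix multiply so that each tile of A, B, and C is reused while it is still in fast storage, and it keeps the running sum in a scalar so the compiler can hold it in a register. The tile size TILE = 4 is an arbitrary assumption for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4  /* hypothetical tile size; tune for the target memory */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C += A * B for row-major n x n matrices.
 * The caller must initialize C (e.g., to zero) beforehand. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                /* Work on one TILE x TILE block at a time. */
                for (size_t i = ii; i < min_size(ii + TILE, n); ++i)
                    for (size_t j = jj; j < min_size(jj + TILE, n); ++j) {
                        float sum = C[i * n + j];  /* accumulate in a scalar */
                        for (size_t k = kk; k < min_size(kk + TILE, n); ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
}
```

On a GPU the same structure would typically map the ii/jj tiles to thread blocks and stage the A and B tiles in shared memory.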


Guest Lecture: MPM Application (ppt) (pdf)


Lecture 8: Memory Hierarchy: Maximizing memory bandwidth (ppt) (pdf) (mp3)

  • Global memory coalescing and alignment
  • Memory bank conflicts
  • Reading: GPU Gems 2, Ch. 32 and 34
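A simple model helps when reasoning about memory bank conflicts. The C sketch below assumes that consecutive 32-bit words map to consecutive banks and that thread t accesses word t * stride; it then reports how many threads land on the busiest bank. The configuration of 16 banks and 16 threads used in the usage note is an assumption typical of GPUs from the CUDA 2.0 era, and the function name is hypothetical. Broadcast of a single word to all threads is not modeled.

```c
#include <assert.h>

#define MAX_BANKS 64  /* upper bound for this sketch only */

/* Returns the worst-case number of threads that hit the same bank
 * when thread t accesses word (t * stride). A result of 1 means the
 * access is conflict-free; a result of k means a k-way conflict. */
int bank_conflict_degree(int stride, int num_threads, int num_banks)
{
    assert(stride > 0 && num_threads > 0);
    assert(num_banks > 0 && num_banks <= MAX_BANKS);

    int hits[MAX_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < num_threads; ++t) {
        int bank = (t * stride) % num_banks;  /* bank holding this word */
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

With 16 threads and 16 banks, a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, a stride of 16 serializes all 16 accesses, and an odd stride such as 17 is conflict-free again. This is why padding an array row by one word is a common fix.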


Lecture 9: Control Flow (ppt) (pdf) (mp3)

  • SIMD Execution Model and Control Flow
  • Avoiding control flow
  • Warp organization and interaction with control flow
  • Reading: GPU Gems 2, Ch. 32 and 34
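The interaction between warps and control flow can be illustrated with a small C sketch. It groups thread IDs into warps of 32 (the warp size on NVIDIA GPUs) and counts the warps in which threads disagree on a branch condition, since those warps must execute both sides of the branch. The function and predicate names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define WARP_SIZE 32

typedef bool (*branch_pred)(int tid);

/* Counts warps whose threads do not all take the same branch. */
int count_divergent_warps(branch_pred pred, int num_threads)
{
    assert(pred != 0 && num_threads >= 0);
    int divergent = 0;
    for (int base = 0; base < num_threads; base += WARP_SIZE) {
        int end = base + WARP_SIZE < num_threads ? base + WARP_SIZE
                                                 : num_threads;
        bool first = pred(base);  /* direction taken by the first thread */
        for (int tid = base + 1; tid < end; ++tid) {
            if (pred(tid) != first) {  /* some thread disagrees */
                ++divergent;
                break;
            }
        }
    }
    return divergent;
}

/* Branch on thread parity: neighbors disagree, so every warp diverges. */
bool by_parity(int tid) { return tid % 2 == 0; }

/* Branch on warp index: all threads in a warp agree, so none diverge. */
bool by_warp(int tid) { return (tid / WARP_SIZE) % 2 == 0; }
```

For 128 threads, by_parity makes all four warps divergent while by_warp makes none divergent, even though both conditions send half of the threads down each path.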


Lecture 10: Floating Point and Project (ppt) (pdf) (mp3)

  • Accuracy
  • Performance
  • Project discussion
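One accuracy issue that matters for parallel code is that floating-point addition is not associative, so a parallel reduction that reorders additions may not match the sequential result. The C sketch below shows this with single-precision sums taken in two different orders. The function names are hypothetical, and volatile is used only to force each partial sum to be rounded to float.

```c
#include <assert.h>

/* Sums x[0..n-1] from the first element to the last. */
float sum_left_to_right(const float *x, int n)
{
    assert(n >= 0);
    volatile float acc = 0.0f;  /* volatile: round to float at each step */
    for (int i = 0; i < n; ++i)
        acc = acc + x[i];
    return acc;
}

/* Sums the same values from the last element to the first. */
float sum_right_to_left(const float *x, int n)
{
    assert(n >= 0);
    volatile float acc = 0.0f;
    for (int i = n - 1; i >= 0; --i)
        acc = acc + x[i];
    return acc;
}
```

For the input {1e8f, -1e8f, 1.0f}, the left-to-right sum is 1.0f, because the large values cancel first. The right-to-left sum is 0.0f, because 1.0f is smaller than the spacing between adjacent floats near 1e8 and is lost when added to -1e8f.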


Lecture 11: Tools (ppt) (pdf) (mp3)

  • Occupancy Calculator
  • Performance Feedback


Lecture 12: Introduction to OpenCL (mp3)


Lecture 13: Midterm Review (ppt) (pdf) (mp3)


Lecture 14: Design Review, Performance Cliffs and Optimization Benefits (ppt) (pdf) (mp3)



Lecture 15: CUBLAS Paper Discussion (ppt) (pdf)



Lecture 16: Design Review Feedback, Performance Cliffs and Optimization Benefits (ppt) (pdf)



Lecture 17: Lessons from Particle Systems (ppt) (pdf) (mp3)


Lecture 18: Global Synchronization (ppt) (pdf) (mp3)


Lecture 19: Dynamic Task Queues (ppt) (pdf) (mp3)


Week 15: Project Presentations


Appendix A: Sample Assignments

Project 1 – Embarrassingly Parallel Example (Week 2-4)

Develop a simple parallel code in CUDA, such as a search for a particular numerical pattern in a large data set.  Report the speedup obtained across different numbers of threads and thread blocks. 
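A CPU-side sketch in C of the kind of partitioning this project involves is shown below. It splits the data set into contiguous chunks, one per worker, counts matches within each chunk independently, and then combines the partial counts. The function name, the exact-match criterion, and the num_workers parameter are assumptions made for illustration; in CUDA the outer loop would be replaced by parallel threads and the final sum by a reduction.

```c
#include <assert.h>
#include <stddef.h>

/* Counts occurrences of target in data[0..n-1] by splitting the
 * array into num_workers contiguous chunks. Each chunk is counted
 * independently, so the outer loop iterations could run in parallel. */
size_t count_matches_partitioned(const int *data, size_t n,
                                 int target, size_t num_workers)
{
    assert(num_workers > 0);
    size_t chunk = (n + num_workers - 1) / num_workers;  /* ceil(n / workers) */
    size_t total = 0;

    for (size_t w = 0; w < num_workers; ++w) {
        size_t begin = w * chunk;
        size_t end = begin + chunk < n ? begin + chunk : n;
        size_t local = 0;  /* per-worker partial count */
        for (size_t i = begin; i < end; ++i)
            if (data[i] == target)
                ++local;
        total += local;    /* reduction step: combine partial counts */
    }
    return total;
}
```

Because the chunks are disjoint, the result does not depend on the number of workers, which makes the serial version a convenient correctness check for the parallel one.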

Project 2 – Performance Tuning (Week 5-7)

Develop a parallel matrix implementation in CUDA (e.g., linear algebra or transitive closure) and simultaneously tune memory hierarchy and parallel performance. Report the results of the following measurements: (i) FLOPS (floating-point operations per second) performance of the program across a set of data sizes and numbers of processors; (ii) a description of the code transformations needed to improve performance; (iii) results of performance monitoring measurements for each version of the code.
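As a worked example of the first measurement, the C sketch below converts a measured run time into GFLOPS for a dense n x n matrix multiply. It assumes the standard algorithm, which performs roughly 2n^3 floating-point operations (n^3 multiplies and n^3 adds); other matrix computations need their own operation counts. The function name is hypothetical.

```c
#include <assert.h>

/* Returns achieved GFLOPS for an n x n dense matrix multiply that
 * took the given number of seconds, assuming about 2 * n^3 operations. */
double matmul_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    double flop_count = 2.0 * n * n * n;  /* one multiply + one add per term */
    return flop_count / seconds / 1e9;
}
```

For instance, a 1000 x 1000 multiply that completes in 1 second corresponds to about 2 GFLOPS, and the same multiply in 0.1 seconds corresponds to about 20 GFLOPS.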

Project 3 – Application Programming Project (Week 9-13)

Apply the techniques from the previous projects to develop a full application in CUDA. These projects will be completed in groups of 2 or 3 students, and will include a design review, project presentation and implementation. 


Homework 1 – Parallelism, Race Conditions and Storage Management (Week 1-2)

Identify the parallelization opportunities in example code fragments. Identify race conditions in example code fragments.  Show how to express parallelism and data movement in example code fragments.


Homework 2 – Parallel Algorithms (Week 4)

Develop a parallel algorithm for a dense matrix computation, related to the upcoming programming assignment.