CS6963: Parallel Programming for GPUs (3 units)

Schedule: MW 10:45 AM - 12:05 PM

Location: MEB 3105

Instructor: Mary Hall, MEB 3466, mhall@cs.utah.edu

Office Hours: MW 12:20-1:20 PM

Teaching Assistant: Sriram Aananthakrishnan, MEB 3157, sriram@cs.utah.edu

Office Hours: Thursdays, 2-3 PM

Course Summary

This course examines an important trend in high-performance computing: the use of special-purpose hardware, originally designed for graphics and games, to solve general-purpose computing problems.  Such graphics processing units (GPUs) offer enormous peak performance for arithmetically intensive computations, at relatively low cost compared to general-purpose processors with similar performance levels.  Technology trends are driving all microprocessors towards multi-core designs, making techniques for parallel programming an increasingly important area of study.  Students in the course will learn how to develop scalable parallel programs targeting the unique requirements for obtaining high performance on GPUs.  We will compare and contrast parallel programming for GPUs and conventional multi-core microprocessors.

The course will largely consist of small individual programming assignments and a larger term project to be presented to the class.  As this course combines hands-on programming with a discussion of research in the area, it is suitable for Master's and PhD students who wish to learn how to write parallel applications or who are engaged in research in related areas.

Prerequisites

Basic knowledge of: programming in C (CS440 or equivalent); algorithms and data structures (CS4150 or equivalent); and computer architecture or game hardware architecture (CS3810 or equivalent).

Textbooks and Resources

  • [Recommended] NVIDIA, CUDA Programming Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 and Windows, Linux, or Mac OS.
  • [Recommended] M. Pharr (editor), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation (Addison-Wesley, 2005). http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html

  • [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed. (Addison-Wesley, 2003).
  • Additional readings, listed with the lectures below

Grading

Homeworks/mini-projects       25%
Midterm test                  15%
Project proposal              10%
Project design review         10%
Project presentation/demo     15%
Project final report          20%
Class participation            5%

Assignments

Assignment 1: Reductions, Data and Computation Partitioning

  • Due 5PM, Friday, January 30

Assignment 2: Dependences, Parallelization and Locality Optimization

  • Due 5PM, Thursday, February 19

Assignment 3: Tiling and Performance Experiments

Project Proposal

  • Due 5PM, Friday, March 13

Example PowerPoint poster (without logos)

Class Topics

Lecture 1: Introduction (pdf) (ppt) (mp3)

  • Course overview
  • Technology trends: why do GPUs look the way they do? Role of specialized accelerators.
  • General-purpose multi-core architectures: what’s different?
  • Parallel software crisis
  • How to determine if an application can get high performance on a GPU.
  • Reading: GPU Gems 2, Ch. 31

 

Lecture 2: Introduction to CUDA (pdf) (ppt) (mp3)

  • What is CUDA?
  • Computation partitioning constructs
  • Concept of data partitioning and constructs
  • Data management and orchestration
  • Work through a simple CUDA example, comparing a CPU version and a GPU version (see the sketch after this list)
  • Reading: NVIDIA CUDA 2.0 Programming Guide
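
A minimal sketch in the spirit of the in-class example (the example actually worked through in lecture may differ): a sequential CPU loop and its CUDA counterpart, where the loop becomes a grid of threads and the host orchestrates data movement.

    #include <cuda_runtime.h>

    /* CPU version: a single loop does all the work. */
    void addHost(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }

    /* GPU version: the loop is partitioned across a grid of threads;
       each thread computes one element. */
    __global__ void addKernel(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   /* guard for n not a multiple of blockDim */
    }

    /* Data management and orchestration: allocate device memory,
       copy inputs over, launch the kernel, copy the result back. */
    void addDevice(const float *a, const float *b, float *c, int n) {
        float *da, *db, *dc;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
        addKernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        cudaFree(da); cudaFree(db); cudaFree(dc);
    }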

 

Lecture 3: Synchronization, Data & Memory (pdf) (ppt) (mp3)

  • Error checking mechanisms (see the sketch after this list)
  • Synchronization
  • More on data partitioning
  • Reading: CUDA Programming Guide
  • Reference: Grama et al., Ch. 3
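
A sketch combining both topics; the CHECK macro and kernel are illustrative, not from the lecture, and the synchronization call uses the period-appropriate name cudaThreadSynchronize (later toolkits rename it cudaDeviceSynchronize).

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Wrap every runtime call so errors surface immediately with a location. */
    #define CHECK(call)                                                  \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                        cudaGetErrorString(err), __FILE__, __LINE__);    \
                exit(1);                                                 \
            }                                                            \
        } while (0)

    /* Each thread reads its neighbor's value, so all writes to s[]
       must complete before any reads: a job for __syncthreads(). */
    __global__ void rotate(float *data) {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = data[t];
        __syncthreads();                    /* block-wide barrier */
        data[t] = s[(t + 1) % blockDim.x];
    }

    int main(void) {
        float *d;
        CHECK(cudaMalloc((void**)&d, 256 * sizeof(float)));
        CHECK(cudaMemset(d, 0, 256 * sizeof(float)));
        rotate<<<1, 256>>>(d);
        CHECK(cudaGetLastError());          /* catches bad launch configurations */
        CHECK(cudaThreadSynchronize());     /* catches errors during execution */
        CHECK(cudaFree(d));
        return 0;
    }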

 

Lecture 4: Hardware and Execution Model (pdf) (ppt)

  • SIMD execution on Streaming Processors
  • MIMD execution across SPs
  • Multithreading to hide memory latency
  • Scoreboarding
  • Reading: CUDA Programming Guide
  • Reference: Grama et al., Ch. 2

 

Lecture 5: Writing Correct Parallel Programs (pdf) (ppt) (mp3)

  • Race conditions and data dependences (example after this list)
  • Tools and strategies for detecting race conditions
  • Abstractions to reason about safe parallelization
  • Omega calculator
  • Reading: Omega calculator documentation
  • Reference: “Optimizing Compilers for Modern Architectures: A Dependence-Based Approach”, Allen and Kennedy, 2002, Ch. 2.
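
For instance, a minimal illustration (the kernels are hypothetical, not from the lecture): two versions of a counting kernel, one with a race on a shared counter and one that restores correctness with an atomic operation.

    #include <cuda_runtime.h>

    /* RACE: every matching thread performs an unsynchronized
       read-modify-write on the same counter, so updates are lost
       and the result depends on thread interleaving. */
    __global__ void countZerosRacy(const int *v, int *count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && v[i] == 0) (*count)++;       /* data race */
    }

    /* FIXED: atomicAdd serializes the conflicting updates, recovering
       the sequential semantics at some cost in performance. */
    __global__ void countZerosAtomic(const int *v, int *count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && v[i] == 0) atomicAdd(count, 1);
    }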

 

Lecture 6: Memory Hierarchy Optimization I, Data Placement (pdf) (ppt) (mp3)

  • Complex memory hierarchy: global memory, shared memory, constant memory, and constant cache (illustrated after this list)
  • Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
  • Guiding locality optimizations: reuse analysis
  • Reading: GPU Gems 2, Ch. 32
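
As a flavor of data placement, a hedged sketch (names and sizes are illustrative): a 3-point stencil that stages data from global memory into shared memory and keeps its small read-only coefficient table in constant memory.

    #include <cuda_runtime.h>

    __constant__ float c[3];     /* constant memory: small, read-only, cached */

    /* Launch with 256 threads per block. Each input element is read by up
       to three threads, so staging a tile (plus halo) into shared memory
       converts repeated global loads into cheap on-chip accesses. */
    __global__ void stencil3(const float *in, float *out, int n) {
        __shared__ float tile[258];          /* 256 elements + one halo cell per side */
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = (g < n) ? in[g] : 0.0f;
        if (threadIdx.x == 0) {              /* first thread fills the halo */
            tile[0] = (g >= 1) ? in[g - 1] : 0.0f;
            int r = blockIdx.x * blockDim.x + blockDim.x;
            tile[257] = (r < n) ? in[r] : 0.0f;
        }
        __syncthreads();
        if (g < n)
            out[g] = c[0] * tile[t - 1] + c[1] * tile[t] + c[2] * tile[t + 1];
    }

The host initializes the coefficient table with cudaMemcpyToSymbol(c, hostCoeff, 3 * sizeof(float)), where hostCoeff is an illustrative host-side array.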

 

Lecture 7: Memory Hierarchy Optimization II, Reuse, Tiling, Unroll-and-Jam (pdf) (ppt) (mp3)

  • Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
  • Guiding locality optimizations: reuse analysis
  • GPU matrix multiply (compile using: nvcc -I/Developer/common/inc -L/Developer/CUDA/lib mmul.cu -lcutil); a tiling sketch follows this list
  • Reading: GPU Gems 2, Ch. 32
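
A hedged sketch of the shared-memory tiling idea (the mmul.cu distributed in class may differ; TILE and the square-matrix assumption are illustrative):

    #define TILE 16

    /* C = A * B for n x n row-major matrices, with n a multiple of TILE.
       Each block computes one TILE x TILE tile of C. Staging matching tiles
       of A and B through shared memory means each global value is loaded
       once per tile rather than once per multiply-add. */
    __global__ void mmulTiled(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < n; k += TILE) {
            As[threadIdx.y][threadIdx.x] = A[row * n + (k + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(k + threadIdx.y) * n + col];
            __syncthreads();                 /* tiles fully loaded */
            for (int kk = 0; kk < TILE; ++kk)
                sum += As[threadIdx.y][kk] * Bs[kk][threadIdx.x];
            __syncthreads();                 /* done reading before the next load */
        }
        C[row * n + col] = sum;
    }

Launched as mmulTiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n); unroll-and-jam and register assignment further tune the inner loop.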

 

Guest Lecture: MPM Application (ppt) (pdf)

 

Lecture 8: Memory Hierarchy: Maximizing memory bandwidth (ppt) (pdf) (mp3)

  • Global memory coalescing and alignment (example after this list)
  • Memory bank conflicts
  • Reading: GPU Gems 2, Ch.32 and 34
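
A standard illustration of the coalescing issue (the Particle layout is hypothetical): with an array of structures, consecutive threads touch addresses 12 bytes apart; with a structure of arrays, a warp's accesses fall on consecutive words.

    /* Array-of-structures: thread i reads p[i].x, so a warp's loads are
       strided by sizeof(Particle) = 12 bytes and coalesce poorly. */
    struct Particle { float x, y, z; };

    __global__ void scaleAoS(struct Particle *p, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }
    }

    /* Structure-of-arrays: consecutive threads read consecutive floats,
       which the hardware coalesces into a few wide memory transactions. */
    __global__ void scaleSoA(float *x, float *y, float *z, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { x[i] *= s; y[i] *= s; z[i] *= s; }
    }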

 

Lecture 9: Control Flow (ppt) (pdf) (mp3)

  • SIMD Execution Model and Control Flow
  • Avoiding control flow
  • Warp organization and interaction with control flow (example after this list)
  • Reading: GPU Gems 2, Ch.32 and 34
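
A small illustration of divergence (the two kernels deliberately compute different things; the point is the branch granularity): when a branch condition varies within a 32-thread warp, the warp executes both paths serially; when the condition is uniform across each warp, it does not.

    /* Divergent: even and odd threads in the same warp take different
       paths, so every warp serializes both branches. */
    __global__ void divergentBranch(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0) d[i] *= 2.0f;
        else            d[i] += 1.0f;
    }

    /* Warp-aligned: the condition is constant within each 32-thread
       warp, so no warp executes both paths. */
    __global__ void warpAlignedBranch(float *d) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / 32) % 2 == 0) d[i] *= 2.0f;
        else                   d[i] += 1.0f;
    }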

 

Lecture 10: Floating Point and Project (ppt) (pdf) (mp3)

  • Accuracy (example after this list)
  • Performance
  • Project discussion
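
A host-side illustration of the accuracy issue (values chosen for effect): near 1e8 the gap between adjacent single-precision floats is 8, so the order in which large and small values are summed changes the answer.

    #include <stdio.h>

    int main(void) {
        float big = 1.0e8f;

        float s1 = big;                                /* large value first:    */
        for (int i = 0; i < 1000000; ++i) s1 += 1.0f;  /* every 1.0f rounds away */

        float s2 = 0.0f;                               /* small values first:   */
        for (int i = 0; i < 1000000; ++i) s2 += 1.0f;  /* they accumulate exactly */
        s2 += big;

        printf("big-first: %.0f  small-first: %.0f  exact: 101000000\n", s1, s2);
        return 0;
    }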

 

Lecture 11: Tools (ppt) (pdf) (mp3)

  • Occupancy Calculator
  • Performance Feedback

 

Lecture 12: Introduction to OpenCL (mp3)

 

Lecture 13: Midterm Review (ppt) (pdf) (mp3)

 

Lecture 14: Design Review, Performance Cliffs and Optimization Benefits (ppt) (pdf) (mp3)

 

Lecture 15: CUBLAS Paper Discussion (ppt) (pdf)

 

Lecture 16: Design Review Feedback, Performance Cliffs and Optimization Benefits (ppt) (pdf)

 

Lecture 17: Lessons from Particle Systems (ppt) (pdf) (mp3)

 

Lecture 18: Global Synchronization (ppt) (pdf) (mp3)

 

Lecture 19: Dynamic Task Queues (ppt) (pdf) (mp3)

 

Week 15: Project Presentations

 


Appendix A: Sample Assignments

Project 1—Embarrassingly Parallel Example (Weeks 2-4)

Develop a simple parallel code in CUDA, such as a search for a particular numerical pattern in a large data set.  Report the speedup obtained across different numbers of threads and thread blocks. 
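
For example, a minimal sketch of such a search (the kernel name and flag-array scheme are illustrative): the problem is embarrassingly parallel because each thread tests one alignment of the pattern independently.

    /* Each thread checks whether the pattern occurs at one starting
       position; hit[i] records the outcome. Threads never interact. */
    __global__ void patternSearch(const int *data, int n,
                                  const int *pat, int plen, int *hit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > n - plen) return;
        int match = 1;
        for (int j = 0; j < plen; ++j)
            if (data[i + j] != pat[j]) { match = 0; break; }
        hit[i] = match;
    }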

Project 2—Performance Tuning (Weeks 5-7)

Develop a parallel matrix implementation in CUDA (e.g., linear algebra or transitive closure) and simultaneously tune memory hierarchy and parallel performance.  Report the results of the following measurements: (i) FLOP/s (floating-point operations per second) performance of the program across a set of data sizes and numbers of processors; (ii) a description of the code transformations needed to improve performance; (iii) results of performance monitoring measurements for each version of the code.
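
One way to collect the timing behind measurement (i), sketched here with a simple SAXPY stand-in for the matrix kernel: CUDA events bracket the launch and report elapsed GPU time.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main(void) {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc((void**)&x, n * sizeof(float));
        cudaMalloc((void**)&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        cudaEvent_t start, stop;                 /* events bracket the launch */
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);              /* wait until the kernel is done */

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        /* 2 floating-point operations per element; divide by seconds. */
        printf("%.3f ms, %.2f GFLOP/s\n", ms, 2.0 * n / (ms * 1e6));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(x); cudaFree(y);
        return 0;
    }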

Project 3—Application Programming Project (Weeks 9-13)

Apply the techniques from the previous projects to develop a full application in CUDA. These projects will be completed in groups of 2 or 3 students and will include a design review, project presentation, and implementation.

 

Homework 1 – Parallelism, Race Conditions and Storage Management (Weeks 1-2)

Identify the parallelization opportunities in example code fragments. Identify race conditions in example code fragments.  Show how to express parallelism and data movement in example code fragments.
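
The fragments below (constructed for illustration, not from the handout) show the style of analysis: the first loop's iterations are independent, while the second carries a dependence from iteration i-1 to iteration i, so parallelizing it as written introduces a race.

    #include <stdio.h>
    #define N 8

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 1.0f; }

        /* Parallelizable: each iteration writes a distinct a[i] and
           reads only b and c. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        /* NOT safely parallelizable as written: iteration i reads
           a[i-1], which iteration i-1 writes (a loop-carried true
           dependence). */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + c[i];

        printf("%f\n", a[N - 1]);
        return 0;
    }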

 

Homework 2 – Parallel Algorithms (Week 4)

Develop a parallel algorithm for a dense matrix computation, related to the upcoming programming assignment.