CS6963: Parallel Programming for GPUs (3 units)
Schedule: MW 10:45 AM - 12:05 PM
Location: MEB 3105
Instructor: Mary Hall, MEB 3466, mhall@cs.utah.edu
Office Hours: MW 12:20-1:20 PM
Teaching Assistant: Sriram Aananthakrishnan, MEB 3157, sriram@cs.utah.edu
Office Hours: Thursdays, 2-3 PM
Course Summary
This course examines an important trend in high-performance computing: the use of special-purpose hardware, originally designed for graphics and games, to solve general-purpose computing problems. Such graphics processing units (GPUs) offer enormous peak performance on arithmetically intensive computations, at relatively low cost compared to general-purpose processors with similar performance levels. Technology trends are driving all microprocessors toward multi-core designs, and techniques for parallel programming have consequently become a rich area of study. Students in the course will learn how to develop scalable parallel programs that meet the unique requirements for obtaining high performance on GPUs. We will compare and contrast parallel programming for GPUs and for conventional multi-core microprocessors.
The course consists largely of small individual programming assignments and a larger term project to be presented to the class. Because it combines hands-on programming with a discussion of research in the area, the course is suitable for Master's and PhD students who wish to learn how to write parallel applications or who are engaged in research in related areas.
Prerequisites
Basic knowledge of: programming in C (CS440 or equivalent); algorithms and data structures (CS4150 or equivalent); and computer architecture or game hardware architecture (CS3810 or equivalent).
Textbooks and Resources
GPU Gems 2, Part I: http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html
Grading
Homeworks/mini-projects 25%
Midterm test 15%
Project proposal 10%
Project design review 10%
Project Presentation/demo 15%
Project Final Report 20%
Class Participation 5%
Assignments
Assignment 1: Reductions, Data and Computation Partitioning
Assignment 2: Dependences, Parallelization and Locality Optimization
Assignment 3: Tiling and Performance Experiments
Project Proposal
Example PowerPoint poster (without logos)
Class Topics
Lecture 1: Introduction (pdf) (ppt) (mp3)
Lecture 2: Introduction to CUDA (pdf) (ppt) (mp3)
Lecture 3: Synchronization, Data & Memory (pdf) (ppt) (mp3)
Lecture 4: Hardware and Execution Model (pdf) (ppt)
Lecture 5: Writing Correct Parallel Programs (pdf) (ppt) (mp3)
Lecture 6: Memory Hierarchy Optimization I, Data Placement (pdf) (ppt) (mp3)
Lecture 7: Memory Hierarchy Optimization II, Reuse, Tiling, Unroll-and-Jam (pdf) (ppt) (mp3)
Guest Lecture: MPM Application (ppt) (pdf)
Lecture 8: Memory Hierarchy: Maximizing Memory Bandwidth (ppt) (pdf) (mp3)
Lecture 9: Control Flow (ppt) (pdf) (mp3)
Lecture 10: Floating Point and Project (ppt) (pdf) (mp3)
Lecture 11: Tools (ppt) (pdf) (mp3)
Lecture 12: Introduction to OpenCL (mp3)
Lecture 13: Midterm Review (ppt) (pdf) (mp3)
Lecture 14: Design Review, Performance Cliffs and Optimization Benefits (ppt) (pdf) (mp3)
Lecture 15: CUBLAS Paper Discussion (ppt) (pdf)
Lecture 16: Design Review Feedback, Performance Cliffs and Optimization Benefits (ppt) (pdf)
Lecture 17: Lessons from Particle Systems (ppt) (pdf) (mp3)
Lecture 18: Global Synchronization (ppt) (pdf) (mp3)
Lecture 19: Dynamic Task Queues (ppt) (pdf) (mp3)
Week 15: Project Presentations
Appendix A: Sample Assignments
Project 1—Embarrassingly Parallel Example (Weeks 2-4)
Develop a simple parallel code in CUDA, such as a search for a particular numerical pattern in a large data set. Report the speedup obtained across different numbers of threads and thread blocks.
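A minimal sketch of such a search in CUDA, counting occurrences of a target value with a grid-stride loop, might look like the following (all names, the data set, and the launch configuration are illustrative, not a required design):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scans a strided subset of the array and tallies matches.
__global__ void countMatches(const int *data, int n, int target, int *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        if (data[i] == target)
            atomicAdd(count, 1);   // simple, but serializes on matches
}

int main() {
    const int n = 1 << 20;
    int *data, *count;
    cudaMallocManaged(&data, n * sizeof(int));   // unified memory (CUDA 6+)
    cudaMallocManaged(&count, sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i % 97;  // synthetic data set
    *count = 0;
    countMatches<<<64, 256>>>(data, n, 42, count); // vary blocks/threads for the speedup study
    cudaDeviceSynchronize();
    printf("matches: %d\n", *count);
    cudaFree(data);
    cudaFree(count);
    return 0;
}
```

Timing the kernel under different `<<<blocks, threads>>>` configurations (e.g., with CUDA events) gives the speedup data the report asks for.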
Project 2—Performance Tuning (Weeks 5-7)
Develop a parallel matrix computation in CUDA (e.g., linear algebra or transitive closure) and simultaneously tune memory hierarchy and parallel performance. Report the results of the following measurements: (i) floating-point performance (FLOPS, floating-point operations per second) of the program across a set of data sizes and numbers of processors; (ii) a description of the code transformations needed to improve performance; (iii) results of performance-monitoring measurements for each version of the code.
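A common first transformation for this project is shared-memory tiling of matrix multiply. A sketch of such a kernel is below; the tile size and names are illustrative, and for brevity it assumes n is a multiple of TILE (a real version needs boundary checks):

```cuda
#define TILE 16

// C = A * B for n x n row-major matrices, using shared-memory tiles
// so each element of A and B is loaded from global memory n/TILE times
// instead of n times.
__global__ void matmulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // Cooperatively stage one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                     // tiles fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with tiles before reload
    }
    C[row * n + col] = acc;
}
```

For the FLOPS measurement, matrix multiply performs 2*n^3 floating-point operations, so sustained performance is roughly 2*n^3 divided by the measured kernel time.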
Project 3—Application Programming Project (Weeks 9-13)
Apply the techniques from the previous projects to develop a full application in CUDA. These projects will be completed in groups of 2 or 3 students, and will include a design review, a project presentation, and an implementation.
Homework 1 – Parallelism, Race Conditions and Storage Management (Weeks 1-2)
Identify parallelization opportunities and race conditions in example code fragments, and show how to express parallelism and data movement in them.
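As a hypothetical example of the kind of fragment involved, an unsynchronized read-modify-write on shared data is a classic GPU race condition (the histogram and bin count here are made up for illustration):

```cuda
// RACE: many threads may read, increment, and write the same bin
// concurrently, so increments are lost.
__global__ void histogramRacy(const int *data, int n, int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        bins[data[i] % 16] += 1;
}

// FIXED: an atomic read-modify-write serializes conflicting updates,
// so every increment is counted.
__global__ void histogramSafe(const int *data, int n, int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % 16], 1);
}
```

The racy version may even produce correct results on small inputs, which is exactly why race conditions must be identified by inspecting the code rather than by testing alone.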
Homework 2 – Parallel Algorithms (Week 4)
Develop a parallel algorithm for a dense matrix computation, related to the upcoming programming assignment.
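As one example of the kind of decomposition this homework asks for (the specific computation is not prescribed by the assignment), a dense matrix-vector product y = A x can assign one thread per row:

```cuda
// y = A * x for an n x n row-major matrix A.
// Decomposition: one thread computes one output element (one dot product),
// so all n rows proceed in parallel with no shared writes.
__global__ void matvec(const float *A, const float *x, float *y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];
        y[row] = sum;
    }
}
```

Part of the exercise is arguing why this decomposition is race-free: each thread writes a distinct y[row], and A and x are only read.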