CS6235: Parallel Programming for Many-Core Architectures (3 units)
Schedule: MW 10:45 AM - 12:05 PM
Location: MEB 3105
Instructor: Mary Hall, MEB 3466, mhall@cs.utah.edu
Office Hours: M 12:20-1:00 PM, Th 11:00-11:40 AM
Teaching Assistant: Saurav Muralidharan, sauravm@cs.utah.edu
Office hours: Wednesday after class and by appointment
Course Summary
This course examines an important trend in high-performance computing: the use of special-purpose hardware, originally designed for graphics and games, to solve general-purpose computing problems. Such graphics processing units (GPUs) offer enormous peak performance on arithmetically intensive computations, at relatively low cost compared to general-purpose processors with similar performance levels. Technology trends are driving all microprocessors toward multi-core designs, so techniques for parallel programming represent a rich area of current study. Students in the course will learn how to develop scalable parallel programs that meet the unique requirements for obtaining high performance on GPUs. We will compare and contrast parallel programming for GPUs and for conventional multi-core microprocessors.
The course will largely consist of small individual programming assignments and a larger term project to be presented to the class. Several projects from last year's course were presented in workshops and poster sessions, and contributed to Masters and PhD research. As this course combines hands-on programming with a discussion of research in the area, it is suitable for Masters and PhD students who wish to learn how to write parallel applications or who are engaged in research in related areas.
Prerequisites
Basic knowledge of: programming in C (CS440 or equivalent); algorithms and data structures (CS4150 or equivalent); and computer architecture or game hardware architecture (CS3810 or equivalent).
Textbooks and Resources
- [Recommended] Programming Massively Parallel Processors: A Hands-On Approach, David Kirk and Wen-mei Hwu, 2nd edition, Morgan Kaufmann Publishers, Dec. 2012, ISBN 0124159923.
- [Recommended] NVIDIA CUDA Programming Guide, available from http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 on Windows, Linux, or Mac OS.
- [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed. (Addison-Wesley, 2003).
- Additional readings connected with individual lectures
Grading
Homeworks/mini-projects: 35%
Midterm test: 15%
Project proposal: 5%
Project design review: 10%
Project presentation/demo: 15%
Project final report: 20%
Assignments
Assignment 1: Basic CUDA Program Structure
Your assignment is to add and multiply two vectors, to get started writing programs in CUDA. In the regression test (in driver.c), the addition and multiplication are coded into the functions, and the makefile compiles and links for the grad lab.
Due 5PM, Friday, January 18
Submit using handin on CADE machines: “handin cs6235 lab1”
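As a sanity check for a CUDA version, a plain-C CPU reference for the two operations might look like the sketch below (the function names here are illustrative, not the ones in driver.c):

```c
#include <stddef.h>

/* CPU reference for the two operations in the assignment:
   elementwise c[i] = a[i] + b[i] and c[i] = a[i] * b[i].
   A CUDA kernel would assign one element per thread instead of looping. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

void vec_mul(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```

Comparing the GPU output against a reference like this is how the regression test decides pass/fail.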
Assignment 2: Memory Hierarchy Optimization
Supporting files: Input, Sample Output
Host implementation in C++
dense_matvec.cu example (compareData, Timing)
makefile.dense example for CUDA 5
Due 5PM, Friday, February 8
Assignment 3: Load Balancing and Advanced Memory Hierarchy
Your assignment is to write a CUDA implementation of triangular solve and compare its performance with the CUBLAS library. I have augmented the simpleCUBLAS project (simpleCUBLAS.c in the CUDA SDK) to provide a reference CPU implementation for strsm and compare it against the CUBLAS call to cublasStrsm.
Submit using handin on CADE machines: “handin cs6235 lab3”
Related reference
Due 5PM, Tuesday, March 5
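The heart of a CPU reference for triangular solve is forward substitution. A single-right-hand-side sketch is below; note that cublasStrsm generalizes this to multiple right-hand sides with a scaling factor, and the names here are illustrative:

```c
#include <stddef.h>

/* Reference forward substitution: solve L x = b, where L is an n x n
   lower-triangular matrix stored row-major with nonzero diagonal.
   The inner loop reads x[j] values produced by earlier rows, which is
   exactly the dependence pattern that makes triangular solve hard to
   load-balance on a GPU. */
void trisolve_lower(const float *L, const float *b, float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= L[i * n + j] * x[j];   /* depends on earlier rows */
        x[i] = s / L[i * n + i];
    }
}
```

Each row's solve depends on all previous rows, so parallelism is only available within a row (or across independent right-hand sides), which is why load balancing is the theme of this assignment.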
Assignment 4: Global Address Space
Explore the new global address space features of CUDA 5.
Submit using handin on CADE machines: “handin cs6235 lab4”
Project Proposal
Sample proposals: graph coloring, program analysis
Due 5PM, Friday, March 8
Submit using handin on CADE machines: “handin cs6235 prop”
Project Posters
Sample posters: graph coloring, program analysis
Project Final Report
Sample final report: Greater-Than-Strong
Class Topics
Lecture 1: Introduction to GPUs and CUDA (ppt) (pdf)
- Technology trends: why do GPUs look the way they do? Role of specialized accelerators.
- General-purpose multi-core architectures: what’s different?
- What is CUDA?
- Computation partitioning constructs
- Concept of data partitioning and constructs
- Data management and orchestration
- Work through a simple CUDA example (CPU version, GPU version, makefile)
- Reading: NVIDIA CUDA 5.0 Programming Guide, Ch. 1-3; Kirk and Hwu, Ch. 1-2
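CUDA's computation-partitioning constructs divide an iteration space across a grid of thread blocks. The CPU sketch below (illustrative, not from the course materials) emulates the standard CUDA global-index idiom with ordinary loops:

```c
#include <stddef.h>

/* Emulates the CUDA partitioning idiom
   i = blockIdx.x * blockDim.x + threadIdx.x, including the bounds
   check needed when n is not a multiple of the block size. On a GPU
   the two outer loops disappear: each (block, thread) pair runs the
   body once, in parallel. */
void scale_partitioned(float *data, size_t n, float s,
                       size_t grid_dim, size_t block_dim) {
    for (size_t block = 0; block < grid_dim; block++)           /* blocks in grid */
        for (size_t thread = 0; thread < block_dim; thread++) { /* threads in block */
            size_t i = block * block_dim + thread;
            if (i < n)       /* guard: last block may have extra threads */
                data[i] = s * data[i];
        }
}
```

The `if (i < n)` guard is the detail most first CUDA programs get wrong: the grid is sized by rounding up, so some threads fall past the end of the data.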
Lecture 2: Hardware and Execution Model (ppt) (pdf)
- SIMD execution on Streaming Processors
- MIMD execution across SPs
- Multithreading to hide memory latency
- Scoreboarding
- Reading: NVIDIA CUDA 3.2 Programmers' Guide, Ch. 4; Kirk and Hwu, Ch. 3
- Reference: Grama et al., Ch. 2
Lecture 3: Memory Hierarchy Optimization I, Locality and Data Placement (ppt) (pdf)
- Complex memory hierarchy: global memory, shared memory, constant memory and constant cache
- Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
- Guiding locality optimizations: reuse analysis
- Reading: Kirk and Hwu, Ch. 4
Lecture 4: Memory Hierarchy Optimization II, Reuse, Tiling (ppt) (pdf)
- Optimizations for managing limited storage: tiling, unroll-and-jam, register assignment
- Guiding locality optimizations: reuse analysis
- GPU matrix multiply (compile using: nvcc -I/Developer/common/inc -L/Developer/CUDA/lib mmul.cu -lcutil)
- Reading: Kirk and Hwu, Ch. 4
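The loop structure behind the tiled GPU matrix multiply can be sketched in plain C (tile size and names here are illustrative; on a GPU, TILE would match the thread-block dimensions and the tile would live in shared memory):

```c
#include <stddef.h>

#define TILE 2  /* illustrative; a GPU version matches this to blockDim */

/* Tiled matrix multiply C = A * B for n x n row-major matrices
   (n assumed to be a multiple of TILE to keep the sketch short).
   Tiling reorders the loops so each TILE x TILE sub-block of A and B
   is reused many times from fast memory -- shared memory on a GPU,
   cache on a CPU -- instead of being refetched from global memory. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n * n; i++) C[i] = 0.0f;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                /* compute the contribution of one tile pair */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++)
                        for (size_t k = kk; k < kk + TILE; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Reuse analysis (the next bullet's topic) is what tells you which of A, B, and C is worth staging into the tile: here each element of A and B is touched n times, so tiling pays off.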
Lecture 5: Memory Hierarchy Optimization III, Unroll-and-Jam, Bandwidth Optimizations (ppt) (pdf)
- Optimizations for managing limited storage: unroll-and-jam, register assignment
- Bandwidth optimizations: global memory coalescing, avoiding shared memory bank conflicts
- Reading: Kirk and Hwu, Ch. 4 and 5
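Unroll-and-jam is easiest to see on matrix-vector multiply. In this CPU sketch (assuming n is even; names are illustrative), the outer loop is unrolled by 2 and the two resulting inner loops are fused, so each load of x[j] serves two rows:

```c
#include <stddef.h>

/* Unroll-and-jam on y = A * x (row-major A, n assumed even):
   the outer i loop is unrolled by 2 and the inner loops are fused
   ("jammed"). Each x[j] is loaded once and used for two rows, and the
   two accumulators stay in registers -- trading a little register
   pressure for roughly half the loads of x. */
void matvec_uaj(const float *A, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i += 2) {
        float s0 = 0.0f, s1 = 0.0f;   /* register-resident accumulators */
        for (size_t j = 0; j < n; j++) {
            float xj = x[j];          /* one load serves both rows */
            s0 += A[i * n + j] * xj;
            s1 += A[(i + 1) * n + j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
}
```

The unroll factor is bounded by register capacity, which is why the lectures pair this transformation with register assignment.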
Lecture 6: Memory Hierarchy Optimization IV, Bandwidth Optimizations and Jacobi Case Study (ppt) (pdf)
- 2-D Jacobi stencil case study
- Bandwidth optimizations: global memory coalescing, avoiding shared memory bank conflicts
- Reading: Kirk and Hwu, Ch. 4 and 5
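The case study centers on the 5-point Jacobi sweep. A CPU sketch of one sweep (illustrative, not the course's code):

```c
#include <stddef.h>

/* One sweep of a 2-D Jacobi 5-point stencil on an n x n grid with a
   fixed boundary: each interior point becomes the average of its four
   neighbors, read from `in` and written to `out`. Because the update
   is not in place, every point is independent -- a natural fit for a
   one-thread-per-point GPU mapping, with coalescing determined by how
   threads traverse the row-major layout. */
void jacobi_sweep(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; i++)
        for (size_t j = 1; j + 1 < n; j++)
            out[i * n + j] = 0.25f * (in[(i - 1) * n + j] +
                                      in[(i + 1) * n + j] +
                                      in[i * n + (j - 1)] +
                                      in[i * n + (j + 1)]);
}
```

Each input element is read by up to four neighboring output points, which is the reuse that makes staging a block of the grid in shared memory worthwhile.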
Lecture 7: Writing Correct Programs (Synchronization) (ppt) (pdf)
- Error checking mechanisms
- Synchronization
- Data dependences and race conditions
- Reading: NVIDIA CUDA 3.2 Programming Guide, Appendix B.4, B.5 and B.10
- Reference: “Optimizing Compilers for Modern Architectures: A Dependence-Based Approach”, Allen and Kennedy, 2002, Ch. 2
Lecture 8: Control Flow (ppt) (pdf)
- Divergent branches, predicated execution
- Reading: Kirk and Hwu, Ch. 5; NVIDIA Programming Guide, 5.4.2 and B.11
- Projects and upcoming proposals
- Floating point precision vs. accuracy
- Reading: Kirk and Hwu, Ch. 6; NVIDIA Programming Guide, Appendix B
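Predicated execution can be sketched in C as replacing a branch with a select (illustrative; think of each array element as one thread in a warp):

```c
#include <stddef.h>

/* Divergent form: when "threads" (elements of a warp) disagree on the
   condition, a GPU serializes the two branch paths. */
void clamp_branchy(float *a, size_t n, float lo) {
    for (size_t i = 0; i < n; i++)
        if (a[i] < lo)            /* divergent if values straddle lo */
            a[i] = lo;
}

/* Predicated form: every element executes the same instruction
   sequence; the condition only selects which value is kept, which
   compilers can lower to a select/predicated move with no branch. */
void clamp_predicated(float *a, size_t n, float lo) {
    for (size_t i = 0; i < n; i++) {
        float v = a[i];
        a[i] = (v < lo) ? lo : v;
    }
}
```

Both produce identical results; the difference is only in how the hardware schedules the work when a warp's threads disagree on the condition.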
Lecture 9: CUDA-CHiLL and Dense Linear Algebra (ppt) (pdf)
Lecture 10/11: Sparse Linear Algebra (ppt) (pdf); Graph Coloring (ppt) (pdf)
- Sparse matrix representations
- Use in stencil computations
- Reading:
- “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors,” Bell and Garland (NVIDIA), SC09, Nov. 2009.
- “Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs,” Choi, Singh, and Vuduc, PPoPP ’10, Jan. 2010.
- “Optimizing sparse matrix-vector multiply on emerging multicore platforms,” Journal of Parallel Computing, 35(3):178-194, March 2009. (Expanded from SC07 paper.)
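The compressed sparse row (CSR) format at the core of these papers can be sketched in plain C (names illustrative):

```c
#include <stddef.h>

/* Sparse matrix-vector multiply y = A * x with A in Compressed Sparse
   Row (CSR) form: row_ptr[i] .. row_ptr[i+1] delimit the nonzeros of
   row i in vals[], with matching column indices in col_idx[].
   Bell and Garland's CSR kernels map one thread (scalar CSR) or one
   warp (vector CSR) to each iteration of this outer loop; rows with
   very different nonzero counts are what create the load imbalance
   the papers address. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const float *vals, const float *x, float *y) {
    for (size_t i = 0; i < nrows; i++) {
        float s = 0.0f;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += vals[k] * x[col_idx[k]];
        y[i] = s;
    }
}
```

The indirect access x[col_idx[k]] is uncoalesced in general, which is why alternative layouts (ELL, COO, hybrid) appear in the readings.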
Lecture 12: Application Case Studies I (ppt) (pdf); Generalized Interpolation Material Point Method (ppt) (pdf)
Lecture 13: Application Case Studies II: Coulomb Potential (pdf)
Lecture 14: Sorting (and OpenGL integration) (ppt) (pdf)
Lecture 15: Stencil Computations (pdf)
Midterm Review (ppt) (pdf)
2012 midterm
2011 midterm
2010 midterm
2009 midterm
Lecture 16a: Design Review and 6-Function MPI (ppt) (pdf)
Lecture 16b: Libraries, OpenCL and OpenACC (ppt) (pdf)
Lecture 17: Generalizing CUDA (ppt) (pdf)
Lecture 18: Architecture Future and Getting to Exascale (ppt) (pdf)