Refreshments 3:20 p.m.
Abstract
Scientific programmers often turn to vendor-tuned Basic Linear Algebra
Subprograms (BLAS) to obtain portable high performance. However, many
numerical algorithms require several BLAS calls in sequence, and those
successive calls result in suboptimal performance because the entire
sequence should be optimized in concert. Alternatively, a programmer
could write their matrix operations directly in Fortran or C and
use a state-of-the-art optimizing compiler. However, our experiments
show that optimizing compilers often attain only one-quarter the
performance of hand-optimized code. In this talk I present a
domain-specific compiler for matrix algebra, the Build to Order BLAS
(BTO), that reliably achieves high performance using a scalable search
algorithm for choosing the best combination of loop fusion, array
contraction, and multithreading for data parallelism. The BTO
compiler generates code that is between 16% slower and 39% faster than
hand-optimized code.