Colloquium – Tanzima Islam
March 25, 2019 @ 10:00 am - 11:00 am (lecture)
Western Washington University
Host: Hari Sundar
Data-Driven Analysis for Performance Characterization and Comparison in HPC Co-design
While High Performance Computing (HPC) systems are gaining more compute resources to solve larger problems at finer resolution, it is becoming more difficult for applications to use the underlying machines efficiently (the so-called scalability problem). This difficulty stems from complex interactions among components such as many- and multi-core architectures, network technologies, programming models, and software layers. Synergistic activities among hardware vendors, software architects, and application developers, known as HPC co-design, aim to collectively prepare for deploying the first exascale supercomputers and to use these million-dollar procurements effectively. Application teams often isolate the performance-critical regions of large-scale production codes into smaller applications called proxies. Proxy applications make it easy to evaluate the performance behavior of larger, more complex applications across experimental systems for procurement or co-design purposes. However, current approaches to validating proxy applications are ad hoc and often manual. In this talk, I will present my recent work on using a data-driven analysis approach to systematically validate proxy applications. Specifically, my work answers the following questions: a) which hardware resources affect the on-node scaling behavior of applications? and b) how well does a proxy capture the resource utilization behavior of its production counterpart? This principled approach reduces the search space of performance variables by 97% and has successfully identified a disparity in a proxy for the Gordon Bell Prize-winning application Nek5000.
Tanzima Islam is an assistant professor at Western Washington University (WWU). Before joining WWU, Dr. Islam earned her Ph.D. in Computer Engineering from Purdue University and was a postdoctoral scholar at Lawrence Livermore National Laboratory (LLNL). Broadly, Dr. Islam’s research interests lie in leveraging data-driven analysis approaches to understand the scalability bottlenecks of HPC applications and, where applicable, in developing system solutions to mitigate them. While her most recent work focuses on developing machine learning models for characterizing, comparing, and predicting the compute performance of applications, her past work involved developing scalable checkpointing systems and scientific data compression methods. Dr. Islam has published in international conferences and journals, including work that earned a couple of best paper nominations at SC, and she received the Director’s Science & Technology Award from LLNL for significantly scaling checkpointing systems in HPC environments. Her diverse research background enables Dr. Islam to collaborate regularly with various national laboratories, academic institutions, and industry.