SOMA: Observability, Monitoring, and In Situ Analytics in Exascale Applications

Dewi Yokelson, Oskar Johannes Lappi, Srinivasan Ramesh, Miikka Väisälä, Kevin Huck, Touko Puro, Boyana Norris, Maarit Korpi-Lagg, Keijo Heljanko, Allen D. Malony

Research output: Contribution to journal, Conference article, Scientific, peer-review

Abstract

With the rise of exascale systems and large, data-centric workflows, the need to observe and analyze high performance computing (HPC) applications during their execution is becoming increasingly important. HPC applications are typically not designed with online monitoring in mind; therefore, the observability challenge lies in being able to access and analyze interesting events with low overhead while seamlessly integrating such capabilities into existing and new applications. We explore how our service-based observation, monitoring, and analytics (SOMA) approach to collecting and aggregating both application-specific diagnostic data and performance data addresses these needs. We present our SOMA framework and demonstrate its viability with LULESH, a hydrodynamics proxy application. Then we focus on Astaroth, a multi-GPU library for stencil computations, highlighting the integration of the TAU and APEX performance tools and SOMA for application and performance data monitoring.
Original language: English
Journal: Concurrency and Computation: Practice & Experience
Number of pages: 13
ISSN: 1532-0626
Publication status: Published - 2 May 2024
MoE publication type: A4 Article in conference proceedings
Event: Cray User Group - Helsinki, Finland
Duration: 7 May 2023 to 11 May 2023

Fields of Science

  • 113 Computer and information sciences
