Ongoing Project - Dorsal Laboratory

Tuning, Debugging and Monitoring High Performance Heterogeneous Systems at Exascale

The new research project, entitled "Tuning, Debugging and Monitoring High Performance Heterogeneous Systems at Exascale", is divided into four tracks, which are detailed on this page. Track 1 will be supervised by Professor Heng Li, Track 2 by Professor Maxime Lamothe and Track 3 by Professor Foutse Khomh. Tracks 1, 2 and 3 are co-supervised by Professor Michel Dagenais, while Track 4 is supervised by Professor Daniel Aloise.

Introduction

In recent years, computing infrastructure has continued its rapid development, with increased computing capacity and bandwidth, increased connectivity, and the widespread use of special-purpose heterogeneous processors for specific tasks. These very large scale computing systems are being used for networking and telecommunications, for data centers, and for scientific computing. The term Exascale (10^18 operations per second) illustrates well the current state of the art and the challenge of combining so many computations, from millions of cores, in a single application. The complexity at the hardware level comes from the very large number of concurrent nodes running asynchronously, the hardware heterogeneity, and the networking sophistication. The complexity at the software level comes from the distribution of the computation over several networked nodes that may fail, and from the large number of software layers.

Track 1: Runtime tracing, debugging and verification

The first track focuses on low-level data collection mechanisms that either store the information in a trace or use it directly at runtime to verify the coherence of the execution. This includes the collection of data in traces or profiles, interactions with a debugger, and specialized runtime verification tools (e.g., validating memory accesses). The challenge, in order to understand the performance and behavior of even the most complex applications, is to obtain good observability at all levels, from the operating system to the frameworks and applications, with negligible overhead. Indeed, a higher overhead may prevent such methods from being used to diagnose issues in production, when this is required. It may also modify the behavior of the system under scrutiny, thus strongly affecting the validity of the analysis. Several sources of data are already available, such as static tracepoints in the Linux kernel and in userspace applications, with LTTng and UST, dynamic tracepoints in the Linux kernel with kprobe, nested request spans with the OpenTracing / OpenTelemetry API, and OTF2 traces with Score-P in HPC applications. Userspace dynamic tracepoints, traced from the kernel, are also available to LTTng with uprobe, but their overhead is higher.

Both static and dynamic instrumentation remain a largely manual effort. Developers need to determine where and how to instrument their application at the source code or binary level, as well as which tracepoints to enable, when they want to monitor an application. Such processes require a deep understanding of the traced programs. Given a tracing budget, the aim is to activate only the tracepoints that capture the most useful and complementary information. A combination of static and dynamic analysis techniques can be used for such a task.
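One way to picture tracepoint selection under a budget is as a greedy coverage problem. The sketch below is illustrative only. The per-tracepoint costs, the "behaviors" each tracepoint is said to cover, and the greedy gain-per-cost rule are all assumptions made for the example, not the method the project will use.

```python
def select_tracepoints(candidates, budget):
    """Greedily pick tracepoints that add the most new coverage per unit cost.

    candidates: dict name -> (cost, set of covered behaviors); values are hypothetical.
    budget: total overhead allowed, in the same (arbitrary) unit as the costs.
    """
    selected, covered, spent = [], set(), 0.0
    remaining = dict(candidates)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (cost, behaviors) in remaining.items():
            gain = len(behaviors - covered)  # only count information not yet captured
            if cost <= budget - spent and gain > 0:
                ratio = gain / cost
                if ratio > best_ratio:
                    best_name, best_ratio = name, ratio
        if best_name is None:  # nothing affordable adds new information
            break
        cost, behaviors = remaining.pop(best_name)
        selected.append(best_name)
        covered |= behaviors
        spent += cost
    return selected, covered, spent


# Hypothetical costs and coverage sets, for illustration only.
candidates = {
    "sched_switch": (3.0, {"scheduling", "cpu_usage"}),
    "syscall_entry": (5.0, {"io", "syscalls", "cpu_usage"}),
    "net_dev_xmit": (2.0, {"network"}),
    "block_rq_issue": (2.0, {"io"}),
}
selected, covered, spent = select_tracepoints(candidates, budget=7.0)
print(selected, spent)
```

With these made-up numbers, the expensive tracepoint is skipped because cheaper ones already cover most of what it would add. In practice, the cost and coverage estimates would have to come from the static and dynamic analyses mentioned above.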
Hardware tracing is now available on several platforms like Intel x86-64 and ARM. For example, one can use hardware tracing to record the sequence of executed instructions (i.e., program flow trace). When data tracing is not available, it may be possible to instrument memory accesses with a hardware tracing statement (e.g., the ptwrite instruction on Intel x86-64) that can send a 64-bit argument into the trace in 1 or 2 execution cycles, much faster than pure software tracing. This constitutes another source of tracing data that may provide detailed information at a reduced overhead.
Static tracing allows developers to insert tracepoints in the source code of programs under investigation. The tracepoints are then compiled together with the traced program. Dynamic tracing allows developers to add tracepoints into a running program. Dynamic tracing typically introduces a higher performance overhead, as it involves the dynamic insertion and removal of tracepoints. Facilities for dynamic tracing include kprobe (for the Linux kernel) and uprobe (for user space tracing). Although such dynamic tracing tools allow developers to trace the kernel and applications at runtime, they present important limitations in terms of overhead and of where tracepoints can be inserted.
Programming languages like C/C++ allow direct access to memory objects through pointers, which can lead to memory-related faults (e.g., memory corruption or memory leaks) because of illegal memory accesses (IMAs). IMAs can often lead to non-deterministic failures that are difficult to detect and diagnose. Thus, it is important to verify memory accesses and detect such IMAs automatically. More flexible approaches may be used to selectively monitor and verify accesses to specific objects.

Track 2: Heterogeneous and High Performance Computing

The second track concerns Heterogeneous and High Performance Computing (HPC), and particularly concentrates on the efficient collection of data to monitor the performance and behavior of such systems. HPC supercomputers process data with thousands of nodes and enable applications such as handling data sent by millions of cellular devices, predicting the weather, simulating test scenarios for drugs, and training machine learning models for interactive chat systems and self-driving vehicles. Almost all the top supercomputers in the world use a heterogeneous architecture with CPUs controlling the different nodes and GPU co-processors providing the bulk of the computational power. These systems use specific computing models, such as MPI, OpenMP, and CUDA, that are not used in more traditional CPU computing systems. Their sheer scale and specific computing models mean that the analysis of their monitoring data must be adapted to their specific situation.

At first glance, HPC may be seen as simply a special case of large distributed heterogeneous Cloud systems. However, HPC presents undeniable differences that deserve special treatment. The first obvious difference is the scale of the computation. In a large Cloud system, tens of servers, among thousands, will typically collaborate to complete a specific request, while other servers process different requests or applications. In an HPC system, the thousands of computers may be working together on a single very large problem. For instance, Frontier, the top supercomputer in the world, comes with close to 10,000 nodes, each with 64 CPU cores and 4 MI250X GPUs. Such highly parallel co-processors are also used in other applications, like Telecom Radio Base Stations, where highly parallel Digital Signal Processors are used. Those GPU co-processors contain hundreds of SIMD processors, each with tens of ALUs. When tracing such systems, the scale of the distributed system involved brings specific challenges.

HPC generally relies heavily on GPUs for parallel computations. To achieve maximum performance, HPC users must understand how to use the available resources most efficiently. Tracing and profiling applications is an essential part of their optimization. Some of the compute kernels sent to GPUs are complex, and should therefore be traced on the GPU itself. However, tracing for GPUs is still in its infancy and has yet to catch up to the capabilities of CPU tracing. On CPUs, software tracers like LTTng or ETW are prevalent, and offer the kind of flexibility that would be necessary to handle compute kernels with thousands of lines of code. This is not yet the case for GPUs. GPU processors offer many performance counters for profiling, and limited documented facilities for hardware tracing. The proposed work will focus on minimizing the tracing overhead for GPU programs and on exploring new applications for software GPU tracing.
Supporting HPC, and particularly GPU programming models, requires a suitable debugging infrastructure. Recently, through ROCgdb, AMD has allowed much greater visibility into the state of their GPUs during execution. However, the scale of the GPUs poses specific challenges with thousands of ALU core threads, grouped in waves, and as many registers to monitor and visualize through the debugger. These same challenges are also faced by other parallel heterogeneous architectures, such as those for signal processing in Radio Base Stations. This brings several new difficulties to the debugging engine and the associated user interface. While some work was initiated to connect ROCgdb to the Theia IDE, much remains to be done. The proposed project will address the problem at two levels. First, the project will address the scale of data, both its extraction from the GPU and its management from within ROCgdb. Second, the project will explore views to efficiently present this data to users, without overwhelming them with thousands of monitored threads and variables.
While efficient algorithms and techniques are indeed essential to the highly scalable tracing, profiling, and debugging necessary for GPUs and HPC, the results of these algorithms must ultimately be presented to human beings. Indeed, global computations like the critical path analysis, as used in CPU tracing, can become a bottleneck when handling traces from large parallel systems such as GPUs. It is unreasonable to expect humans to effectively monitor thousands of compute nodes working in parallel. Therefore, suitable analysis and visualisation techniques must be used to process the data generated by GPU and HPC traces.
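To see why the critical path is a global computation, consider a deliberately simplified sketch that treats an execution as a dependency graph of tasks and finds the longest chain through it. The task names and durations are hypothetical. Trace analysis tools typically derive such dependencies from blocking and wake-up events rather than from a ready-made task graph, so this illustrates the idea, not any tool's actual algorithm.

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (total_duration, path) for the longest chain in a task DAG.

    durations: dict task -> duration (hypothetical units)
    deps: dict task -> set of tasks that must finish before it starts
    """
    graph = {task: deps.get(task, set()) for task in durations}
    finish, prev = {}, {}
    # Visit tasks so that every predecessor is processed before its successors.
    for task in TopologicalSorter(graph).static_order():
        best_pred = max(graph[task], key=lambda p: finish[p], default=None)
        start = finish[best_pred] if best_pred is not None else 0
        finish[task] = start + durations[task]
        prev[task] = best_pred
    # Walk back from the task that finishes last.
    node = max(finish, key=finish.get)
    total, path = finish[node], []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


durations = {"load": 2, "kernel_a": 5, "kernel_b": 3, "reduce": 1}
deps = {
    "load": set(),
    "kernel_a": {"load"},
    "kernel_b": {"load"},
    "reduce": {"kernel_a", "kernel_b"},
}
total, path = critical_path(durations, deps)
print(total, path)
```

Every task and every dependency has to be visited before the answer is known, so the cost of this computation grows with the size of the whole trace. That is one reason it scales poorly to systems with thousands of parallel units.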

Track 3: Distributed Cloud Applications and IDE

Distributed applications are increasingly deployed in the Cloud, as modular micro-services, within a complete framework consisting of virtual machines (e.g., Linux KVM), orchestrated containers (e.g., Kubernetes and Docker), message-oriented middleware (e.g., ZeroMQ), and language runtimes (e.g., Java JVM or JavaScript Node.js). As a consequence, it becomes very difficult to debug performance problems, because of the numerous layers involved, especially if the problem lies at the intersection of two or more layers. Another challenge, in this complex multi-layered context, is to understand why some requests are served quickly, while others encounter much longer latencies. Finally, the architecture of Integrated Development Environments (IDEs) must adapt to the topology and scale of these distributed applications. The objective of Track 3 is to address those challenges.

The environment in which industrial partners deploy newer distributed applications presents new challenges. In particular, the problems encountered, and the shortcomings of the tools available to monitor and diagnose them, will be closely examined. Specific instrumentation, within the virtualisation, container orchestration, message-oriented middleware and language runtime layers, is necessary to reconstruct at analysis time all the interactions between the different micro-services. New algorithms for the analysis of the events produced by this instrumentation, and views to graphically display the interactions between the modules, will be developed.
In networking applications, dedicated packet processing hardware is added to general purpose computers to obtain heterogeneous computers that are both very flexible and achieve very high packet processing rates. Because of the complex and sophisticated software stack running on those systems, problems do occur and sometimes require restarting the system. Waiting until the system has restarted is not acceptable, because these complex software stacks, supporting numerous networking protocols, take a long time to restart, including reloading all the routes. It is possible to fail over to a redundant system, but this doubles the hardware cost. An interesting solution is to have a failover virtual machine, running on the same hardware, already initialised and ready to take over the network load. The challenge there is adding this virtualisation layer, which allows reusing the same hardware for the failover virtual machine, without affecting the performance. Different techniques have been developed to obtain native performance in a virtual machine, like the static partitioning of resources (memory and CPU cores), and direct access to hardware with PCI passthrough, VirtIO or SR-IOV. The difficulty is that any imperfection in the setup will prevent getting native performance from the virtual machine.
With proper tracing and monitoring tools, it may be possible to fully characterize the performance of requests through all the layers involved. One can then compare requests that differ in performance, between two versions of the software, on two different hardware setups, or under different loads. Sometimes, even though the software and environment have in theory not changed, different requests differ significantly in performance, seemingly randomly. Analysing several requests in detail, to understand the differences, takes time.
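As a minimal sketch of what such a comparison could look like once per-layer timings have been extracted from a trace, the example below ranks layers by how much extra time a slow request spent in each. The layer names and the microsecond values are hypothetical, and a per-layer time breakdown is assumed to be already available.

```python
def layer_deltas(fast, slow):
    """Rank layers by the extra time the slow request spent in each of them.

    fast, slow: dict layer -> time spent, in microseconds (hypothetical values).
    A layer missing from one request is treated as taking zero time there.
    """
    layers = set(fast) | set(slow)
    deltas = {layer: slow.get(layer, 0) - fast.get(layer, 0) for layer in layers}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


fast_request = {"vm": 200, "container": 300, "middleware": 1000, "runtime": 2500}
slow_request = {"vm": 200, "container": 400, "middleware": 6000, "runtime": 2700}
ranking = layer_deltas(fast_request, slow_request)
for layer, extra in ranking:
    print(f"{layer}: +{extra} us")
```

Here the middleware layer stands out as the main contributor to the difference. Automating this kind of ranking over many requests is one way to reduce the time spent analysing them one by one.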
Through the years, a large number of specialised views were developed for trace analysis and viewing tools such as Trace Compass. The resulting large number of views presents several challenges in terms of software maintenance, but more importantly in terms of overload for the user. There are too many views to choose from, and too many different concepts used throughout. The objective is to propose an evolution of the Integrated Development Environments (IDEs), for Cloud applications, to ensure that they can easily connect to the different components and layers of those applications, while offering a coherent and reduced set of views for displaying the results.

Track 4: Automating System Monitoring and Anomaly detection

Machine learning (ML) is a powerful tool employed to analyze vast amounts of data generated by monitored systems. ML algorithms can detect patterns and anomalies that might be difficult or impossible for humans to identify, thus reducing the amount of manual intervention required for high level analysis of tracing and monitoring data. For instance, ML can be used to predict when a system might fail, or when certain resources might become overloaded. This allows system administrators to take proactive measures to prevent issues before they occur. Due to the complexity of modern computer systems, novel and unexpected behaviors frequently occur. Such deviations are either normal, such as software updates and new users, or abnormal, such as misconfigurations, intrusions and bugs. Regardless, novel behaviors are of great interest to developers, and there is a genuine need for efficient and effective methods to detect them. Researchers now consider system calls to be the most fine-grained and accurate source of information to investigate the behavior of computer systems. The latest research by Daniel Aloise and his team involves leveraging a probability distribution over sequences of system calls. This approach can be thought of as a language model that estimates the likelihood of specific sequences. Since novelties, by definition, differ from previously observed behaviors, they are unlikely to be generated by this model. Building on the success of neural networks for language modeling, these works have proposed transformer architectures for the task.
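The idea of scoring system call sequences with a language model can be illustrated with a much simpler stand-in than a transformer. The sketch below uses an add-one smoothed bigram model, chosen only because it fits in a few lines. The class name, the training sequences and the test sequences are all hypothetical, and this is not the architecture used in the research described above.

```python
import math
from collections import Counter, defaultdict


class BigramSyscallModel:
    """Add-one smoothed bigram model over system call names (illustrative stand-in)."""

    def __init__(self):
        self.pair_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, sequences):
        for seq in sequences:
            self.vocab.update(seq)
            for prev, cur in zip(seq, seq[1:]):
                self.pair_counts[prev][cur] += 1

    def avg_log_likelihood(self, seq):
        """Mean log-probability per transition; lower values suggest novelty.

        Expects a sequence with at least two system calls.
        """
        size = len(self.vocab) + 1  # one extra slot for unseen system calls
        pairs = list(zip(seq, seq[1:]))
        total = 0.0
        for prev, cur in pairs:
            counts = self.pair_counts.get(prev, Counter())
            total += math.log((counts[cur] + 1) / (sum(counts.values()) + size))
        return total / len(pairs)


model = BigramSyscallModel()
model.fit([
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
    ["open", "write", "close"],
])
familiar_score = model.avg_log_likelihood(["open", "read", "close"])
novel_score = model.avg_log_likelihood(["open", "mmap", "execve", "close"])
print(familiar_score, novel_score)
```

The sequence containing transitions never seen during training receives a lower average log-likelihood, which is the signal a novelty detector would threshold on. A neural language model plays the same role, with a far richer notion of context than the single previous call used here.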

Computer systems are constantly evolving, resulting in an expanding set of known behaviors. Consequently, language models require continuous updates to adapt to these changes. A recent study observed that autoregressive language models trained at a sufficient scale possess an impressive ability to learn new language tasks with only a few examples. The aim is to assess their capacity to assimilate new behaviors with limited samples. Detecting novel behaviors as they arise is undeniably crucial. However, scaling neural networks often comes with increased computation and memory requirements.
In addition to scalability, interpretability and robustness are crucial aspects for the use of language models in the context of automated systems monitoring. While attention weights from Transformer language models may not directly elucidate the model's output, we will focus on enhancing their interpretability through recent techniques, such as averaging attention scores. Robustness is another significant concern, and various techniques have been proposed to tackle this issue. We will delve into the application of sharpness-aware minimization (SAM), which aims to identify parameters in neighborhoods with consistently low loss.
Another application that we have been studying with the industrial partners revolves around the analysis of logs, to group related error messages and identify those that should be investigated first. Log files are an essential component of modern software engineering. They are generated by the application as it executes, and they record various events and actions performed by the system, such as user actions, errors, warnings, and system events. Their importance has grown significantly in recent years, mainly because they contain valuable information about the behavior of software applications. By analyzing log files, developers can gain a better understanding of how the application is operating, identify potential issues or bugs, and troubleshoot errors that may be occurring. However, modern software applications generate a tremendous amount of data, including logs that can number in the millions or even billions of entries. Without an effective tool to support their work, engineers would have to sift through all of these entries manually, which is impractical and time-consuming. By training ML models, engineers can deploy tools that can quickly and accurately analyze log files to identify patterns and trends in the data.
The proposed approach will use row-wise data embeddings to determine the similarity between the provided context and the provided log lines. This is expected to allow engineers to quickly locate candidate lines within the log that indicate the source of errors or warnings in the monitored system, facilitating further analysis.
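The similarity step can be sketched as follows, with a toy embedding standing in for a learned one. Here each line is "embedded" as a bag of lowercase tokens and compared to the context with cosine similarity. The function names, the context string and the log lines are hypothetical, and a real system would replace `embed` with a trained embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_log_lines(context, log_lines):
    """Return (score, line) pairs sorted from most to least similar to the context."""
    ctx = embed(context)
    scored = [(cosine(ctx, embed(line)), line) for line in log_lines]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


log_lines = [
    "INFO server started on port 8080",
    "ERROR connection to database timed out after 30s",
    "WARN disk usage at 85 percent",
]
ranked = rank_log_lines("database connection timeout", log_lines)
for score, line in ranked:
    print(f"{score:.2f}  {line}")
```

Only the line sharing vocabulary with the context gets a non-zero score here. A learned embedding would additionally relate words such as "timeout" and "timed out", which token counts cannot do.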
Detecting causality between logs is distinct from the task of identifying log anomalies. In log anomaly detection, the objective is to pinpoint log lines that deviate from the norm among a larger set of regular log lines. On the other hand, methods aimed at uncovering causality between logs generate a measure of causality potential. For instance, a coefficient ranging from 0 to 1 may be assigned, where 0 indicates an error log line that has no potential to generate other errors (such as a build log warning that does not impact the deployment process), while 1 represents an error log line that triggers subsequent errors (such as a memory leak error). Presenting this information to software developers or infrastructure analysts can prove invaluable in assessing the significance of addressing a specific error.
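To make the idea of a 0-to-1 coefficient concrete, the sketch below computes one very crude proxy: the fraction of occurrences of a given error that are followed, within a short window, by a different error. This estimator, the event names and the window size are all assumptions made for illustration. Following an event is not the same as being caused by it, so this is not the causality measure the project proposes.

```python
def follow_on_ratio(events, source, error_names, window=2):
    """Fraction of `source` occurrences followed by another error within `window` events.

    events: ordered list of log event names (hypothetical)
    error_names: set of names that count as errors
    """
    hits = total = 0
    for i, name in enumerate(events):
        if name != source:
            continue
        total += 1
        following = events[i + 1 : i + 1 + window]
        if any(e in error_names and e != source for e in following):
            hits += 1
    return hits / total if total else 0.0


error_names = {"build_warning", "oom", "alloc_failed", "crash"}
events = [
    "build_warning", "info", "info", "info",
    "oom", "alloc_failed", "info",
    "oom", "crash",
    "build_warning", "info", "info",
]
warning_score = follow_on_ratio(events, "build_warning", error_names)
oom_score = follow_on_ratio(events, "oom", error_names)
print(warning_score, oom_score)
```

In this toy log, the build warning is never followed by another error and scores 0, while the out-of-memory error is always followed by one and scores 1, mirroring the two extremes described above.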