Monitoring and Debugging of High Performance Distributed Heterogeneous Cloud Applications

The new research project, entitled "Monitoring and Debugging of High Performance Distributed Heterogeneous Cloud Applications", is divided into four tracks, detailed on this page. Tracks 1, 2 and 4 are supervised by Professor Michel Dagenais, while Track 3 is supervised by Professor Daniel Aloise. For students whose project spans several tracks, co-supervision is possible.

Introduction

The communication and computing infrastructure is becoming ever more sophisticated, at an extremely rapid pace. Recent applications include 5G-connected mobile devices, autonomous cars, smart robots and intelligent digital assistants powered by Machine Learning. These advances are made possible by several technological developments at the hardware and software levels: central processing units with tens of cores, coprocessors for graphics and intensive computations (GPGPUs) with thousands of cores and over 18 billion logic elements, 5G low-latency high-speed networking, and Cloud-based infrastructures that execute requests in parallel.

As a result, even a simple operation such as initiating a phone call, making a Web search, routing a packet or displaying a video frame can involve many parallel cores on more than one processing unit, possibly spread across several servers. Moreover, the same operation, a few seconds later, may be served differently by different cores and physical servers in the Cloud. Understanding the performance of these operations has therefore become extremely difficult, and the tools for that purpose are severely lacking.

In this project, the tracing, profiling, debugging and monitoring tools for High Performance Distributed Systems will be extended to efficiently extract information from all units in all layers, from the hardware to the applications, and cope with the large number of cores and computers. The project has a specific focus on Cloud applications connecting to mobile and Internet of Things devices through Edge servers and 5G networks, High Performance Computing exploiting the new generation of shared memory GPGPUs, Machine Learning applications, and a new modular architecture for more integrated software development tools. As a result, the designers and operators of High Performance Distributed Systems will have the tools in hand to quickly analyse their system performance, automatically or manually find problems, and optimise operations.


Track 1: Distributed Applications in the Cloud, Edge and 5G network

In this track, we propose a new architecture and algorithms to efficiently monitor and analyse the performance of distributed applications, with numerous Cloud, Edge and IoT nodes, system and network virtualisation, container orchestration, micro-services and messaging frameworks.

We survey the field of container and Cloud orchestration, as in Kubernetes and OpenStack, and the associated monitoring, profiling, tracing and debugging tools. With input from teams at Ericsson, Ciena and EfficiOS, we propose a comprehensive instrumentation scheme to extract runtime data about the creation and scheduling of VMs and containers. Thereafter, a complete analysis and visualisation setup is proposed.

We survey the literature about Cloud, Edge, IoT and 5G environments, and about the scalability of Cloud monitoring and debugging tools. We work toward an efficient layered organisation, having some local processing and aggregation at each level, before sending information to the higher level. The survey also includes the area of streaming analytics (e.g. Apache Spark) in order to efficiently analyse trace data in parallel. We then propose and prototype an efficient, scalable, and hierarchical organisation for Cloud monitoring and debugging.
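The layered organisation described above can be sketched in a few lines of Python: each node pre-aggregates its raw events into compact per-event-type summaries before anything is sent to the next layer, so only summaries travel upward. The node names and the summary format are illustrative assumptions, not the project's actual schema.

```python
from collections import Counter

def aggregate_local(events):
    """Pre-aggregate raw events on an Edge/IoT node: keep only
    per-event-type counts and latency sums instead of raw records."""
    counts, latency = Counter(), Counter()
    for name, duration_us in events:
        counts[name] += 1
        latency[name] += duration_us
    return {"counts": counts, "latency_us": latency}

def merge_up(summaries):
    """Merge child summaries at the next layer (Edge -> regional -> Cloud)."""
    counts, latency = Counter(), Counter()
    for s in summaries:
        counts.update(s["counts"])
        latency.update(s["latency_us"])
    return {"counts": counts, "latency_us": latency}

# Two leaf nodes, each summarising its own raw events locally.
node_a = aggregate_local([("rpc", 120), ("rpc", 80), ("io", 300)])
node_b = aggregate_local([("rpc", 95), ("io", 210)])
site = merge_up([node_a, node_b])   # regional aggregation, then on to the Cloud
print(site["counts"]["rpc"])        # 3 rpc events seen across the site
```

Because `merge_up` is associative, the same operation can be reused at every level of the hierarchy, which is what keeps the bandwidth toward the top of the tree bounded.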

We also look at micro-services in the Cloud, first surveying the literature about micro-services and the associated monitoring tools (e.g. OpenTracing). Many newer Telecom and networking systems in industry are based on micro-services, and provide important use cases and requirements. We propose an efficient organisation to instrument micro-services, and to analyse and visualise the combined information. We also examine messaging frameworks used in the Cloud, such as AMQP and ZeroMQ; these frameworks are commonly used in financial analysis systems and in new modular Integrated Development Environments. We review the literature about messaging systems and related monitoring tools, and prototype different instrumentation and analysis strategies for Theia and for micro-service systems based on ZeroMQ.
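As a rough illustration of message-level instrumentation, the sketch below wraps each outgoing message in a tracing envelope so that the send and receive sides can be correlated afterwards. The envelope fields are hypothetical and only loosely modelled on OpenTracing-style context propagation; this is not an actual Theia or ZeroMQ integration.

```python
import json, time, uuid

def instrument_send(payload, parent_span=None):
    """Wrap an outgoing message with a tracing envelope (hypothetical
    schema): a trace id shared along the request, a fresh span id,
    and a send timestamp."""
    span = {
        "trace_id": parent_span["trace_id"] if parent_span else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "sent_ns": time.monotonic_ns(),
    }
    return json.dumps({"ctx": span, "body": payload}), span

def instrument_recv(raw):
    """Unwrap a message and emit a matching receive-side record,
    so the analysis can pair both ends and compute the latency."""
    msg = json.loads(raw)
    record = dict(msg["ctx"], recv_ns=time.monotonic_ns())
    return msg["body"], record

wire, span = instrument_send({"op": "quote", "symbol": "ABC"})
body, record = instrument_recv(wire)
assert record["trace_id"] == span["trace_id"]   # same trace on both sides
```

In a real deployment the envelope would ride in message headers (e.g. AMQP properties) rather than in the serialised body, but the correlation principle is the same.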


Track 2: Heterogeneous Multi-Core Coprocessors

In this track, we propose new algorithms and techniques to efficiently monitor and analyse the performance of systems using heterogeneous coprocessors. Newer high performance GPUs with shared virtual memory and user level queues, and networking data plane accelerators (including digital signal processors for cellular radio signals) are especially targeted.

We survey the field of GPU coprocessors and related tracing and profiling tools. This area is opening up, now that AMD has provided its GPU software development toolchain as Open Source. We study the architecture of new high performance GPUs and how they can be traced and profiled using software and hardware support. We then propose new algorithms for instrumenting GPU programs, extracting tracing and profiling data from GPUs, and analysing and visualising this information. We focus on several challenges, such as the limited bandwidth available to extract tracing data from thousands of cores, and the identification of computational bottlenecks, often characterised using Roofline Models. The proposed views and analyses will be presented to GPU experts for feedback.
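The Roofline Model mentioned above bounds a kernel's attainable performance by the minimum of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOP per byte). A minimal sketch, with hypothetical GPU figures (10 TFLOP/s peak, 900 GB/s bandwidth):

```python
def attainable_gflops(intensity, peak_gflops, mem_bw_gbs):
    """Roofline model: performance is capped either by peak compute
    or by memory bandwidth times arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, mem_bw_gbs * intensity)

# Hypothetical accelerator: 10 TFLOP/s peak, 900 GB/s memory bandwidth.
PEAK, BW = 10000.0, 900.0
ridge = PEAK / BW   # intensity above which a kernel becomes compute-bound

for name, ai in [("SAXPY", 0.25), ("GEMM", 40.0)]:
    bound = "compute" if ai >= ridge else "memory"
    print(name, attainable_gflops(ai, PEAK, BW), bound + "-bound")
```

A profiling tool that measures a kernel's achieved FLOP rate and memory traffic can place it under this roofline and report immediately whether optimisation effort should target data movement or computation.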

We survey the field of debuggers, study the new ROCm Debug Agent, review typical GPU applications and their debugging needs, and discuss with potential users working on High Performance and Machine Learning applications. We study the GDB internals, the relationships between the debugger and the IDE user interface, and the debugging capabilities offered by advanced GPUs and the ROCm Debug Agent. We propose extensions to the current debugging tools to represent important concepts of GPU programming, like vector operations, thread waves and hyperthreading. We propose an efficient architecture for the debugging toolchain, and efficient algorithms in order to connect the ROCm Debug Agent, GDB, and the graphical user interface in Theia.

We study the field of GPU virtualisation, Cloud management, container runtimes, and HPC queuing systems for supporting parallel applications. We propose efficient and flexible organisations for running parallel HPC applications (e.g. based on MPI, OpenMP, HIP and OpenCL) in the Cloud. We survey the different components of the networking stack, from network virtualisation down to the hardware support, and look at related performance analysis and debugging tools. We propose different instrumentation strategies, and specialised analyses and views for understanding the routing and latency of packets.


Track 3: Machine Learning and Performance Analysis Tools

In this track, two complementary research streams are followed. The first concerns the development of Machine Learning (ML) models for extracting knowledge from execution traces. Tracing is an extremely valuable means to detect, classify, and highlight the cause of anomalies in complex systems. The second stream focuses on the performance analysis of ML applications that use coprocessors (e.g. GPUs) and frameworks (e.g. Google TensorFlow).

We work to improve anomaly detection using kernel trace data. Because trace data resembles natural language, we examine the latest deep learning techniques from natural language processing (NLP), such as the self-attention mechanism. Self-attention can process variable-length sequences without recurrent connections, while providing insight into the importance of each part of the input. We focus on anomaly detection in micro-services, with the objective of automatically identifying faulty source code based on its history and system traces. We also investigate trace data from communication events.
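As a toy illustration of the mechanism, the pure-Python sketch below computes single-head scaled dot-product self-attention over a short sequence of event embeddings; the attention weights are the kind of per-event importance signal referred to above. The embeddings and the single-head simplification are illustrative only, not the models studied in this track.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention where queries,
    keys and values are all the raw event embeddings. Returns the
    contextualised vectors and the attention weight matrix; each
    row of weights says how much one event attends to the others."""
    d = len(seq[0])
    out, weights = [], []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[i] for wj, vj in zip(w, seq))
                    for i in range(d)])
    return out, weights

# Three toy event embeddings; the same code handles any sequence length,
# which is the property that makes self-attention attractive for traces.
events = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx, attn = self_attention(events)
```

Inspecting a row of `attn` after training a real model is one way to explain which trace events contributed most to an anomaly score.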

We work on the automatic exploration of executions by dynamically injecting tracepoints using ML. We aim to use a minimal set of traced system calls and to detect anomalies before they occur. A research objective is to dynamically enable subsets of events, depending on the predicted likelihood of an anomaly: more events yield a better prediction, but at a higher computational cost. We also trace popular ML models and common ML frameworks, in order to investigate which ML models are best suited to different computational settings, based on fine-grained performance analysis.
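The dynamic enabling of event subsets can be sketched as a tiered policy: the cheapest events stay on at all times, and costlier ones are switched on as the predicted anomaly likelihood rises. The thresholds and event names below are hypothetical, chosen only to illustrate the cost/accuracy trade-off, not taken from the project.

```python
def select_events(anomaly_prob, tiers):
    """Enable every tier whose threshold the predicted anomaly
    probability has reached; higher tiers add detail at higher cost."""
    enabled = []
    for threshold, events in tiers:
        if anomaly_prob >= threshold:
            enabled.extend(events)
    return enabled

# Hypothetical tiers: always-on baseline, then syscalls, then full detail.
TIERS = [
    (0.0, ["sched_switch"]),
    (0.5, ["syscall_entry", "syscall_exit"]),
    (0.9, ["irq_handler_entry", "kmem_mm_page_alloc"]),
]

print(select_events(0.2, TIERS))    # quiet system: baseline only
print(select_events(0.95, TIERS))   # likely anomaly: everything on
```

In a real deployment, the selected list would be fed back to the tracer's enable/disable interface, closing the loop between prediction and instrumentation cost.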

We work on ML models for trace analysis and anomaly detection using a minimal set of system calls, and extend the method to early detection. We also work on creating a repository of data sets. The massive data volume produced by tracing imposes computational constraints that are being addressed. We establish the characteristics of ML applications that need to be traced, and then develop visualisation tools in Trace Compass to analyse the execution of ML applications. After identifying the performance problems of each ML model, we provide guidelines for the efficient parallelisation of ML applications (e.g. on GPUs).


Track 4: New Architecture for Tool Integration

We propose a new organisation, within Integrated Development Environments (IDEs), to tackle the increased complexity of hardware and software platforms.

We survey existing IDEs, such as CDT, Che and Theia from Eclipse, and Visual Studio Code. We focus on the efficient interactions between the different components of the IDE: user interface (e.g. Electron), language servers (e.g. clangd), debuggers (e.g. GDB), trace analysis (Trace Compass backend) and profiling. We also study complementary tools that may interact with the development environment, including Continuous Integration tools and dashboards such as Kibana. We model the performance of Theia and other similar IDEs in order to detect scalability bottlenecks, and propose new algorithms and architectural improvements. The proposed alternatives are prototyped, discussed and evaluated with the help of development teams at Ericsson, Ciena, AMD and EfficiOS. We examine the problem of dynamically adjusting the level of detail for tracing, profiling and debugging tools, in the context of the limited resources available on some embedded platforms, or of systems with thousands of cores.

We survey the literature on dynamic instrumentation techniques, and on the static or dynamic selection of which functions and other code locations to instrument for tracing, memory debugging and other validation tasks. We propose and prototype different analyses and strategies to optimise the selection of instrumentation points. The proposed methods are prototyped to validate their effectiveness on real industrial cases.
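One simple way to frame the selection of instrumentation points is as a budgeted ranking problem, as in the illustrative greedy heuristic below: each candidate site has an estimated diagnostic value and a per-hit overhead, and probes are chosen by value-per-cost until an overhead budget is exhausted. The figures are invented, and this is a sketch of the problem, not a method proposed by the project.

```python
def pick_probes(candidates, budget_ns):
    """Greedy budgeted selection: rank candidate instrumentation
    points by diagnostic value per nanosecond of overhead, then
    take the best ones that still fit in the overhead budget."""
    ranked = sorted(candidates, key=lambda c: c["value"] / c["cost_ns"],
                    reverse=True)
    chosen, spent = [], 0
    for c in ranked:
        if spent + c["cost_ns"] <= budget_ns:
            chosen.append(c["site"])
            spent += c["cost_ns"]
    return chosen

# Hypothetical candidate sites with estimated value and per-hit cost.
candidates = [
    {"site": "malloc",       "value": 9, "cost_ns": 300},
    {"site": "memcpy",       "value": 4, "cost_ns": 50},
    {"site": "lock_acquire", "value": 8, "cost_ns": 120},
]
print(pick_probes(candidates, budget_ns=200))
```

A dynamic variant would re-estimate the value figures from the traces already collected, and re-run the selection as the workload evolves.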

We study the state of the art concerning the performance analysis of real-time applications, in the context of low-latency 5G networks and Edge Computing. We develop new analyses and views to highlight the portions of the system that create bottlenecks, affect the critical path, and thus degrade the real-time performance of the applications studied. We study the literature on specialised debugging tools and runtime verification libraries. We study the various tools used in industry for validation and debugging purposes, especially for memory access validation in embedded systems. We propose new algorithms and a new tool architecture for interactive runtime verification, based on Theia and GDB.
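Critical path analysis, as referred to above, can be sketched as a longest-path computation over the dependency graph of a request's operations: the chain of waits with the largest total duration is the one that determines end-to-end latency. The toy request below (durations in milliseconds) is purely illustrative.

```python
from functools import lru_cache

def critical_path(durations, deps):
    """Longest (critical) path through a DAG of operations; deps maps
    each operation to the operations it must wait for. Returns the
    total latency and the sequence of operations that caused it."""
    @lru_cache(maxsize=None)
    def finish(op):
        preds = deps.get(op, [])
        if not preds:
            return durations[op], (op,)
        t, path = max((finish(p) for p in preds), key=lambda r: r[0])
        return t + durations[op], path + (op,)
    return max((finish(op) for op in durations), key=lambda r: r[0])

# Toy request: rendering waits on both decoding and a slow DB query,
# so the DB branch dominates the critical path.
durations = {"recv": 2, "decode": 5, "db": 30, "render": 4}
deps = {"decode": ["recv"], "db": ["recv"], "render": ["decode", "db"]}
latency, path = critical_path(durations, deps)
print(latency, list(path))
```

Shortening `decode` here would change nothing, while shortening `db` would directly reduce the end-to-end latency; that distinction is exactly what critical path views aim to make visible.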
