Automated monitoring and debugging of large scale manycore heterogeneous systems
The new research project, entitled "Automated monitoring and debugging of large scale manycore heterogeneous systems", is divided into four tracks, which are detailed on this page.
- Track 1: Data collection through the whole hardware/software stack
- Track 2: Architecture for real-time and scalable system-level and cloud-level monitoring and analysis
- Track 3: Anomaly Detection and diagnosis with Machine Learning
- Track 4: Tracing and debugging support for advanced programming environments
The communication and computing infrastructure has evolved through the years, becoming more efficient, sophisticated, integrated and networked. Traditional central processing units are now supported by specialised co-processing units that speed up specific tasks such as graphics display, networking, signal processing or even Machine Learning. These newer heterogeneous systems are becoming more complex at an ever faster rate and are used not only in mobile devices and servers, but also in intelligent devices (the Internet of Things, or IoT) such as autonomous cars, smart robots or automated video surveillance. These processing units are highly parallel and may contain over 8 billion logic elements (transistors) each. For example, newer Graphics Processing Units (GPUs), often used for General Purpose computing (GPGPU), contain several thousand computing cores.
As a result, even a simple operation such as initiating a phone call, making a Web search, routing a packet or displaying a video frame can involve many parallel cores on more than one processing unit, possibly on several servers. Moreover, the same operation a few seconds later may be served in a different way, by different cores and physical servers. Therefore, understanding the performance of these operations has become extremely difficult and the tools for that purpose are severely lacking. In this project, the tracing, monitoring, profiling and debugging tools for manycore systems will be rearchitected to efficiently extract information from all units in all layers, from the hardware to the application, and to cope with the large number (several thousand) of cores. Furthermore, as manual problem investigation is becoming increasingly difficult, given the sophistication of the systems and the thousands of cores, a particular emphasis of this research project is to develop new methods and algorithms to automate the analysis of the extracted monitoring data through Machine Learning techniques.
The availability of these tools will simplify and automate the debugging, tuning and monitoring of new complex applications running on heterogeneous manycore processors in the era of the Cloud and the Internet of Things. With these tools, engineers will be able to quickly understand the behavior and performance of a system and optimise its operation, leading to a faster design of more efficient products.
Track 1 will study the tracing information that may be available at different levels in the complete hardware and software stack, in order to ensure that all the needed information can be extracted.
The foundation for monitoring tools is low-disturbance data collection through the whole software and hardware stack. LTTng has an infrastructure in place to efficiently collect tracing data from tracepoints inserted statically or dynamically in the operating system, in applications, and even in applications running on bare-metal. In addition, hardware tracing support is now available in most general purpose central processing unit architectures, such as ARM, Freescale QorIQ and Intel x86; these often provide the lowest overhead and good scalability.
The problem, however, is that newer specialised co-processors have not reached the same architectural maturity and offer less hardware support for tracing and profiling. GPGPUs do contain performance counters and limited hardware tracing support. However, this support is typically undocumented and only accessible in a limited way through closed-source libraries and tools. This is changing with AMD opening up a large portion of its software stack through the GPUOpen initiative. Similarly, new sources of tracing data will be needed with other co-processors such as the Adapteva Epiphany V or the Google Tensor Processing Unit. The same holds true for telecom and networking equipment such as switches, where packet co-processors are coupled with general purpose central processing units. Tracing these packet co-processors is required for tracking difficult performance or logic bugs.
The main difficulty, when many diverse sources of tracing data are available, is to properly correlate the events from different sources that correspond to interactions. As a first step, time synchronisation is required. Efficient algorithms have been developed for this purpose but they need to be adapted to the different types of interactions, since these serve as reference points for the synchronisation. Even more difficult is to follow all the links between the events in the different sources and layers. To be able to compute the critical path for a certain task, sufficient information must be available. We have obtained extremely interesting results for many applications interacting through standard system calls. However, for interactions between the central processing unit and co-processors such as GPGPUs, this information is much more difficult to obtain.
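As an illustration of the time synchronisation step, the sketch below estimates the clock offset between two trace sources from matched message events, which play the role of the reference points mentioned above. It is a minimal example and not the algorithm used in the project: it assumes no clock drift and a symmetric minimum latency, and the function names and the event format are hypothetical.

```python
from typing import Iterable, Tuple


def estimate_clock_offset(
    a_to_b: Iterable[Tuple[float, float]],
    b_to_a: Iterable[Tuple[float, float]],
) -> float:
    """Estimate the offset of clock B relative to clock A (t_b ~= t_a + offset).

    a_to_b: (send time on A's clock, receive time on B's clock) pairs.
    b_to_a: (send time on B's clock, receive time on A's clock) pairs.
    """
    # The apparent one-way delays mix the true latency with the clock offset:
    #   A -> B: recv_b - send_a = latency + offset
    #   B -> A: recv_a - send_b = latency - offset
    min_ab = min(recv - send for send, recv in a_to_b)
    min_ba = min(recv - send for send, recv in b_to_a)
    # Assumption: the smallest latency is the same in both directions,
    # so it cancels out when the two minima are subtracted.
    return (min_ab - min_ba) / 2.0


def to_clock_a(t_b: float, offset: float) -> float:
    """Convert a timestamp taken on clock B to the time base of clock A."""
    return t_b - offset


# Hypothetical matched events; clock B is 5 time units ahead of clock A.
a_to_b_events = [(10.0, 16.0), (20.0, 28.0)]
b_to_a_events = [(30.0, 26.0), (40.0, 37.0)]
offset = estimate_clock_offset(a_to_b_events, b_to_a_events)  # 5.0
```

Interactions other than network messages, such as a kernel launch on a co-processor and its completion notification, could supply the matched pairs in the same way, provided both ends record a timestamp.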
Many systems are now capable of producing extremely detailed tracing data. The challenge then becomes to select which data sources to activate, and for which time interval. Efficient algorithms were developed to provide a framework to trigger data collection and storage upon encountering specific trigger conditions (e.g., large system call latency) in the operating system or in applications. Similar new algorithms and mechanisms are required for data sources coming from different co-processors.
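The idea of triggering data collection on a specific condition can be sketched as follows. This is a simplified illustration and not the LTTng mechanism: recent events are kept in a bounded in-memory buffer, and the buffer is only saved when an event exceeds a latency threshold. The class name, the event dictionary layout and the "latency_ns" field are assumptions made for the example.

```python
from collections import deque
from typing import Deque, Dict, List


class TriggeredRecorder:
    """Keep the most recent events and save them when a trigger fires."""

    def __init__(self, capacity: int, latency_threshold_ns: int) -> None:
        # Bounded buffer: the oldest events are dropped automatically.
        self.buffer: Deque[Dict] = deque(maxlen=capacity)
        self.latency_threshold_ns = latency_threshold_ns
        self.snapshots: List[List[Dict]] = []

    def on_event(self, event: Dict) -> None:
        self.buffer.append(event)
        # Trigger condition (hypothetical): a large system call latency.
        if event.get("latency_ns", 0) > self.latency_threshold_ns:
            # Copy the buffer so that the events leading up to the
            # problem are preserved for later analysis.
            self.snapshots.append(list(self.buffer))


recorder = TriggeredRecorder(capacity=3, latency_threshold_ns=1000)
for latency in (10, 20, 30, 5000, 40):
    recorder.on_event({"name": "syscall_exit", "latency_ns": latency})
```

With these inputs, a single snapshot is stored, holding the slow event and the two events that preceded it; everything else is discarded.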
Track 1 aims to develop algorithms and techniques to better integrate new data sources, associated with co-processors, into the tracing and monitoring framework. These sources include hardware traces, hardware performance counters and software instrumentation in the runtime support. The goal is to obtain information about all important events of the execution, and to link them to events in the central processing unit. In addition, Track 1 will propose algorithms to dynamically adjust the level of tracing detail.
Track 2 will propose a new architecture and algorithms to provide a hierarchical streaming framework for the control, monitoring, aggregation and analysis of tracing and debugging data.
Once tracing sources become available, a proper framework is required to monitor, aggregate and process this data. The traditional approach is to collect all data during the execution and process it at a later time. However, when real-time online monitoring is required, sometimes to dynamically adjust the tracepoints to activate and snapshots to record, this approach is not sufficient.
The challenge exists at several levels. Within a single processing unit, whether a 4096-core GPGPU or a 1024-core Epiphany V, it is impossible to route all the available detailed tracing data to the outside without severely impacting the performance and changing the system behavior. Thus, a suitable organisation should be proposed to reduce the data collected at any given time through selective activation, filtering, aggregation, sampling and similar techniques. To this end, it may be required to dedicate some of the available cores to monitoring tasks, such as aggregation and anomaly detection. The same problem arises at the next level, when a cluster or cloud contains thousands of nodes. Here again, some of the nodes may dedicate part of their resources to monitoring tasks.
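The aggregation idea can be illustrated with a small sketch in which each core reduces its events to a compact summary, and the summaries are then merged at the next level (processing unit, node, cluster). This is only one possible organisation; the Summary type, its fields and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Summary:
    """Compact statistics forwarded instead of the raw events."""
    count: int = 0
    total_ns: int = 0
    max_ns: int = 0

    def add(self, duration_ns: int) -> None:
        self.count += 1
        self.total_ns += duration_ns
        self.max_ns = max(self.max_ns, duration_ns)

    def merge(self, other: "Summary") -> "Summary":
        return Summary(
            self.count + other.count,
            self.total_ns + other.total_ns,
            max(self.max_ns, other.max_ns),
        )


def summarize_core(durations_ns: Iterable[int]) -> Summary:
    """Run locally on (or next to) one core: reduce events to a summary."""
    summary = Summary()
    for duration in durations_ns:
        summary.add(duration)
    return summary


def aggregate(summaries: Iterable[Summary]) -> Summary:
    """Run on a monitoring core or node: merge lower-level summaries."""
    result = Summary()
    for summary in summaries:
        result = result.merge(summary)
    return result


# Two cores report summaries; only three numbers per core leave the unit.
per_core = [summarize_core([100, 300]), summarize_core([200])]
unit_summary = aggregate(per_core)
```

Because merge takes summaries and returns a summary, the same function can be applied again at the node and cluster levels, which is what makes the scheme hierarchical.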
The monitoring infrastructure ultimately serves to detect and diagnose problems. In a large system running continuously, it is simply impossible to record everything and browse through stored traces at a later time. The trace viewing and analysis tools thus need to be integrated with the runtime system, in order to interact with the monitoring and aggregation processes. These may in turn decide to store portions of traces for later detailed interactive analysis.
At this scale, the trace analysis itself will benefit from specialised co-processors (e.g., GPGPUs) and parallel processing streaming frameworks. While it would be conceptually cleaner to completely separate the trace monitoring and analysis framework from the system monitored, it is often more efficient to process the data locally, where it is created, and use dedicated resources within the co-processing units and the cluster to perform the required processing. A suitable interface will then be required between these distributed monitoring and analysis nodes and the user display application.
In Track 2, the aim is to propose a new architecture and algorithms to provide a hierarchical streaming framework for the control, monitoring, aggregation and analysis of tracing and debugging data. Moreover, this flexible architecture must support specialised user-defined analysis modules and interface with different display applications.
Track 3 will study the problem of automated analysis, proposing new algorithms and data processing to reduce the manual intervention required to detect, diagnose and correct problems in the monitored systems.
Low-level tracing is mainly used by system experts who are able to devise efficient strategies to quickly find anomalies and problems. However, as the number and complexity of digital systems increase, the need for automated monitoring, anomaly detection and problem diagnosis becomes ever more obvious. During the discussions with the industrial partners, automated analysis was identified as an important common need. A system may perform badly because of an improper configuration, a change in the environment, unusual and possibly malicious network traffic, or simply an inefficient code modification. With proper tracing tools, a human will eventually find the problem by reading trouble reports (TR), looking at various metrics, comparing sequences of events, and contrasting the behavior of the problematic system with that of a correct system.
Recent advances in Machine Learning have led to impressive results in automating decision and diagnosis tasks, whether for intrusion detection, malware detection, Internet traffic classification, code correlation and program optimisation or in other applications like game playing, algorithmic trading or medical diagnosis. These techniques will be harnessed to automatically find correlations between changes in the code, the configuration or the environment, and the changes in performance.
There are several libraries and frameworks for applying Machine Learning techniques, such as Weka, MOA, Apache Spark, Apache SINGA and Google TensorFlow. These tools are flexible in terms of input data, but a proper structure and model are required. The complexity and volume of the tracing data, and the complexity of the multi-core, multi-node, multi-layer systems studied, are significant challenges.
In track 3, Anomaly Detection and diagnosis with Machine Learning, the aim is to enable the developer to properly model the semantics of the tracing events. Then, with the derived metrics and links, Machine Learning techniques will be proposed to group different trace segments, corresponding to different task executions, into clusters. These clusters (e.g., fast and slow executions) will then be compared in order to identify the differences (code versions, configuration, network traffic) and thus the underlying root causes, possibly linked with trouble reports.
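A toy version of this grouping and comparison step is sketched below. It is not the method that the project will propose: it uses a plain one-dimensional 2-means on task durations as a stand-in for the clustering, and a simple frequency comparison as a stand-in for the root cause analysis. The segment layout ("duration" and "attrs") and the function names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def split_fast_slow(durations: List[float], iterations: int = 20) -> List[int]:
    """Label each duration 0 (fast) or 1 (slow) with 1-D 2-means."""
    lo, hi = min(durations), max(durations)  # initial cluster centres
    labels = [0] * len(durations)
    for _ in range(iterations):
        labels = [0 if abs(d - lo) <= abs(d - hi) else 1 for d in durations]
        fast = [d for d, label in zip(durations, labels) if label == 0]
        slow = [d for d, label in zip(durations, labels) if label == 1]
        if not slow:  # all executions look alike, nothing to split
            break
        lo, hi = sum(fast) / len(fast), sum(slow) / len(slow)
    return labels


def attribute_differences(
    segments: List[Dict], labels: List[int]
) -> List[Tuple[Tuple[str, str], float]]:
    """Rank (attribute, value) pairs by how much more often they occur
    in slow executions than in fast ones."""
    fast: Counter = Counter()
    slow: Counter = Counter()
    n_fast, n_slow = labels.count(0), labels.count(1)
    for segment, label in zip(segments, labels):
        (slow if label == 1 else fast).update(segment["attrs"].items())
    scores = {}
    for pair in set(fast) | set(slow):
        fast_share = fast[pair] / n_fast if n_fast else 0.0
        slow_share = slow[pair] / n_slow if n_slow else 0.0
        scores[pair] = slow_share - fast_share
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical trace segments, one per task execution.
segments = [
    {"duration": 10.0, "attrs": {"config": "A"}},
    {"duration": 11.0, "attrs": {"config": "A"}},
    {"duration": 12.0, "attrs": {"config": "A"}},
    {"duration": 50.0, "attrs": {"config": "B"}},
    {"duration": 52.0, "attrs": {"config": "B"}},
]
labels = split_fast_slow([s["duration"] for s in segments])
ranking = attribute_differences(segments, labels)
```

In this example the top-ranked pair is the "B" configuration, which appears only in the slow cluster. Real traces would need richer features than a single duration, which is why modelling the semantics of the events comes first.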
Track 4 focuses on providing specialised algorithms, analysis and views to support complex software environments, with a large number of parallel co-processor cores, a large cloud of physical nodes, and distributed applications with mobile code.
The focus of this project is to adequately support large scale heterogeneous systems. The first 3 tracks propose algorithms and an architecture to achieve this goal. Nonetheless, specialised analysis and views are required to support specific representative applications that run on such systems. Three different classes of applications are targeted. The applications selected are in widespread use, come with an open source reference implementation, and exercise the large scale heterogeneous systems targeted by this project.
The first application is parallel programming environments, for manycore systems and GPGPUs, such as OpenMP and OpenCL, and also dataflow programming, which has been used for signal processing on earlier Epiphany chips and for TensorFlow on GPGPU chips. In addition to the framework proposed in tracks 1 to 3, supporting this application requires linking the individual computation events (one computation on one core) with the associated code line in the high level parallel programming model. Similarly challenging is designing a suitable graphical interface that can show an overview of the system state, representing thousands of parallel cores, but can also be used to dig into a problem and display the detailed state of a specific core.
The second target is cloud-based applications. The environment consists of a very large number of nodes running a Cloud Computing stack with virtualisation, such as OpenStack and OpenNFV. On top of that, distributed applications such as Web services (e.g., Linux, Apache, MySQL and PHP) and MapReduce parallel computing (e.g., Apache Spark) complete the stack. Here again, the monitoring infrastructure relies on the work in tracks 1 to 3, but also requires support to map logical computations and network nodes to physical hardware nodes and network switches.
In Track 4, tracing and debugging support for advanced programming environments, the aim is to propose new algorithms and views to support complex software environments, with a large number of parallel co-processor cores, a large cloud of physical nodes, and distributed applications with mobile code. This will serve a dual purpose: adding support for these complex use cases, and validating that the algorithms and architecture proposed in the first 3 tracks can indeed support such use cases efficiently.