ARM Research Summit 2017 Workshop

From gem5
Revision as of 06:09, 26 July 2017 by Andysan (talk | contribs) (Updated schedule with short talks)
Jump to: navigation, search

The ARM Research Summit is an academic summit to discuss future trends and disruptive technologies across all sectors of computing. On the first day of the Summit, ARM Research will host a gem5 workshop to give a brief overview of gem5 for computer engineers who are new to gem5 and dive deeper into some of gem5's more advanced capabilities. The attendees will learn what gem5 can and cannot do, how to use and extend gem5, as well as how to contribute back to gem5.

The ARM Research Summit will take place in Cambridge (UK) over the days of 11-13 September 2017. The gem5 workshop will be a full day event on the 11th September.

Target Audience

The primary audience is researchers who are using, or planning to use, gem5 for architecture research.

Prerequisites: Attendees are expected to have a working knowledge of C++, Python, and computer systems.


See the main ARM Research Summit website for details about registration.

Can I contribute?

Yes! We can still accommodate a shorter talks (15 min including questions). Let Andreas Sandberg or Nikos Nikoleris if you have an interesting gem5-related project you would like to present.

Preliminary Schedule

The workshop will take place on Monday the 11th September 2017 at Robinson College in Cambridge (UK). The workshop schedule has not been finalised yet. We plan to cover the following topics:

  • Introduction to gem5
  • What is gem5 and where do I get it?
  • Running Workloads: Workload automation and best practice


Trace-driven simulation of multithreaded applications in gem5

The gem5 modular simulator provides a rich set of CPU models which permits balancing simulation speed and accuracy. The growing interest in using gem5 for design-space exploration however requires higher simulation speeds so as to enable scalability analysis with systems comprising tens to hundreds of cores. One relevant approach for enabling significant speedups lies in using trace-driven simulation, in which CPU cores are abstracted away thereby enabling to refocus simulation effort on memory/interconnect subsystems which play a key role on performance. This talk describes some of the work carried out on the Mont-Blanc european projects on trace-driven simulation and discusses the related challenges for multicore architectures in which trace injection requires to account for the API synchronization of the underlying running application. The ElasticSimMATE tool is presented as an initiative towards combining Elastic Traces and SimMATE so as to enable fast and accurate simulation of multithreaded applications on ARM multicore systems.

Dr Gilles Sassatelli is a CNRS senior scientist at LIRMM, a CNRS-University of Montpellier academic research unit with a staff of over 400. He is vice-head of the microelectronics department and leads a group of 20 researchers working in the area of smart embedded digital systems. He has authored over 200 peer-reviewed papers and has occupied key roles in a number of international conferences. Most of his research is conducted in the frame of international EU-funded projects such as the DreamCloud and Mont-Blanc projects.

Alejandro Nocua received the Ph.D. degree in Microelectronics from the University of Montpellier, France, in 2016. Currently, he is a postdoctoral researcher at the French National Center for Scientific Research (CNRS). His research interests include the analysis of high-performance and energy-efficiency design methodologies. He received his Master degree in Science from the National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico, in 2013. Alejandro was awarded his BS degree in Electronics Engineering from Industrial University of Santander (UIS), Colombia in 2011.

Florent Bruguier received the M.S. and Ph.D. degrees in microelectronics from the University of Montpellier, France, in 2009 and 2012, respectively. From 2012 to 2015, he was a Scientific Assistant with the Montpellier Laboratory of Informatics, Robotics, and Microelectronics, University of Montpellier. Since 2015, he is a Permanent Associate Professor. He has co-authored over 30 publications. His research interests are focused on self-adaptive and secure approaches for embedded systems.

Modeling Cache Coherence with gem5

Correctly implementing cache coherence protocols is hard and these implementation details can affect the system's performance. Therefore, it is important to robustly model the detailed cache coherence implementation. The popular computer architecture simulator gem5 uses Ruby as its cache coherence model providing higher fidelity cache coherence modeling than many other simulators.

In this talk, I will give a brief overview of Ruby, including SLICC: the domain-specific language Ruby uses to specify cache protocols. I will show the extreme flexibility of this model and details of a simple cache coherence protocol. After this talk, you will be able to dive in and begin writing your own coherence protocols!

Jason Lowe-Power is an Assistant Professor at University of California, Davis in the Computer Science department. Jason's research focuses on increasing the energy efficiency and performance of end-to-end applications like analytic database operations used by Amazon, Google, Target, etc. One important aspect of this research is adding hardware mechanisms to systems that enable all programmers to use emerging hardware accelerators like GPUs. Additionally, Jason is a leader of the open-source architectural simulator, gem5, used by over 1500 academic papers. Jason received his PhD from University of Wisconsin-Madison in Summer 2017. He was awarded the Wisconsin Distinguished Graduate Fellowship Cisco Computer Sciences Award in 2014 and 2015.

A Detailed On-Chip Network Model inside a Full-System Simulator

Compute systems are ubiquitous, with form factors ranging from smartphones at the edge to datacenters in the cloud. Chips in all these systems today comprise 10s to 100s of homogeneous/heterogeneous cores or processing elements. The growing emphasis on parallelism, distributed computing, heterogeneity, and energy-efficiency across all these systems makes the design of the Network-on-Chip (NoC) fabric connecting the cores critical to both high-performance and low power consumption.

It is imperative to model the details of the NoC when architecting and exploring the design-space of a complex many-core system. If ignored, an inaccurate NoC model could lead to over-design or under-design due to incorrect trade-off choices, causing performance losses at runtime. To this end, we have designed and integrated a detailed on-chip network model called Garnet inside the gem5 ( full-system architectural simulator which is being used extensively by both industry and academia. Together with Garnet, gem5 provides plug-and-play models of cores, caches, cache coherence protocols, NoC, memory controller, and DRAM, with varying levels of details, enabling computer architects and designers to trade-off simulation speed and accuracy.

In this talk, we will first introduce the basic building blocks of NoCs and present the state-of-the-art used in chips today. We will then present Garnet, and demonstrate how it faithfully models the state-of-the-art, while also offering immense flexibility in modifying various parts of the microarchitecture to serve the needs of both homogeneous many-cores and heterogeneous accelerator-based systems of the future via case studies and code-snippets. Finally, we will demonstrate how Garnet works within the entire gem5 ecosystem.

Tushar Krishna is an Assistant Professor in the Schools of ECE and CS at Georgia Tech. He received a Ph.D. in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2014. Prior to that he received a M.S.E from Princeton University in 2009, and a B.Tech from the Indian Institute of Technology (IIT) Delhi in 2007, both in Electrical Engineering.

Before joining Georgia Tech in 2015, Dr. Krishna was a post-doctoral researcher in the VSSAD Group at Intel, Massachusetts, and then at the Singapore-MIT Alliance for Research and Technology at MIT.

Dr. Krishna's research interests are in computer architecture, interconnection networks, networks-on-chip, deep learning accelerators, and FPGAs.

System Simulation with gem5, SystemC and other Tools

SystemC TLM based virtual prototypes have become the main tool in industry and research for concurrent hardware and software development, as well as hardware design space exploration. However, there exists a lack of accurate, free, changeable and realistic SystemC models of modern CPUs. Therefore, many researchers use the cycle accurate open source system simulator gem5, which has been developed in parallel to the SystemC standard. In this tutorial we present the coupling of gem5 with SystemC that offers full interoperability between both simulation frameworks, and therefore enables a huge set of possibilities for system level design space exploration. Furthermore, we show several examples for coupling gem5 with SystemC and other tools.

Matthias Jung received his PhD degree in Electrical Engineering from the University of Kaiserslautern Germany in 2017. His research interest are SystemC based virtual prototypes, especially with the focus on the modeling of memory systems and memory controller design. Since may 2017 he is a researcher at Fraunhofer IESE, Kaiserslautern, Germany.

Christian Menard received a Diploma degree in Information Systems Technology from TU Dresden in Germany in 2016 and joined the chair for compiler construction as a Ph.D. student within the excellence cluster cfaed in TU Dresden. His current research includes system-level modeling of widely heterogeneous hardware as well dataflow compilers for heterogeneous MPSoC platforms.

CPU power estimation using PMCs and its application in gem5

Fast and accurate estimation of CPU power consumption is necessary to inform run-time power management approaches and allow effective design space exploration. Power simulators, combined with a full-system architectural simulator such as gem5, enable power-performance trade-offs to be investigated early in the design of a system. However, the accuracy of existing power simulators is known to be low, and this can lead to incorrect conclusions being made. In this talk, I will present our statistically rigorous methodology for building accurate run-time power models using Performance Monitoring Counters (PMCs) for mobile and embedded devices, and demonstrate how our models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability. Models built using the methodology for both ARM Cortex-A7 and Cortex-A15 CPUs exhibit a 3.8% and 2.8% average error respectively. I will also present online resources that we have made available from the work, including software tools, documentation, raw data and further results. I will also present results from an investigation into the correlation between gem5 activity statistics and hardware PMCs. Based on this, a gem5 power model for a simulated quadcore ARM Cortex-A15 has been created, built using the above methodology, and its accuracy compared against experimental results obtained from hardware.

Geoff Merrett is an Associate Professor in the Department of Electronics and Computer Science at the University of Southampton. He received the BEng (1st, Hons) and PhD degrees in Electronic Engineering from Southampton in 2004 and 2009 respectively. His research interests are in energy-aware and self-powered computing systems, with application across the spectrum from highly constrained IoT devices to many-core mobile and embedded systems. He has published over 100 peer-reviewed articles in these areas, and given invited talks at a number of international events. Dr Merrett is a Co-Investigator on the EPSRC-funded £5.6M PRiME Programme Grant (where he leads the applications and cross-layer interaction theme), "Continuous on-line adaptation in many-core systems: From graceful degradation to graceful amelioration", and deputy-lead on the "Wearable and Autonomous Computing for Future Smart Cities" Platform Grant. He is technical manager of Southampton’s ARM-ECS Research Centre, an award-winning industry-academia collaboration between the University of Southampton and ARM. He coordinates IoT research at the University, and leads the wireless sensing theme of its Pervasive Systems Centre. He is an Associate Editor for the IET CDS journal, serves as a reviewer for a number of leading journals, and on TPCs for a range of conferences. He co-manages the UK’s Energy Harvesting Network, was General Chair of the ACM Workshop on Energy-Harvesting and Energy-Neutral Sensing Systems in 2013, 2014, and 2015, and was the General Chair of the European Workshop on Microelectronics Education 2016. He is a member of the IEEE, IET and Fellow of the HEA.

Short Talks

Debugging a target-agnostic JIT compiler with GEM5

Author: Boris Shingarov - LabWare

We explain how GEM5 enabled us to develop a target-agnostic JIT compiler, in which no knowledge about the target ISA is coded by the human programmer; instead, the backend is inferred, using logic programming, from a formal machine description written in a Processor Description Language. Debugging such a JIT presents some challenges which can not be addressed using traditional approaches. One such challenge is the impedance mismatch between the high-level abstractions in the PDL and the low-level inferred implementation. In this talk, we present a new debugger based on simulating the execution of the target runtime VM in GEM5; the debugger frontend connects to this simulation using the RSP wire protocol.

COSSIM: An Integrated Solution to Address the Simulator Gap for Parallel Heterogeneous Systems

In an era of complex networked heterogeneous systems, simulating independently only parts, components or attributes of a system-under-design is not a viable, accurate or efficient option. The interactions are too many and too complicated to produce meaningful results and the optimization opportunities are severely limited when considering each part of a system in an isolated manner. COSSIM offers a framework that can handle the simulation of a complete system-of-systems including processors, peripherals and networks that can appeal to Parallel (Heterogeneous) Systems designers and application developers in an integrated way.

The framework is based on gem5 as the main simulation engine for processor-based systems and extends its capabilities by integrating it with the OMNET++ network simulator. This integration allows independent gem5 instances to be networked with all network protocols and hierarchies that can be supported by OMNET++, thus creating a very flexible solution. The integration of the two main simulation tools is realized through the IEEE 1516 High-Level Architecture standard (HLA), through which all communication tasks are performed. Through HLA and custom libraries, a two-level (per node and global) synchronization scheme is also implemented to ensure a coherent notion of time between all nodes.

Since HLA is IP-based all gem5 instances and OMNET++ can be executed on the same physical machine or on any distributed system (or any combination in between). The overall framework – the set of gem5 nodes, the OMNET++ simulator and the CERTI HLA – are integrated in a unified Eclipse-based GUI that has been developed to provide easy simulation set-up, execution and visualization of results. McPAT is also integrated in a semi-automated way through the GUI in order to provide power and energy estimations for each node, while OMNET++ provides power estimations for networking-related components (NICs and network devices).

Andreas Brokalakis is a senior hardware engineer at Synelixis Solutions Ltd. At the same time he is pursuing a PhD degree at the Technical University of Crete, Greece. He holds a Bachelor degree in Computer Engineering from University of Patras, Greece and a Master’s Degree on Hardware/Software Co-design from the same university. Current work and research interests involve computer architecture and arithmetic, as well as design of ASIC and FPGA systems and accelerators.

Nikolaos Tampouratzis is a PhD student at Technical University of Crete, working on simulation tools for computing systems. He has joined Telecommunication Systems Institute, Technical University of Crete since October 2012 as a research associate, providing research and development services to several EU-funded research projects. He received his Computer Science diploma from the University of Crete (UOC, Greece), with specialization in Hardware Design and FPGAs. He continued his studies in the Technical University of Crete (TUC Greece) where he received his Master Diploma in Electronic and Computer Engineering in which he specialized in Computer Architecture and Hardware Design.

Simulation of Complex Systems Incorporating Hardware Accelerators

The breakdown of Dennard scaling coupled with the persistently growing transistor counts increased the importance of application-specific hardware acceleration; such an approach offers significant performance and energy benefits compared to general-purpose solutions. In order to thoroughly evaluate such architectures, the designer should perform a quite extensive design space exploration so as to evaluate the trade-offs across the entire system. The design, until recently, has been predominantly done using Register Transfer Level languages such as Verilog and VHDL, which, however, lead to a prohibitively long and costly design effort. In order to reduce the design time a wide range of both commercial and academic High-Level Synthesis (HLS) tools have emerged; most of these tools, handle hardware accelerators that are described in synthesizable SystemC. The problem today, however, is that most simulators used for evaluating the complete user applications (i.e. full-system CPU/Mem/Peripheral simulators) lack any type of SystemC accelerator support.

Within this context, we extend gem5 to support the simulation of generic SystemC accelerators. We introduce a novel flow that enables us to rapidly prototype synthesisable SystemC hardware accelerators in conjunction with gem5. The proposed solution handles automatically all communication and synchronisation issues.

Compared to a standard gem5 system, several changes at different levels are required, from the OS and device drivers level down to the implementation of a device model in the gem5 simulator. Instead of using files to write data for an external accelerator, perform the simulation and then read back the results, our approach communicates with the SystemC simulator through programmed I/Os and DMA engines, supporting full global synchronisation. Apart from the apparent benefits concerning the implementation and simulation accuracy, the proposed solution is also orders of magnitude faster.

Generating Synthetic Traffic for Heterogeneous Architectures

Modern system-on-chip architectures consist of many heterogeneous processing elements. The communication fabric and memory hierarchy supporting these processing elements heavily influence the system’s overall performance. Exploring the design space of these heterogeneous architectures with detailed models of each processing element can be time-consuming. Statistical simulation has been shown to be an effective tool for quickly evaluating architectures by abstracting away complexity.

This talk describes work done on modelling the spatial and temporal behaviour of a processing element’s address stream. We present a methodology that can automatically characterize a processing element by observing its reads and writes. Using these characteristics we can stimulate a communication fabric connecting many different processing elements by synthetically recreating their addresses. These addresses arrive at their destination in the memory hierarchy, spawning new messages and responses to read and write requests. Architects can now combine ynthetic processing elements that represent various different components on current and future systems-on-chip to evaluate the impact of changes at the interconnection network and memory hierarchy.

Mario Badr is a PhD Candidate at the University of Toronto working under the supervision of Dr. Natalie Enright Jerger. He received his B.A.Sc. and M.A.Sc from the University of Toronto in Electrical Engineering and Computer Engineering, respectively. He has interned with Qualcomm Research Silicon Valley and received the Roberto Padovani Scholarship for his outstanding technical contributions. In addition, he has been recognized at the university and departmental levels for excellence as a teaching assistant. His research interests include performance evaluation in computer architecture, heterogeneous architectures, and multi-threaded workloads.