Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies - 2018


As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Running systems efficiently at such large scales depends critically on deploying effective, practical methods for fault tolerance, together with a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance technique is checkpoint/restart. Checkpoint scheduling policies, however, have traditionally been optimized and analyzed from a single angle: application performance. In this work, we provide an extensive analysis of the performance, energy, and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from ten production HPC clusters. Our results show ample room for achieving high-quality energy/performance tradeoffs when using methods that exploit characteristics of real-world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.
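The abstract does not say which simple formulas it uses to estimate wasted work. As a rough illustration only, the sketch below uses the classic first-order approximation attributed to Young, which is an assumption here and not necessarily the paper's method: the runtime-optimal checkpoint interval is sqrt(2 * C * M), and the fraction of time wasted at interval T is approximately C/T + T/(2M), where C is the cost of writing one checkpoint and M is the mean time between failures. The function names and the example values are hypothetical, and restart cost is ignored.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """First-order (Young) estimate of the runtime-optimal checkpoint interval.

    checkpoint_cost and mtbf must use the same time unit (e.g. seconds).
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost to checkpointing and rework.

    First term: time spent writing checkpoints per unit of work.
    Second term: expected recomputation after a failure, assuming failures
    land on average halfway through an interval. Restart cost is ignored.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 5-minute checkpoints, one failure per day.
    cost, mtbf = 300.0, 86400.0
    best = young_interval(cost, mtbf)
    print(f"interval: {best:.0f} s, wasted: {wasted_fraction(best, cost, mtbf):.3f}")
```

This approximation targets runtime only. An energy-optimized policy of the kind the abstract describes would weigh the two terms differently, since checkpointing and recomputation need not draw the same power.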


