PROJECT TITLE:
Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies - 2018
As the size of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance technique is checkpoint/restart. Checkpoint scheduling policies, however, have traditionally been optimized and analyzed from one angle: application performance. In this work, we provide an extensive analysis of the performance, energy and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from ten production HPC clusters. Our results show ample room for achieving high-quality energy/performance tradeoffs when using methods that exploit characteristics of real-world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.
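To give a feel for what a "simple formula" for wasted work can look like, the sketch below uses Young's classic first-order approximation for a periodic checkpoint interval. This is a well-known textbook model, not necessarily the formula used in this work. The function names and the example checkpoint cost and mean time between failures (MTBF) are hypothetical values chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds).

    Assumption: failures are exponentially distributed and the checkpoint
    cost is small compared to the MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of the fraction of time that is wasted.

    First term: time spent writing checkpoints.
    Second term: expected rework after a failure (on average half an
    interval is lost, once per MTBF).
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost = 300.0     # hypothetical: 5 minutes to write one checkpoint
    mtbf = 86400.0   # hypothetical: one failure per day on average
    tau = young_interval(cost, mtbf)
    print(f"interval: {tau:.0f} s")
    print(f"estimated wasted fraction: {wasted_fraction(tau, cost, mtbf):.4f}")
```

Under this model the two waste terms are equal at the chosen interval, which is why checkpointing either more or less often increases the estimate. An energy-oriented policy would weight the two terms differently, for example by the power drawn during checkpointing versus during recomputation.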