PROJECT TITLE:
Fault Tolerant Stencil Computation on Cloud-based GPU Spot Instances - 2017
ABSTRACT:
This paper describes a fault-tolerant framework for distributed stencil computation on cloud-based GPU clusters. It uses pipelining to overlap data movement with computation in the halo region, and it parallelises data movement among the GPUs. Instead of running stencil codes on traditional clusters and supercomputers, the computation is performed on the Amazon Web Services GPU cloud, and it uses spot instances to improve cost-efficiency. The implementation is based on a low-cost fault-tolerant mechanism that handles the possible termination of spot instances. Because it includes a price bidding module, our stencil framework optimizes not only for performance but also for cost. Experimental results show that our framework outperforms state-of-the-art solutions, achieving a peak of 25 TFLOPS for 2-D decomposition running on 512 nodes. We also show that the use of spot instances yields good cost-efficiency, increasing the average TFLOPS/USD from 132 to 360.
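The abstract does not include the framework's code. As a rough illustration only, the sketch below shows two of the ideas it names: exchanging halo rows between subdomains of a stencil grid, and recovering from a terminated worker by rolling back to the last checkpoint. It assumes a 5-point Jacobi stencil, a simple strip (row) decomposition rather than the 2-D decomposition reported above, and in-memory checkpoints. A sequential loop stands in for the GPUs, and there is no pipelining or price bidding. The names `split_rows`, `step_strip`, and `run` are hypothetical and are not the paper's API.

```python
# Hypothetical sketch (not the paper's framework): strip-decomposed 5-point
# Jacobi stencil with halo rows, plus checkpoint/restart after a simulated
# spot-instance termination. A plain loop stands in for the GPU workers.

def split_rows(n_rows, parts):
    """Return one (start, stop) row range per hypothetical worker."""
    base, extra = divmod(n_rows, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def step_strip(grid, start, stop):
    """Update owned rows [start, stop) using one halo row above and below."""
    n, m = len(grid), len(grid[0])
    lo, hi = max(start - 1, 0), min(stop + 1, n)
    # Owned rows plus halo rows, copied as if received from neighbouring workers.
    local = [row[:] for row in grid[lo:hi]]
    new_rows = []
    for i in range(start, stop):
        li = i - lo
        row = local[li][:]
        if 0 < i < n - 1:  # global boundary rows stay fixed
            for j in range(1, m - 1):  # boundary columns stay fixed
                row[j] = 0.25 * (local[li - 1][j] + local[li + 1][j]
                                 + local[li][j - 1] + local[li][j + 1])
        new_rows.append(row)
    return new_rows

def step(grid, parts):
    """One sweep over all strips; each strip reads only its rows and halos."""
    rows = []
    for start, stop in split_rows(len(grid), parts):
        rows.extend(step_strip(grid, start, stop))
    return rows

def run(grid, steps, parts, checkpoint_every, fail_at=None):
    """Run `steps` sweeps; if fail_at is set, lose state once at that step."""
    state, done = [r[:] for r in grid], 0
    ckpt = ([r[:] for r in state], 0)  # (saved grid, step it was saved at)
    failed = False
    while done < steps:
        if fail_at is not None and not failed and done == fail_at:
            failed = True  # simulated termination: in-memory state is lost
            state, done = [r[:] for r in ckpt[0]], ckpt[1]  # roll back
            continue
        state = step(state, parts)
        done += 1
        if done % checkpoint_every == 0:
            ckpt = ([r[:] for r in state], done)
    return state

if __name__ == "__main__":
    g = [[0.0] * 8 for _ in range(8)]
    g[0] = [1.0] * 8  # hot top boundary
    ref = run(g, steps=20, parts=1, checkpoint_every=5)
    ft = run(g, steps=20, parts=4, checkpoint_every=5, fail_at=13)
    print(ref == ft)  # True
```

The checkpoint interval trades lost work against overhead. Here a termination at step 13 replays only steps 11 to 13, and the result matches an uninterrupted single-worker run exactly because every cell is computed with the same operations in the same order.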