Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of these problems are node failures and the need for dynamic configuration over extensive run-time. This paper presents two fault-tolerance mechanisms called Theft Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as in multi-threaded applications. By allowing recovery even under a different number of processors, the approaches are especially suitable for applications that need adaptive or reactive configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small and that the maximum work lost by a crashed process is small and bounded.
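The idea of basing the execution state on a dataflow graph can be illustrated with a minimal sketch. This is not the paper's protocol: the toy graph `GRAPH`, the `execute` function, the `Crash` exception, and the in-memory `log` dictionary standing in for stable storage are all hypothetical names chosen for illustration. The sketch records each task result as soon as it is produced, simulates a crash, and then recovers by running only the tasks whose results are missing from the log.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical toy dataflow graph: task name -> (input task names, function).
GRAPH = {
    "a": ((), lambda: 2),
    "b": ((), lambda: 3),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c",), lambda c: c + 1),
}


class Crash(Exception):
    """Simulated crash fault (assumption: a crash simply stops execution)."""


def execute(graph, log, crash_before=None):
    """Run every task whose result is not yet in `log`.

    `log` stands in for stable storage: each result is recorded as soon as
    it is produced. Returns the names of the tasks executed in this call.
    """
    predecessors = {name: deps for name, (deps, _) in graph.items()}
    executed = []
    for name in TopologicalSorter(predecessors).static_order():
        if name in log:
            continue  # result survived the crash, so it is not recomputed
        if name == crash_before:
            raise Crash(name)
        deps, fn = graph[name]
        log[name] = fn(*(log[d] for d in deps))
        executed.append(name)
    return executed


def demo():
    log = {}
    try:
        execute(GRAPH, log, crash_before="c")  # crash before task "c" runs
    except Crash:
        pass
    saved = sorted(log)            # results that survived the crash
    redone = execute(GRAPH, log)   # recovery: only missing tasks are run
    return saved, redone, log["d"]
```

Because the recovered state in this sketch lives in the graph and the log rather than in the memory of any particular process, the recovery call does not depend on how many workers performed the original run, and the work redone is limited to the tasks that were not yet logged.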