On Fault Tolerance for Distributed Iterative Dataflow Processing - 2017


Large-scale graph and machine learning analytics widely use distributed iterative processing. Typically, these analytics are part of a comprehensive workflow that includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the entire pipeline. Graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from any intermittent failure are critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the typical approach to unblocking checkpointing (e.g., approaches that manage checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we are able to recover rapidly via confined recovery, which exploits the fact that log files exist locally on healthy nodes and thereby avoids a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to recover quickly without having to introduce any checkpoints.
To evaluate our fault-tolerance strategies, we conduct both a theoretical study and experimental analyses using Apache Flink, and find that they outperform blocking checkpointing and complete recovery.
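
The checkpoint-free recovery idea for machine learning mentioned above can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation and not Flink code: the names `train`, `partial_gradient`, `fail_at`, and `failed_worker` are hypothetical, the model is a single weight fitted by gradient descent, and the input partitions are assumed to remain readable after a failure (for example, from durable storage). Because the model is broadcast to every worker after each iteration, a worker that loses its state can copy the model from any surviving worker instead of reading a checkpoint.

```python
from typing import List, Optional, Tuple

Partition = List[Tuple[float, float]]  # (x, y) pairs held by one worker


def partial_gradient(w: float, partition: Partition) -> float:
    """Sum of gradients of 0.5 * (w*x - y)^2 over one worker's partition."""
    return sum((w * x - y) * x for x, y in partition)


def train(partitions: List[Partition], iterations: int, lr: float,
          fail_at: Optional[int] = None, failed_worker: int = 0) -> float:
    """Gradient descent where every worker keeps a replica of the model.

    If `fail_at` is set, the replica on `failed_worker` is lost at the start
    of that iteration and restored from a surviving worker (hypothetical
    failure model: a single worker fails, input data stays available).
    """
    n = sum(len(p) for p in partitions)
    replicas: List[Optional[float]] = [0.0] * len(partitions)

    for it in range(iterations):
        if it == fail_at:
            replicas[failed_worker] = None  # simulate losing the worker's state

        # Recovery without checkpoints: copy the model from any healthy replica.
        if any(r is None for r in replicas):
            survivor = next(r for r in replicas if r is not None)
            replicas = [survivor if r is None else r for r in replicas]

        # Each worker computes a partial gradient with its local replica.
        grads = [partial_gradient(w, p) for w, p in zip(replicas, partitions)]

        # Aggregate, update, and broadcast the new model to all workers.
        new_w = replicas[0] - lr * sum(grads) / n
        replicas = [new_w] * len(partitions)

    return replicas[0]


# Data generated by y = 2x, split across three simulated workers.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]
baseline = train(partitions, iterations=50, lr=0.05)
recovered = train(partitions, iterations=50, lr=0.05, fail_at=20, failed_worker=1)
```

In this toy setting the run with a simulated failure produces the same model as the failure-free run, since the restored replica is identical to the one that was lost. A real system would additionally have to reschedule the failed task and reload its input partition.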

