PROJECT TITLE : Optimizing Speculative Execution in Spark Heterogeneous Environments

ABSTRACT: In Spark computing environments, a few tasks that run more slowly than the rest can extend the completion time of an entire stage. To combat this straggler problem, Spark employs a speculative execution mechanism: the scheduler speculatively launches a backup copy of a straggler task in the hope that the copy finishes before the original. However, because of the diversity of task types and the complexity of runtime environments, Spark's original speculative execution strategy and its improved variants cannot handle this problem effectively. To make speculative execution in Spark more effective, we propose a new strategy called ETWR. It tackles the three key points of speculative execution, namely straggler identification, backup node selection, and effectiveness guarantee, while taking the heterogeneous environment into account. First, we divide each task into sub-phases according to its task type and, within each phase, use both the process speed and the progress rate to identify stragglers as early as possible. Second, we estimate task execution time with a Locally Weighted Regression (LWR) model; from this estimate we compute both the remaining time of a running task and the time a backup copy would need. Third, we present the iMCP model to guarantee the effectiveness of speculative tasks while maintaining load balance across nodes. Finally, when selecting reliable backup nodes, we favor fast nodes with better data locality. Extensive experiments show that, compared with Spark-2.2.0, ETWR reduces job completion time by 23.8 percent and improves cluster throughput by 33.2 percent.
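To make the first step concrete, below is a minimal Scala sketch of phase-level straggler detection. The TaskPhaseStats fields, the median baseline, and the slowFactor threshold are illustrative assumptions; the abstract only states that process speed and progress rate are combined within each sub-phase.

```scala
// Hypothetical per-task statistics for one sub-phase of a task.
case class TaskPhaseStats(taskId: Int, bytesProcessed: Long, totalBytes: Long, elapsedMs: Long)

object StragglerDetector {
  // Progress rate: fraction of the phase completed per millisecond.
  def progressRate(t: TaskPhaseStats): Double =
    (t.bytesProcessed.toDouble / t.totalBytes) / t.elapsedMs

  // Process speed: bytes handled per millisecond.
  def processSpeed(t: TaskPhaseStats): Double =
    t.bytesProcessed.toDouble / t.elapsedMs

  // Flag tasks whose rate AND speed both fall well below the phase median.
  def stragglers(tasks: Seq[TaskPhaseStats], slowFactor: Double = 0.5): Seq[TaskPhaseStats] = {
    def median(xs: Seq[Double]): Double = {
      val s = xs.sorted
      val n = s.length
      if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
    }
    val medRate  = median(tasks.map(progressRate))
    val medSpeed = median(tasks.map(processSpeed))
    tasks.filter(t =>
      progressRate(t) < slowFactor * medRate &&
      processSpeed(t) < slowFactor * medSpeed)
  }
}
```

Requiring both signals to lag avoids flagging a task that is merely in a slow sub-phase (low speed but normal overall progress), which is the motivation for splitting tasks into phases in the first place.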
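The second step can be sketched as a one-feature Locally Weighted Regression, here mapping input data size to execution time with a Gaussian kernel. The feature choice, the bandwidth tau, and the worthBackingUp check are assumptions for illustration; the abstract does not specify the exact inputs to the LWR model or the internals of iMCP.

```scala
object LwrEstimator {
  // Predict execution time at query point xq from historical samples
  // (inputSize, runTimeMs) using Gaussian-weighted least squares for
  // the local line y = a + b*x (closed-form 2x2 normal equations).
  def predict(samples: Seq[(Double, Double)], xq: Double, tau: Double = 1.0): Double = {
    val weighted = samples.map { case (x, y) =>
      val w = math.exp(-((x - xq) * (x - xq)) / (2 * tau * tau))
      (w, x, y)
    }
    val sw   = weighted.map(_._1).sum
    val swx  = weighted.map { case (w, x, _) => w * x }.sum
    val swy  = weighted.map { case (w, _, y) => w * y }.sum
    val swxx = weighted.map { case (w, x, _) => w * x * x }.sum
    val swxy = weighted.map { case (w, x, y) => w * x * y }.sum
    val det  = sw * swxx - swx * swx // zero only if all sample x values coincide
    val b = (sw * swxy - swx * swy) / det // local slope
    val a = (swy - b * swx) / sw          // local intercept
    a + b * xq
  }

  // A backup is only worth launching if it is expected to finish before
  // the straggler's remaining time runs out.
  def worthBackingUp(estimatedTotalMs: Double, elapsedMs: Double, backupEstimateMs: Double): Boolean =
    (estimatedTotalMs - elapsedMs) > backupEstimateMs
}
```

The remaining-time versus backup-time comparison in worthBackingUp is the standard effectiveness test for speculative copies; ETWR's iMCP model presumably refines this while also balancing node load.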
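Finally, backup node selection could look like the sketch below: filter out nodes whose task slots are already full (preserving load balance), then rank the remaining nodes by a weighted mix of node speed and data locality. NodeInfo, the 0.7/0.3 weights, and the slot-based load check are hypothetical stand-ins for whatever metrics ETWR actually uses.

```scala
// Hypothetical per-node state, e.g. gathered from executor heartbeats.
case class NodeInfo(host: String, relativeSpeed: Double, runningTasks: Int, slots: Int)

object BackupNodePicker {
  // 1.0 if the node already holds the task's input data, else 0.0.
  def locality(node: NodeInfo, preferredHosts: Set[String]): Double =
    if (preferredHosts.contains(node.host)) 1.0 else 0.0

  // Pick the best candidate: fast, data-local, and not overloaded.
  def pick(nodes: Seq[NodeInfo], preferredHosts: Set[String]): Option[NodeInfo] =
    nodes
      .filter(n => n.runningTasks < n.slots) // skip saturated nodes
      .sortBy(n => -(0.7 * n.relativeSpeed + 0.3 * locality(n, preferredHosts)))
      .headOption
}
```

In a real scheduler the speed and load figures would come from runtime measurements rather than static fields, and the weighting between speed and locality would itself be tuned to the cluster.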