PROJECT TITLE:
Efficient Skew Handling for Outer Joins in a Cloud Computing Environment - 2018
Outer joins are ubiquitous in many workloads and Big Data systems. The question of how best to execute outer joins in large parallel systems is particularly challenging, as real-world datasets are characterized by data skew, which leads to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins, especially in the extremely popular Cloud computing environment. Conventional approaches to the problem, such as ones based on hash redistribution, often lead to load balancing problems, while duplication-based approaches incur significant overhead in terms of network communication. In this project, we propose a new approach for efficient skew handling in outer joins over a Cloud computing environment. We present an efficient implementation of our approach over the Spark framework. We evaluate the performance of our approach on a 192-core system with large test datasets in excess of 100 GB and with varying skew. Experimental results show that our approach is scalable and, at least in cases of high skew, significantly faster than the state-of-the-art.
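To make the trade-off between the two conventional approaches concrete, the following is a minimal sketch and not the method proposed in this project. It assumes integer join keys, uses `key % num_workers` as a stand-in for the hash function, and introduces the hypothetical helpers `hash_partition_loads` and `duplication_cost` purely for illustration. It counts how many tuples each worker receives under hash redistribution, and how many tuples cross the network when the smaller relation is duplicated to every worker.

```python
def hash_partition_loads(keys, num_workers):
    """Count the tuples each worker receives under hash redistribution.

    Assumption: keys are integers and `key % num_workers` stands in
    for the hash function a real system would use.
    """
    loads = [0] * num_workers
    for key in keys:
        loads[key % num_workers] += 1
    return loads


def duplication_cost(small_relation_size, num_workers):
    """Tuples sent over the network if the smaller relation is copied
    in full to every worker (a duplication-based strategy)."""
    return small_relation_size * num_workers


NUM_WORKERS = 8

# Skewed input: key 0 occurs 9000 times, keys 1..1000 occur once each.
skewed_keys = [0] * 9000 + list(range(1, 1001))
# Uniform input of the same size: 10000 distinct keys.
uniform_keys = list(range(10000))

skewed_loads = hash_partition_loads(skewed_keys, NUM_WORKERS)
uniform_loads = hash_partition_loads(uniform_keys, NUM_WORKERS)

print("skewed loads: ", skewed_loads)
print("uniform loads:", uniform_loads)
print("duplication cost for 1000 tuples:", duplication_cost(1000, NUM_WORKERS))
```

Under these assumptions, every tuple carrying the hot key lands on the same worker, so that worker processes most of the skewed input while the others sit nearly idle. Duplication avoids this hotspot, but its network cost grows with the number of workers.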