Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets - 2018

ABSTRACT: In this project, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the distribution of sub-datasets over HDFS blocks, which is hidden by HDFS, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose a storage-distribution-aware method, referred to as DataNet, to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.
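The abstract names the building blocks (a HashMap, a BloomFilter, and distribution-aware scheduling) but does not spell out how they fit together. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that sub-datasets with many records in a block are kept in a hash map with exact counts, while small ones are only recorded in a Bloom filter. The names ElasticMapSketch, threshold, estimate, and assign_blocks are hypothetical, and the greedy least-loaded assignment is a stand-in for whatever distribution-aware algorithm DataNet actually uses.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ElasticMapSketch:
    """Per-block metadata: exact counts for large sub-datasets,
    membership only for small ones (an assumed split)."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical cutoff, in records
        self.exact = {}  # block_id -> {sub_dataset_id: record count}
        self.small = {}  # block_id -> BloomFilter of small sub-dataset ids

    def add_block(self, block_id, record_keys):
        counts = Counter(record_keys)
        self.exact[block_id] = {
            k: n for k, n in counts.items() if n >= self.threshold
        }
        bloom = BloomFilter()
        for k, n in counts.items():
            if n < self.threshold:
                bloom.add(k)
        self.small[block_id] = bloom

    def estimate(self, block_id, key):
        """Estimated record count of `key` in a block.
        Small entries return 1 as a placeholder, not a real size."""
        if key in self.exact[block_id]:
            return self.exact[block_id][key]
        return 1 if key in self.small[block_id] else 0


def assign_blocks(emap, key, num_nodes):
    """Greedy assignment: largest relevant block first, to the least-loaded node."""
    loads = [0] * num_nodes
    plan = {n: [] for n in range(num_nodes)}
    sizes = {b: emap.estimate(b, key) for b in emap.exact}
    for block, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        if size == 0:
            continue  # block holds no data for this sub-dataset, so skip it
        node = loads.index(min(loads))
        plan[node].append(block)
        loads[node] += size
    return plan, loads
```

With this layout, a scheduler can skip blocks that contain nothing for the requested sub-dataset and spread the remaining blocks by estimated workload instead of by block count, which is the kind of imbalance the abstract describes.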