Towards Dependency-Aware Cache Management for Data Analytics Applications


Memory caches are used intensively in many of today's data analytics systems, including Spark, Tez, and Piccolo, among others. Because caches have a significant performance impact yet limited capacity, cache management in data analytics clusters must be efficient. Common data analytics systems, however, employ rather simple cache management policies, such as Least Recently Used (LRU) and Least Frequently Used (LFU), which are agnostic to the application semantics of data dependency, expressed as directed acyclic graphs (DAGs). Without this information, cache management can at best "guess" future data access patterns from history, which frequently results in inefficient, erroneous caching with low hit rates and long response times. To make matters worse, the absence of dependency knowledge makes it impossible to maintain the all-or-nothing cache property of cluster applications: a compute task cannot be sped up unless all of the data blocks it depends on are kept in main memory. In this paper, we propose a novel cache replacement policy, Least Reference Count (LRC), which exploits the application's data dependency information to make cache management more efficient. LRC tracks the reference count of each data block, defined as the number of dependent child blocks that have not yet been computed, and always evicts the block with the lowest reference count. In addition, we build the all-or-nothing requirement into LRC by coordinating the reference counts of all the input data blocks of a single computation.
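The eviction rule described above can be sketched in a few lines. The following is a minimal, illustrative Python sketch, not the paper's Spark implementation: the class name, the `deps` map (child block to the parent blocks it reads), and the `task_done` hook are all assumptions made for the example. A parent's reference count is the number of its not-yet-computed children, and eviction always targets the cached block with the lowest count.

```python
class LRCCache:
    """Illustrative sketch of Least Reference Count (LRC) eviction.

    `deps` maps each child block to the list of parent blocks it reads,
    so a parent's reference count is the number of its children that
    have not yet been computed. All names here are hypothetical.
    """

    def __init__(self, capacity, deps):
        self.capacity = capacity
        self.deps = deps
        self.cache = {}                      # block id -> cached data
        self.ref = {}                        # block id -> outstanding children
        for parents in deps.values():
            for p in parents:
                self.ref[p] = self.ref.get(p, 0) + 1

    def put(self, block, data):
        if block not in self.cache and len(self.cache) >= self.capacity:
            # Evict the cached block with the least reference count.
            victim = min(self.cache, key=lambda b: self.ref.get(b, 0))
            del self.cache[victim]
        self.cache[block] = data

    def get(self, block):
        return self.cache.get(block)

    def task_done(self, child):
        # The task computing `child` finished: each of its parents loses
        # one outstanding reference and becomes more likely to be evicted.
        for p in self.deps.get(child, []):
            self.ref[p] = max(0, self.ref.get(p, 0) - 1)
```

For example, with `deps = {"c1": ["a"], "c2": ["a"], "c3": ["b"]}`, block `a` starts with reference count 2 and block `b` with 1, so a full cache evicts `b` first; once `c1` and `c2` complete, `a`'s count drops to 0 and it becomes the next eviction candidate. The all-or-nothing coordination in the paper goes further than this sketch, adjusting the counts of all input blocks of a computation together.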
We demonstrate the effectiveness of LRC through empirical analysis and cluster deployments measured against widely used benchmark workloads. Our Spark implementation shows that the proposed policies effectively address the all-or-nothing requirement and significantly improve cache performance: compared with LRU and MEMTUNE, a recently proposed caching policy, LRC improves the caching performance of typical workloads in production clusters by 22 and 284 percent, respectively.
