PROJECT TITLE : Towards Dependency-Aware Cache Management for Data Analytics Applications

ABSTRACT: Memory caches are used intensively in many of today's data analytics systems, including Spark, Tez, and Piccolo. Because caches have a significant performance impact yet limited capacity, cache management in data analytics clusters must be efficient. Common data analytics systems, however, rely on rather simple cache management policies, such as Least Recently Used (LRU) and Least Frequently Used (LFU), which are unaware of the application semantics of data dependency expressed as directed acyclic graphs (DAGs). Without this information, cache management can at best "guess" future data access patterns from history, which frequently results in inefficient caching with low hit rates and long response times. To make matters worse, the lack of dependency information makes it impossible to maintain the all-or-nothing cache property of cluster applications: a compute task cannot be sped up unless all of the data blocks it depends on are kept in main memory. In this paper, we propose a novel cache replacement policy called Least Reference Count (LRC), which exploits the application's data dependency information to make cache management more efficient. LRC tracks the reference count of each data block, defined as the number of its dependent child blocks that have not yet been computed, and always evicts the block with the lowest reference count. In addition, we build the all-or-nothing requirement into LRC by coordinating the reference counts of all input data blocks of the same computation. We demonstrate the effectiveness of LRC through both empirical analysis and cluster deployments against widely used benchmarking workloads. Our Spark implementation shows that the proposed policies effectively meet the all-or-nothing requirement and significantly improve cache performance. Compared with LRU and MEMTUNE, a recently proposed caching policy, LRC improves the caching performance of typical production workloads by 22 percent and 284 percent, respectively.
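
To make the policy concrete, below is a minimal sketch of how an LRC-style cache could be structured: each block carries a reference count derived from the DAG (the number of its not-yet-computed children), eviction removes the block with the lowest count, and blocks feeding the same computation are grouped so their counts drop together, approximating the all-or-nothing coordination. The class and method names (LRCCache, on_child_computed, on_computation_done) are illustrative assumptions, not the paper's actual Spark implementation.

```python
from collections import defaultdict


class LRCCache:
    """Sketch of a Least Reference Count (LRC) cache.

    A block's reference count is the number of its dependent child
    blocks (per the application DAG) that have not yet been computed.
    Eviction always removes the block with the lowest reference count.
    """

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of cached blocks
        self.data = {}                    # block_id -> payload
        self.ref_count = {}               # block_id -> remaining child count
        self.groups = defaultdict(set)    # computation_id -> its input block ids

    def put(self, block_id, payload, num_children, computation_id=None):
        """Cache a block that has num_children not-yet-computed children."""
        if len(self.data) >= self.capacity:
            self._evict()
        self.data[block_id] = payload
        self.ref_count[block_id] = num_children
        if computation_id is not None:
            self.groups[computation_id].add(block_id)

    def get(self, block_id):
        """Return the cached payload, or None on a miss."""
        return self.data.get(block_id)

    def on_child_computed(self, parent_ids):
        """A child block was computed: decrement each parent's count."""
        for pid in parent_ids:
            if pid in self.ref_count:
                self.ref_count[pid] -= 1

    def on_computation_done(self, computation_id):
        """All inputs of one computation are consumed together, so their
        reference counts drop in lockstep (all-or-nothing coordination)."""
        self.on_child_computed(self.groups.pop(computation_id, set()))

    def _evict(self):
        """Evict the cached block with the lowest reference count."""
        victim = min(self.data, key=lambda b: self.ref_count[b])
        del self.data[victim]
        del self.ref_count[victim]
```

In an actual deployment, the reference counts would not be supplied by hand as in this sketch; they would be derived from the job DAG that the framework already exposes (for example, Spark's scheduler), which is what allows LRC to anticipate future accesses rather than guessing from history.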