PROJECT TITLE :

CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs

ABSTRACT:

Wide models, such as generalized linear models and factorization-based models, are widely used in predictive applications such as recommendation, click-through-rate (CTR) prediction, and image recognition. Because these models are memory-bound, performance improvement on CPUs is reaching its limit. The graphics processing unit (GPU), with its large number of compute units and high memory bandwidth, is therefore an attractive platform for training Machine Learning models. However, due to the sparsity and irregularity of wide models, GPU training of these models is far from optimal: existing GPU-based implementations can be even slower than their CPU counterparts. The traditional training schema for wide models is not tailored to the GPU architecture; it generates a large number of random memory accesses and performs redundant reads and writes of intermediate values. In this article, we propose cuWide, an effective and efficient GPU training framework for large-scale wide models. cuWide applies a new flow-based training schema that exploits the spatial and temporal locality of wide models to drastically reduce communication with GPU global memory, thereby making full use of the GPU memory hierarchy. To realize the flow-based schema, we employ a bigraph computation model with three flexible programming interfaces. To further optimize GPU memory access for sparse data, we apply a 2D partition of each mini-batch (in the sample and feature dimensions) on top of the proposed graph abstraction.
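The 2D mini-batch partition mentioned above can be sketched in plain Python. This is only a minimal illustration of the idea of tiling a sparse mini-batch along both the sample and feature dimensions; the function name and the COO triple representation are our own choices, not cuWide's actual interface:

```python
from collections import defaultdict

def partition_2d(coo, n_sample_blocks, n_feature_blocks, n_samples, n_features):
    """Group the non-zeros of a sparse mini-batch into 2D tiles.

    coo: iterable of (sample_idx, feature_idx, value) triples.
    Returns {(sample_block, feature_block): [(s, f, v), ...]}, so that
    each tile touches only a bounded slice of the model and of the batch,
    which is what makes caching in fast GPU memory possible.
    """
    s_size = -(-n_samples // n_sample_blocks)    # ceiling division
    f_size = -(-n_features // n_feature_blocks)  # ceiling division
    tiles = defaultdict(list)
    for s, f, v in coo:
        tiles[(s // s_size, f // f_size)].append((s, f, v))
    return dict(tiles)

# Example: 4 samples, 8 features, split into 2x2 blocks.
batch = [(0, 1, 1.0), (0, 5, 2.0), (3, 6, 1.0)]
tiles = partition_2d(batch, 2, 2, 4, 8)
```

In a real GPU kernel, each tile would be processed by one thread block, so all feature values it needs fit in shared memory; the sketch only captures the grouping step.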
Additionally, we implement spatial-temporal caching mechanisms, namely importance-based model caching and cross-stage accumulation caching, to achieve high-performance kernels. We also propose several GPU-oriented optimizations to implement cuWide efficiently: a feature-oriented data layout that improves data locality, a replication mechanism that reduces update conflicts in shared memory, and multi-stream scheduling that overlaps data transfer with kernel computation. Experiments show that cuWide can be more than 20 times faster than state-of-the-art GPU and multi-core CPU solutions.
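The replication mechanism for reducing update conflicts can be illustrated with a small Python sketch. It emulates replicated accumulators that absorb conflicting per-feature updates before a single conflict-free reduction; the round-robin assignment and all names here are our own simplification, not cuWide's implementation:

```python
def replicated_accumulate(updates, n_features, n_replicas):
    """Scatter conflicting feature updates into replicated accumulators,
    then reduce.

    updates: list of (feature_idx, gradient) pairs. Each replica mimics
    a private copy in GPU shared memory that a subset of threads updates,
    so no two threads contend on the same slot at the same time.
    """
    replicas = [[0.0] * n_features for _ in range(n_replicas)]
    for i, (f, g) in enumerate(updates):
        replicas[i % n_replicas][f] += g  # round-robin replica assignment
    # Final reduction across replicas: one pass, no write conflicts.
    return [sum(rep[f] for rep in replicas) for f in range(n_features)]

# Example: two updates hit feature 0; with 2 replicas they land in
# different copies and are merged only in the reduction step.
total = replicated_accumulate([(0, 1.0), (0, 2.0), (2, 0.5)], 3, 2)
```

On a GPU the same pattern trades a small amount of shared memory for fewer atomic operations on hot features; the sketch only captures the scatter-then-reduce structure.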




