CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs

PROJECT TITLE: CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs

ABSTRACT: Wide models, such as generalized linear models and factorization-based models, are used extensively in predictive applications such as recommendation, CTR prediction, and image recognition. Because these models are memory bound, performance improvement on the CPU is reaching its limit, and the GPU, with its large number of computation units and high memory bandwidth, becomes an attractive platform for training them. However, due to the sparsity and irregularity of wide models, GPU training of these models remains far from optimal; existing GPU-based wide-model systems are even slower than their CPU counterparts. The root cause is that the traditional training schema for wide models is not designed for the GPU architecture: it generates a large number of random memory accesses and performs redundant reads and writes of intermediate values. In this article, we propose cuWide, an effective and efficient GPU training framework for large-scale wide models. cuWide adopts a new flow-based training schema that exploits the spatial and temporal locality of wide models to dramatically reduce communication with GPU global memory, thereby making full use of the GPU memory hierarchy. To realize the flow-based schema, we employ a bigraph computation model with three flexible programming interfaces. To further optimize GPU memory access for sparse data, we apply a 2D partition of each mini-batch (in the sample and feature dimensions) on top of the proposed graph abstraction, together with spatial-temporal caching mechanisms (importance-based model caching and cross-stage accumulation caching) to achieve high-performance kernels. We also propose several GPU-oriented optimizations to implement cuWide efficiently: a feature-oriented data layout that improves data locality, a replication mechanism that reduces update conflicts in shared memory, and multi-stream scheduling that overlaps data transfer with kernel computation. Experiments demonstrate that cuWide achieves up to more than 20x speedup over state-of-the-art GPU and multi-core CPU solutions.
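To make the data-layout idea concrete, below is a minimal sketch of what a feature-oriented (feature-major, CSC-like) mini-batch layout could look like. The struct and field names are illustrative assumptions, not cuWide's actual interface; the point is that all nonzeros of one feature sit contiguously, so the threads that update a given model entry touch consecutive memory rather than scattering across a sample-major layout.

```
// Illustrative feature-major (CSC-like) layout for one sparse mini-batch.
// All names here are hypothetical; cuWide's real structures may differ.
struct FeatureMajorBatch {
    int    num_features;  // number of distinct features in the batch
    int*   feat_off;      // size num_features + 1; nonzeros of feature f
                          // occupy indices [feat_off[f], feat_off[f+1])
    int*   sample_id;     // per-nonzero: the sample it belongs to
    float* value;         // per-nonzero: the feature value
};
```

Compared with the usual sample-major CSR layout, this groups memory traffic by model entry rather than by sample, which is what enables coalesced reads and a per-feature accumulation pattern on the GPU.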
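The replication mechanism can be sketched as follows, assuming the feature-major layout above and a 2D-tiled mini-batch where each thread block owns one tile of at most TILE_FEATS features. The kernel, constants, and argument names are hypothetical placeholders; the technique shown is the general one the abstract describes: keep K replicated accumulators in shared memory, spread atomic updates across the replicas to reduce conflicts, then reduce and flush each feature once to global memory.

```
#include <cuda_runtime.h>

#define K 8            // replicas per block (hypothetical, tuned per GPU)
#define TILE_FEATS 128 // feature-tile width (hypothetical)

// One thread block accumulates gradients for one 2D tile of the mini-batch.
// local_fid[i] in [0, TILE_FEATS) is the feature id relative to this tile.
__global__ void tile_accumulate(const int* __restrict__ tile_off,   // per-tile nonzero offsets
                                const int* __restrict__ local_fid,  // per-nonzero local feature id
                                const float* __restrict__ grad,     // per-nonzero gradient
                                const int* __restrict__ tile_fbase, // first global feature of tile
                                float* __restrict__ global_acc)     // global gradient accumulator
{
    __shared__ float acc[K][TILE_FEATS];

    // Zero all replicas cooperatively.
    for (int i = threadIdx.x; i < K * TILE_FEATS; i += blockDim.x)
        ((float*)acc)[i] = 0.f;
    __syncthreads();

    // Each thread picks a replica by its lane, spreading atomic conflicts
    // over K copies instead of one.
    int begin = tile_off[blockIdx.x], end = tile_off[blockIdx.x + 1];
    int copy  = threadIdx.x % K;
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        atomicAdd(&acc[copy][local_fid[i]], grad[i]);
    __syncthreads();

    // Reduce the K replicas and flush each feature once to global memory
    // (tiles in the same feature column still share global_acc entries).
    int fbase = tile_fbase[blockIdx.x];
    for (int f = threadIdx.x; f < TILE_FEATS; f += blockDim.x) {
        float s = 0.f;
        for (int c = 0; c < K; ++c) s += acc[c][f];
        atomicAdd(&global_acc[fbase + f], s);
    }
}
```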
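Finally, the multi-stream overlap can be illustrated with a standard double-buffered CUDA pipeline. This is not cuWide's actual scheduler, only a minimal sketch of the underlying idea: while one stream runs the training kernel on batch b, the other stream asynchronously copies batch b+1 to the device, so transfer time hides behind kernel time. train_kernel, the buffer sizes, and the launch configuration are assumptions.

```
#include <cuda_runtime.h>

__global__ void train_kernel(const char* batch) { /* model update elided */ }

// h_batch must point at page-locked (cudaMallocHost) buffers; otherwise
// cudaMemcpyAsync silently degrades to a synchronous copy.
void train_epoch(char** h_batch, int num_batches,
                 char* d_buf[2], size_t batch_bytes)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    dim3 grid(256), block(256);  // placeholder launch configuration
    for (int b = 0; b < num_batches; ++b) {
        int s = b & 1;  // alternate between the two buffers/streams
        // Within stream s, the kernel waits for its own copy; across
        // streams, copy(b+1) overlaps with kernel(b).
        cudaMemcpyAsync(d_buf[s], h_batch[b], batch_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        train_kernel<<<grid, block, 0, stream[s]>>>(d_buf[s]);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```

Because batch b+2 reuses d_buf[s] in the same stream, its copy is queued after kernel(b) and cannot race with it; no extra events are needed for this simple two-buffer rotation.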