In recent years, Graphics Processing Units (GPUs) have gained popularity and momentum in industrial research thanks to the parallel processing power (number of parallel cores) they offer at a reasonable price. A second reason is their wide availability in consumer hardware: their low cost has put GPUs in most consumer devices, from mobile phones and tablets to video game systems.
At the beginning of 2013, EURA NOVA started an internal R&D project codenamed “SNIFF”, whose goal is to study the potential of GPUs in a distributed machine learning framework. This setting combines two levels of parallelism:
- in the distribution of the processing among the computing nodes.
- in the parallel computation on the individual nodes.
Concretely, this means that each node in the data processing cluster is equipped, in addition to its main Central Processing Unit (CPU), with a GPU suitable for massively parallel computation.
Both challenges have been researched extensively in recent years. On the one hand, distributed processing frameworks such as MapReduce and Spark have been developed; EURA NOVA has also worked on AROM, which is based on the DFG execution model. On the other hand, the proliferation of affordable parallel hardware (GPUs, manycores, …) has given considerable momentum to multi-core research. In the context of machine learning, these challenges translate into the following questions:
- Which Machine Learning algorithms are suitable for distribution and how can this be achieved?
- Which sections of these algorithms can be executed in parallel and leverage the parallel processing power provided by the GPU?
The first question has been studied in recent years following the development of MapReduce, while for the second, much effort has been invested in porting fundamental machine learning tasks to run on GPUs (ex: ).
In this study we focused on two fundamental machine learning algorithms: the naive Bayes classifier and k-means clustering. Both algorithms are data parallel, and the literature already contains many proposals for computing the model either distributed across nodes or in parallel on a GPU, but not both at the same time. In the project we reengineered these algorithms to run distributed and delegated many of their parallel sections to the GPU.
The following figure summarizes the architecture of the framework. Given a suitable machine learning algorithm, the workflow is as follows:
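To illustrate why naive Bayes is data parallel, here is a minimal sketch in plain Python. It is not the SNIFF implementation: the function names `partial_counts` and `merge_counts` and the toy data are assumptions made for illustration. The point it shows is that the counts the classifier needs are additive, so each partition can be counted independently and the results summed.

```python
from collections import Counter

def partial_counts(rows):
    """Count class labels and (label, feature index, value) triples in one partition."""
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for index, value in enumerate(features):
            feature_counts[(label, index, value)] += 1
    return class_counts, feature_counts

def merge_counts(partials):
    """Sum the per-partition counts into global counts."""
    total_classes, total_features = Counter(), Counter()
    for class_counts, feature_counts in partials:
        total_classes.update(class_counts)      # Counter.update adds counts
        total_features.update(feature_counts)
    return total_classes, total_features

# Hypothetical toy data set: (features, label) pairs.
rows = [
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "mild"), "yes"),
    (("rainy", "hot"), "no"),
]

# Two partitions stand in for two worker nodes.
partitions = [rows[:2], rows[2:]]
distributed = merge_counts(partial_counts(p) for p in partitions)
single_node = partial_counts(rows)
```

Under these assumptions, `distributed` and `single_node` hold the same counts, which is what allows the counting step to be split across nodes and, within a node, across GPU threads.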
- The global processing is distributed among worker nodes
- Each node computes a part of the model, using the parallel hardware (GPU)
- A combiner then gathers the partial models and combines them into the final global model
- (A master node is used for signal and control)
Figure: Design of SNIFF (see details in text)
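The workflow above can be sketched for one k-means iteration. This is a simplified illustration, not the SNIFF code: `partial_model` and `combine` are hypothetical names, the workers run sequentially in one process, the master node is not modeled, and the per-point loop that would be delegated to the GPU is written in plain Python.

```python
def partial_model(points, centroids):
    """Worker step: per-centroid coordinate sums and point counts for one partition.

    In a GPU-backed worker, this per-point loop is the part that would run in parallel.
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        # Index of the nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
        )
        counts[nearest] += 1
        for d in range(dims):
            sums[nearest][d] += point[d]
    return sums, counts

def combine(partials, centroids):
    """Combiner step: merge the partial models into the new global centroids."""
    dims = len(centroids[0])
    total_sums = [[0.0] * dims for _ in centroids]
    total_counts = [0] * len(centroids)
    for sums, counts in partials:
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dims):
                total_sums[k][d] += sums[k][d]
    # A centroid that received no points keeps its previous position.
    return [
        [s / total_counts[k] for s in total_sums[k]] if total_counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]

# Hypothetical data: one partition per worker node.
partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
centroids = [(1.0, 1.0), (9.0, 9.0)]

new_centroids = combine([partial_model(p, centroids) for p in partitions], centroids)
print(new_centroids)  # [[0.0, 1.0], [10.0, 11.0]]
```

Each worker only sends sums and counts to the combiner, never the raw points, so the data exchanged per iteration is proportional to the number of centroids rather than to the size of the data set.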
Going one step further, our research has given us enough insight to envision the future of distributed processing frameworks as distributed heterogeneous processing platforms. For a framework such as AROM, this means that the particularities of the parallel hardware are hidden behind operators. Each operator packs different implementations of the same task, with and without a GPU requirement, and the one used depends on the capabilities of the host on which the operator is scheduled. For the user, writing a processing job then comes down to composing operators, which handle execution on the GPU hardware themselves.
C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Y. Yu, G. Bradski, A. Ng, and K. Olukotun, “Map-Reduce for Machine Learning on Multicore”, Advances in Neural Information Processing Systems, vol. 19, pp. 281–288, 2007.
L. Lopes and B. Ribeiro, “GPUMLib: An Efficient Open-Source GPU Machine Learning Library”, International Journal of Computer Information Systems and Industrial Management Applications, vol. 3, pp. 355–362, 2010.
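One way to picture such an operator is the sketch below. It does not reflect AROM's actual API: the `Host` and `Operator` classes, the `has_gpu` flag and the `select` method are all hypothetical, and the GPU implementation is a plain Python stand-in so that the example runs on any machine.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Impl = Callable[[Sequence[float]], float]

@dataclass(frozen=True)
class Host:
    """Hypothetical description of a cluster node and its capabilities."""
    name: str
    has_gpu: bool

@dataclass(frozen=True)
class Operator:
    """Hypothetical operator packing a CPU and an optional GPU implementation of one task."""
    name: str
    cpu_impl: Impl
    gpu_impl: Optional[Impl] = None

    def select(self, host: Host) -> Tuple[str, Impl]:
        # Prefer the GPU implementation only when the host can run it.
        if host.has_gpu and self.gpu_impl is not None:
            return "gpu", self.gpu_impl
        return "cpu", self.cpu_impl

    def run(self, host: Host, data: Sequence[float]) -> float:
        _, impl = self.select(host)
        return impl(data)

def sum_cpu(values: Sequence[float]) -> float:
    return sum(values)

def sum_gpu_stand_in(values: Sequence[float]) -> float:
    # Stand-in for a GPU kernel launch (assumption: a real operator would call GPU code here).
    return sum(values)

sum_operator = Operator("sum", cpu_impl=sum_cpu, gpu_impl=sum_gpu_stand_in)
gpu_host = Host("node-a", has_gpu=True)
cpu_host = Host("node-b", has_gpu=False)

for host in (gpu_host, cpu_host):
    backend, _ = sum_operator.select(host)
    print(host.name, backend, sum_operator.run(host, [1.0, 2.0, 3.0]))
```

The job author only refers to `sum_operator`; which implementation runs is decided per host at scheduling time, which is the separation of concerns described above.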