This week I was at the second IEEE Big Data conference in Santa Clara, where I presented our paper “A distributed approach for graph-oriented multidimensional analysis”.
This has become a tradition: I cannot attend a conference without giving you a short summary of its trends and interesting topics.
Big Data problems: Algorithms, Machines and People
We had an excellent keynote from Prof. M. Franklin (AMPLab, UC Berkeley) about the origins of the lab. For him, Big Data problems necessarily involve three important components: (1) the algorithms, referring to machine learning and statistical approaches, (2) the infrastructure, the distributed processing and storage frameworks, and finally (3) the people, because at the end of the day you need smart people to study the problems and solve them with the adequate algorithms and the right infrastructure.
I have to say that I completely agree with this vision; it is exactly what we are building at EURA NOVA.
Highly diversified talks
It was striking to see how narrowly focused the talks were on their respective domains. On one side, we had talks from machine learning people working on significant volumes of data but with absolutely no idea of (and no interest in) scalable architectures. On the other side, we had talks from the scalable-architecture people trying to optimize Hadoop with more specialized solutions, replacing the shuffle phase with more efficient protocols, or making check-pointing lighter, as in the NEC project “Takuya et al., Feliss: Flexible distributed computing framework with light-weight check-pointing”. But those people are usually not interested in solving big machine learning problems. Meanwhile, the few papers that did address the scalability of machine learning on high volumes always used MapReduce or a central GPU approach, which is quite limited compared to frameworks such as Spark, Hyracks and Nephele (Stratosphere, from TU Berlin).
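To make that limitation concrete, here is a minimal sketch (the file name, data layout and the toy gradient-descent task are my own assumptions, not taken from any paper at the conference) of why an in-memory engine such as Spark suits iterative machine learning better than plain MapReduce: the dataset is cached once and reused across iterations, whereas an MR job would re-read it from disk at every pass.

```python
# Hypothetical sketch: iterative 1-D linear regression on Spark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-ml-sketch")

# Assumed input: lines of "x,y" pairs (file name is hypothetical).
data = (sc.textFile("data/points.csv")
          .map(lambda line: tuple(float(v) for v in line.split(",")))
          .cache())  # kept in memory across all iterations
n = data.count()

w, lr = 0.0, 0.01
for _ in range(20):  # each pass hits RAM, not disk
    # Gradient of the mean squared error with respect to the slope w.
    grad = data.map(lambda p: (w * p[0] - p[1]) * p[0]).sum() / n
    w -= lr * grad

print("fitted slope:", w)
sc.stop()
```

In a MapReduce implementation, each of those 20 iterations would be a separate job materializing its input and output on HDFS, which is exactly the overhead these newer frameworks avoid.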
Finally, there were the visualization people, mainly interested in building nice UIs for representing very complex data.
To highlight this diversity, let me give you an interesting exchange from the “Key Issues In Big Data Research” panel organized on the last day. A question from the audience was: “Is there any visualization library today that a data scientist can use? Is it mature?”
We got three answers:
- the DB panelist: visualizing high volumes of data is not a problem; we must use a hierarchical approach and select the data we want to see.
- the ML panelist: well, we do not know in advance which part of the data we are interested in; furthermore, visualizing data with 80 dimensions is useless.
- the visualization panelist: well, we have efficient methods to project different dimensions onto the same space, and asynchronous algorithms to load the data without blocking the UI (a minimal sketch of such a projection follows this list).
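As promised, here is a minimal sketch of the kind of projection the visualization panelist alluded to. The panelist did not name a method, so the choice of PCA and the synthetic 80-dimensional data are my own assumptions for illustration.

```python
# Hypothetical sketch: projecting 80-dimensional data onto 2-D for plotting.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 80))              # 1000 points in 80 dimensions

proj = PCA(n_components=2).fit_transform(X)  # project onto 2 principal axes

plt.scatter(proj[:, 0], proj[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```

This is exactly the kind of bridge the panel was missing: a projection step answers the ML panelist's "80 dimensions are useless" objection before the data ever reaches the UI.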
This gives you a taste of the lack of coordination between research domains that are, nevertheless, all part of the Data Science we have to face today.
[Photo: the Santa Clara Convention Center, where the conference took place.]
Graphs matter!
A lot of talks presented cases involving graphs: recommendation, bioinformatics, marketing, etc. A lot of effort has been put into graph algorithms and into optimizing graph-processing frameworks. Six of the nine papers presented in the Knowledge Management and Big Data Analytics track, and at least 15 regular papers, focused on large graphs or graph-processing frameworks.
Bioinformatics & personalized medicine
It was really interesting to see how bioinformatics researchers have embraced Big Data science. The business reason is quite simple: the cost of sequencing a person's genome has dropped sharply over the last decade and is expected to become affordable for many people soon. This opens the door to a lot of fascinating use cases. It means that when you visit your doctor, he will be able to correlate your symptoms with your particular genome, query clinical studies, look for any patient or group of patients presenting the same genomic characteristics as you, and analyze your Electronic Medical Records to better identify the source of your symptoms. Going further, he will be able to prescribe a treatment that perfectly suits your genomic profile!
This also opens the door to preventive genomic medicine. As in the “Angelina Jolie case”, your doctor will be able to correlate your current health status with your genomic profile and with research and scientific studies in order to quantify the risk of a serious disease. Health insurers would like to use this approach to detect serious diseases before they actually happen and to anticipate the actions needed to prevent them.
There were a bunch of talks in this direction, a domain that requires high-performance distributed graph DBs, distributed machine learning algorithms, NLP approaches and real-time query systems. It is a really amazing domain.
GPUs are heavily used
A lot of machine learning talks proposed optimizations for GPU architectures by redesigning the algorithms. However, those approaches only consider a central rack of GPUs, all accessing the same central memory and the same bunch of disks, which is rarely the case when the volume of data is significant. That is why, at EURA NOVA, we are currently working on a distributed GPU architecture that takes the best of both worlds.
Data Science is fascinating and involves skills from many domains. Let me finish this post by asking you a question: “Which domains do you think are important to master in order to work in Big Data Science?”
Sabri Skhiri
Twitter: @sskhiri