The Big Data Paris 2013 conference was held on April 3rd and 4th. I was quite disappointed by this event. Of course, I knew it was neither a scientific conference nor an ACM or IEEE event. I therefore came mentally prepared to see marketing and business presentations.
But what I saw was incredibly far from what I expected. To keep this post objective, I will skip over the amazingly (yes, I think this is the right term) poor level of the panel discussions and talks. Instead, I want to summarize my overall feeling in this post.
Big Data – Big anything?
The basic ideas conveyed by the different talks were:
- Big Data arrives when you have a really big amount of data (and of course the 3 “V”s of Volume, Velocity and Variety or the versions with 4, 5 or 6 “V”s) … but also when you do not have a lot!
- Big Data is also there when you want to predict something… and also when you don’t.
- Big Data is part of BI, but not really.
- Big Data is about structured data, but also less structured and even unstructured data…
- And it kept going this way…
At least everybody agreed on two certainties: Hadoop is a revolution and Big Data equals Hadoop.
More interestingly, the session that was supposed to show real industrial uses of Big Data included talks that all started with “Hmm, that is not really a big data problem here, but it can become one … one day…”
You find that incoherent and hard to hear from Big Data experts? Well, me too! Unfortunately, this was the reality of the presentations that I attended over these two days.
Hey CTO, you don’t know yet but you have an extremely important issue with your data…
… and we have the right product to address it!
I am quite sure that this kind of event is organized with a single objective: leading CTOs and IT directors into thinking that Big Data is a real need for them and that they need to quickly buy a solution in order to stay competitive.
Not everybody needs a Hadoop or Spark cluster with Storm on top of it running distributed machine learning algorithms! Each project needs to clearly identify the business objectives to reach before starting to talk about how Hadoop on high-end machines can help. Making people think that buying a rack of [Insert the most expensive server of your favorite brand here] will magically solve all their problems borders on intellectual dishonesty!
Big Data and Data science
There was considerable confusion, in the talks I attended, between Big Data and Data Science. Speakers often put storage, distributed processing and algorithms on the same level. In addition, they considered that:
- everything is Big
- any operation (even average calculation) is a data science algorithm
Thus, some presentations concluded that their solutions (e.g. the ‘revolution’ Hadoop) solve everything (all-in-one). Here we touch the essence of the problem: the three levels I mentioned (storage, distributed processing and algorithms) can all be found in a data project, but they cannot be lumped together as the Big Data Problem and then used to sell hardware solutions.
Big Data describes the large amount of data that we need to handle in a project. Data Science is more about the intelligence, the algorithms we need to develop to create value from our data (Big or Small).
What we have to keep in mind here is that business objectives are the priority. We need to think about how to reach those objectives, and this is where Data Science can bring interesting solutions. Only then, depending on the size of the data, can we consider innovative architecture patterns or frameworks.
OK, cool down. So what is a ‘Big Data’ project then?
Before talking about the ‘how’ with a long description of trendy keywords, let’s talk about the ‘what’. Let’s look at a project aiming at extracting value from data:
- What do you want to do?
We first need to define the business objectives to reach: better business operations by reducing the time to deliver products to customers, producing more cars per day per assembly line, increasing the ‘click to buy’ rate on an ad banner, etc.
- What data do you have?
Do you collect data today? Where is it, what does it represent, what are the variables? Is there any other data we could collect?
- Given your objectives, are we missing data?
Can we access it somewhere? Can we buy it or get it from an open data service? Do we need to plan new products to collect it? This step is quite important since it can be integrated into the product portfolio strategies driven by the global objectives your organization wants to reach.
- Designing a model and a way to calculate it:
Your particular objective will lead your data science team to design the right model. Depending on the objectives, it could be an analytical or statistical model, a learning model, a recommendation model, etc. This can use methods from statistics, machine learning, natural language processing, etc. This step is often iterative and never stops at version 1. The data scientists must play with, test, validate and interpret the different versions.
- Now we can start looking at the best way to implement the models and algorithms we chose or designed in the previous step:
According to the size of the data, its nature (highly connected as a graph, or independent tuples), the algorithms to apply and the required latency, the data scientists can, in collaboration with architects, design the right infrastructure for their needs. We can consider here BSP (Bulk Synchronous Parallel) approaches such as MR or data flow graphs (such as Dryad, AROM, Hyracks), new distributed in-memory approaches such as Spark, stream processing (such as Storm, S4), and in-memory and in-database approaches such as MADlib, SAP HANA or newer versions of SAS. Some cases will require redesigning some algorithms to fit the distributed nature of the processing you select. Other cases will absolutely not require this kind of approach and will just need existing algorithms in more traditional software such as IBM SPSS, Weka or RapidMiner.
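To make this reasoning a bit more concrete, here is a minimal Python sketch of such a decision. Everything in it is an assumption made for illustration: the `Workload` fields, the `suggest_approach` function, the order of the checks and above all the thresholds are hypothetical placeholders, not recommendations. A real choice would also weigh the algorithms themselves, the team's skills and the cost.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data project (illustration only)."""
    data_size_gb: float   # volume of data to process
    graph_like: bool      # highly connected data rather than independent tuples
    iterative: bool       # the algorithm makes many passes over the data
    max_latency_s: float  # how quickly an answer is required, in seconds


def suggest_approach(w: Workload, single_machine_limit_gb: float = 100.0) -> str:
    """Return a family of approaches worth evaluating first.

    The 100 GB and 1 second thresholds are arbitrary placeholders.
    """
    # Small data without a tight latency requirement: no cluster needed.
    if w.data_size_gb <= single_machine_limit_gb and w.max_latency_s >= 1.0:
        return "single-machine tools (e.g. Weka, RapidMiner, IBM SPSS)"
    # Answers needed in under a second: look at stream processing.
    if w.max_latency_s < 1.0:
        return "stream processing (e.g. Storm, S4)"
    # Large graph-like data or iterative algorithms.
    if w.graph_like or w.iterative:
        return "distributed in-memory or data flow graph (e.g. Spark, Dryad)"
    # Large independent tuples processed in bulk.
    return "batch processing (e.g. MapReduce)"


if __name__ == "__main__":
    small_report = Workload(data_size_gb=5, graph_like=False,
                            iterative=False, max_latency_s=3600)
    print(suggest_approach(small_report))
```

With these placeholder thresholds, a 5 GB dataset feeding an hourly report falls into the first branch, which is exactly the kind of project that does not need a cluster at all.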
I do not want to give you an exhaustive overview of this domain but simply to highlight the fact that this is more a question of purpose-built optimizations for business objectives. So, before thinking about the hardware and the product a customer needs, we should first define their problem (if they really have one to solve), and only then design the best-suited architecture and algorithms. This shows the increasing importance of the Data Scientist at the core of those activities.