
IEEE Big Data 2015

This year we had the opportunity to publish a paper, Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism, at the IEEE Big Data conference in Santa Clara, CA. This was an excellent opportunity to write a short summary of the trends in the big data area and our personal impressions after one week under the sun with tacos and enchiladas.

Who was there?

Last year, we could see data scientists presenting advanced algorithms in neural networks or graph analytics and trying to scale their approaches, researchers from the infrastructure layer trying to provide the best playground for machine learning algorithms, and finally business domain people, especially from science, healthcare and clinical settings, trying to find solutions to their business problems in the state of the art of distributed infrastructure and machine learning.

Difference 1: it seems that data scientists have left for other conferences such as ICML. There were no real expert data scientists showing genuine contributions to machine learning. Instead, we saw a lot of people using simple or even naive machine learning approaches in a large-scale setting.

Difference 2: many more business domain researchers and industrial players were present, which is a good sign of the penetration of big data in business. We saw newcomers from the automotive, manufacturing, telecom, marketing and e-commerce industries. The most surprising thing was the different levels of expertise and contribution we could find in those papers. In marketing, for instance, we saw naive and simplistic examples [12] but also very interesting approaches such as those from eBay [1] and Groupon [4]. The same was true in manufacturing and predictive maintenance: we saw a very cool talk from Bosch [2] on how they predict whether a part should continue through the industrial process or whether an early defect can be detected. It is a challenging problem, as they need an AUC greater than 0.85 to be economically viable, and moreover they suffer a lot from imbalanced data sets, since only 0.25% of parts are defective. It was interesting to see (1) their approach but also (2) the economics of the prediction and the business model they built.
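
Out of curiosity, here is a minimal sketch of the generic setup such a use case implies (a toy illustration with synthetic data, certainly not Bosch's actual pipeline): train a classifier on a heavily imbalanced pass/defect dataset and judge it by ROC AUC rather than accuracy.

```python
# Toy illustration, not Bosch's pipeline: imbalanced quality prediction
# evaluated with ROC AUC instead of accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for production data; 1% defects here (the talk mentioned
# 0.25%), just to keep this small example stable.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99],
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.3f}")   # the talk's viability bar was 0.85
```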

The healthcare industry showed expert data scientists tackling complicated problems and describing smart solutions. For instance, [3] presented the use case of predicting patient readmission within 30 days, and described and justified why traditional machine learning methods do not fit.
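
Since [3] is about survival regression methods, here is a minimal sketch of what such a model looks like on readmission data, assuming the third-party lifelines library; this is my own illustration with invented data, not the transfer-learning approach presented in the keynote.

```python
# My own toy sketch of survival regression for readmission risk (assumes the
# lifelines library); not the method presented in [3].
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical patient table: days until readmission (or censoring), whether a
# readmission was observed, and a couple of covariates.
df = pd.DataFrame({
    "days_to_event": [5, 30, 12, 45, 8, 60, 22, 30],
    "readmitted":    [1,  0,  1,  0, 1,  1,  0,  0],
    "age":           [70, 55, 80, 62, 75, 50, 68, 58],
    "prior_visits":  [3,  1,  4,  0,  2,  2,  3,  1],
})

cph = CoxPHFitter(penalizer=0.1)   # small penalty keeps the tiny example stable
cph.fit(df, duration_col="days_to_event", event_col="readmitted")
cph.print_summary()                # hazard ratios for age and prior_visits
```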

Trends & important topics

Different domains presented their use cases. I have tried to summarize the most important ones, at least for me.

Marketing
Personas and targeting at Groupon [4]. Beyond the traditional cases of large-scale recommendation systems leveraging text mining and collaborative filtering and tackling the cold-start issue, we also saw interesting new directions such as Groupon's talk about personas. A persona is an archetype of a group representing its key characteristics. It is heavily used by marketing teams to adapt a campaign and its content. In this case, Groupon wanted to learn the different personas and their lifestyles from the deals. The objective was (1) to build a complete user preference profile from the deals, (2) to infer lifestyles based on collaborative filtering and clustering, and finally (3) to build the personas.
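
To make the idea concrete, here is a rough sketch of how deals-based personas could be mined (my own reading of the talk, with synthetic data, not Groupon's actual pipeline): factorize a user-deal interaction matrix, then cluster users in the latent space so that each cluster can be summarized as a persona.

```python
# Sketch only: latent factors + clustering as a stand-in for persona mining.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
interactions = (rng.random((1000, 200)) < 0.05).astype(float)  # users x deals

latent = TruncatedSVD(n_components=20, random_state=0).fit_transform(interactions)
personas = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(latent)

# Each cluster's most frequent deals hint at the lifestyle of that persona.
for p in range(8):
    top_deals = interactions[personas == p].mean(axis=0).argsort()[-5:][::-1]
    print(f"persona {p}: most characteristic deals {top_deals.tolist()}")
```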

Knowledge base construction
There were a couple of papers which aimed at automatically building a knowledge base to be used later by machine learning algorithms. A really good example was given by eBay [5]. They need to provide post-purchase recommendations, but they first need to build a pool of products they could recommend after such a purchase. For instance, if you buy a Samsung phone, you might want to buy the matching Samsung cover. So they first need to build a knowledge base of potential post-purchase products to suggest. This is a challenging task, knowing that products on eBay are only briefly described by the seller with a short text.
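
As a back-of-the-envelope illustration of the problem (and certainly not eBay's method from [5]), a naive post-purchase candidate pool can be derived from co-purchase histories:

```python
# Naive sketch: build a post-purchase candidate pool from purchase histories.
from collections import Counter, defaultdict

# Hypothetical purchase histories, ordered in time, one list per buyer.
histories = [
    ["samsung galaxy s6", "samsung s6 cover", "screen protector"],
    ["samsung galaxy s6", "screen protector"],
    ["iphone 6", "iphone 6 case", "lightning cable"],
    ["samsung galaxy s6", "samsung s6 cover"],
]

followups = defaultdict(Counter)
for history in histories:
    for i, anchor in enumerate(history):
        for later in history[i + 1:]:
            followups[anchor][later] += 1   # "later" was bought after "anchor"

# Keep the most frequent follow-up products as recommendation candidates.
knowledge_base = {anchor: [p for p, _ in counts.most_common(2)]
                  for anchor, counts in followups.items()}
print(knowledge_base["samsung galaxy s6"])  # e.g. ['samsung s6 cover', 'screen protector']
```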

A similar problem was addressed by CareerBuilder [13], who wanted to parse a huge number of resumes and build a knowledge base about skills and employers. This knowledge base can then be used for information retrieval or for recommending jobs and profiles. However, their approach is open to discussion. They extract employer names and skills from resumes and use Wikipedia and Freebase to retrieve the corresponding canonical names and exploit their categorisation to create links. This method heavily depends on (1) the language and (2) the presence of a third-party database such as Wikipedia, which is not available in all countries.
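
The linking step they rely on essentially matches extracted surface forms against an external alias table. The toy sketch below illustrates that dictionary-based linking, with an invented alias table standing in for Wikipedia/Freebase; it is not CareerBuilder's actual pipeline.

```python
# Simplified dictionary-based entity linking (illustration only, not [13]).
import re

# Hypothetical alias table; in [13] this role is played by Wikipedia/Freebase.
alias_table = {
    "international business machines": ("IBM", "Information Technology"),
    "ibm": ("IBM", "Information Technology"),
    "general electric": ("General Electric", "Conglomerate"),
    "ge": ("General Electric", "Conglomerate"),
}

def normalize(surface_form: str) -> str:
    return re.sub(r"[^a-z0-9 ]", "", surface_form.lower()).strip()

def link(surface_form: str):
    """Return (canonical name, category) or None if the KB has no entry."""
    return alias_table.get(normalize(surface_form))

print(link("I.B.M."))      # ('IBM', 'Information Technology')
print(link("Acme GmbH"))   # None: the coverage weakness discussed above
```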

CIM/Data Governance:
It was amazing to see that in different domains, whether marketing, manufacturing or healthcare, data scientists share a common issue: among thousands of terabytes and hundreds of datasets, how can I navigate efficiently through this mass of data in order to build my data set of interest? There was a really inspiring paper from GE [14] showing how they built a semantic layer on top of the data generated by turbine tests. This semantic layer is an ontology that exposes a SPARQL interface, whose queries are pre-compiled to extract data stored in the triple store and time series stored in Hive. Even if the presented architecture can be debated, they implemented what enterprise architecture calls a CIM (Common Information Model). This CIM is then used by data scientists to navigate through the data warehouse and select or build the data set of interest.
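
To give an idea of what querying such a semantic layer could look like, here is a hedged sketch assuming a SPARQL endpoint and the SPARQLWrapper library; the endpoint URL and ontology terms are invented for illustration and do not come from [14].

```python
# Hypothetical query against a turbine-test semantic layer (invented endpoint
# and ontology terms; only the general pattern is illustrated here).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/turbine-tests/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX ex: <http://example.org/ontology#>
    SELECT ?test ?sensor ?seriesLocation
    WHERE {
        ?test a ex:TurbineTest ;
              ex:hasMeasurement ?m .
        ?m    ex:sensor ?sensor ;
              ex:timeSeriesLocation ?seriesLocation .   # e.g. a Hive table/path
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["test"]["value"], row["sensor"]["value"], row["seriesLocation"]["value"])
```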

Industry 4.0:
This is a term I heard for the first time at the conference. It is about leveraging data analytics when a part is designed or used in order to improve overall productivity, efficiency and value.

There was an interesting keynote presentation [6] describing a new kind of emerging business model: a manufacturer A builds a machine that is used by manufacturer B to design and produce a product. A can not only sell the machine but also provide data analytics services in order to improve the productivity and efficiency of manufacturer B.

Let me introduce a few cases to illustrate the concept of Industry 4.0:

  • Quality tests: from a set of production parameters and a set of part attributes, predict whether the part will pass the quality test, and find the most important process parameters influencing quality (see the sketch after this list).
  • Predictive maintenance: from the continuous monitoring of a set of components and a knowledge base of predictive models, detect in advance when a failure is going to occur.
  • Process optimization: from the CAD system in which the part was designed, take all the commands that are going to be sent to the production chain and predict the energy that will be consumed. Going further, as a compiler would do with a cost model, rewrite the commands into equivalent commands with a lower energy impact.
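
Here is the sketch referred to in the quality-test case: a generic, assumed approach (not any vendor's actual system) that predicts pass/fail from synthetic process parameters and ranks the parameters by importance.

```python
# Generic sketch: predict quality-test outcome and rank process parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for process parameters and part attributes.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           random_state=0)
param_names = [f"process_param_{i}" for i in range(10)]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = sorted(zip(param_names, model.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, importance in ranking[:3]:
    print(f"{name}: {importance:.3f}")   # the parameters that drive quality most
```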

This is a really challenging and very interesting domain that is on the rise.

Spatial big data management [7]:
Spatial and temporal data associates locations with any kind of data of interest, such as sales, temperature, crowding level, trajectories, etc. Over the past 20 years, there has been a lot of work on spatio-temporal databases. By the way, I have to say that Belgium has some of the greatest academic leaders in spatio-temporal data warehousing. However, these existing systems are limited in terms of scale. On the other hand, we receive more and more location-based information evolving over time. This new movement aims at building large-scale spatio-temporal analytical systems that rethink:

  • The way the data is stored and indexed
  • How all existing operators can be implemented on big data technologies, but also extended; for instance, how spatial and temporal information that is not stored together can be joined at query time (see the toy sketch after this list)
  • How to compare large trajectories with regard to different spatio-temporal attributes (points of interest, weather, time, past activities, etc.)
  • What kind of language should be exposed
  • What kind of new visualisation system must be developed
  • The new types of prediction: sales, behaviors, target location, etc…
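
As a toy sketch of the storage and join points above (a gross simplification of real spatial indexing), bucketing records into coarse grid cells lets data about the same area land in the same partition and be joined locally:

```python
# Toy grid-cell partitioning and local spatial join (simplification only).
from collections import defaultdict

def cell_key(lat: float, lon: float, cell_size: float = 0.1) -> tuple:
    """Map a coordinate to a coarse grid cell usable as a partition key."""
    return (int(lat // cell_size), int(lon // cell_size))

sales = [(50.85, 4.35, "sale_1"), (50.84, 4.36, "sale_2"), (51.21, 4.40, "sale_3")]
weather = [(50.85, 4.35, "rain"), (51.22, 4.41, "sun")]

partitions = defaultdict(lambda: {"sales": [], "weather": []})
for lat, lon, sale in sales:
    partitions[cell_key(lat, lon)]["sales"].append(sale)
for lat, lon, condition in weather:
    partitions[cell_key(lat, lon)]["weather"].append(condition)

# "Join" sales with weather observations falling in the same cell.
for key, bucket in partitions.items():
    for sale in bucket["sales"]:
        for condition in bucket["weather"]:
            print(f"{sale} happened in cell {key} under {condition}")
```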

Even if this domain has been studied extensively, the big data aspect changes a lot of things and leaves room for a lot of exploration work. It opens new doors in location-based predictions and services, such as subscriber trajectory management for real-time marketing, targeting, lifestyle inference and categorisation, etc.

GPU: APU-HSA

There was a very exciting tutorial from AMD [8] about Heterogeneous System Architecture (HSA). In a nutshell, the idea is to take a distributed CPU-based system and augment it with GPUs. The speaker presented HadoopCL (http://dl.acm.org/citation.cfm?id=2511015), which is able to compile MapReduce jobs into GPU kernels. This is somewhat similar to what we have done at EURA NOVA, where we presented a distributed pipeline engine similar to Flink augmented with GPUs and rewrote a specialized scheduler to deal with heterogeneity. AMD also presented their new APU (Accelerated Processing Unit), which is nothing else than a CPU and a GPU on the same chip, and showed a set of workloads illustrating the relevance of the APU.
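
To make the "MapReduce job as a GPU kernel" idea concrete, here is a toy data-parallel map written with the numba library; it assumes a CUDA-capable GPU and has nothing to do with HadoopCL's actual code generation.

```python
# Toy illustration of the "map phase as a GPU kernel" idea (assumes a
# CUDA-capable GPU and the numba library; unrelated to HadoopCL itself).
import numpy as np
from numba import cuda

@cuda.jit
def square_kernel(values, out):
    i = cuda.grid(1)                 # one thread per element, like a map task
    if i < values.size:
        out[i] = values[i] * values[i]

values = np.arange(1_000_000, dtype=np.float32)
out = np.zeros_like(values)
threads_per_block = 256
blocks = (values.size + threads_per_block - 1) // threads_per_block
square_kernel[blocks, threads_per_block](values, out)  # "map" on the GPU
print(out.sum())                                       # "reduce" back on the CPU
```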

Infrastructure optimisation (scheduling, partitioning, SSP)

This is clearly a trend: how to optimise big data processing, whether at the scheduling layer (using DAG pre-processing), at the data partitioning layer (especially for graph processing with condensed spanning trees), through stream prediction with Gaussian processes in order to better adapt operator parallelism, and many others.
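
As a small illustration of the last idea (a generic sketch, not a specific paper's method), one can fit a Gaussian process on the recent input rate of a stream and size the operator parallelism from the predicted load:

```python
# Generic sketch: GP regression on the stream input rate to size parallelism.
import math
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

minutes = np.arange(60).reshape(-1, 1)                     # last hour of rates
rate = (1000 + 300 * np.sin(minutes.ravel() / 10)
        + np.random.default_rng(0).normal(0, 30, 60))      # synthetic events/s

gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel(),
                              normalize_y=True)
gp.fit(minutes, rate)

mean, std = gp.predict(np.array([[65]]), return_std=True)  # 5 minutes ahead
per_worker_capacity = 200.0                                # assumed events/s per task
parallelism = math.ceil((mean[0] + 2 * std[0]) / per_worker_capacity)
print(f"predicted rate {mean[0]:.0f} ± {std[0]:.0f} ev/s -> parallelism {parallelism}")
```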

Tutorials

Last year I claimed that tutorials were the best way to get, in a few hours, a complete overview of the state of the art in a specific domain. This is true provided that the content is good. This year I was disappointed by the level of some of the tutorials: the predictive maintenance [9] and big data platform [10] tutorials were very basic, while the tutorials on big spatial data [7] and on GPUs and HSA [8] were quite good. The Big Graph tutorial [11], on the other hand, was just a showcase of IBM System G with a set of completely irrelevant demos and use cases. When the speaker is an expert in their field, tutorials remain a good way to efficiently learn a lot of information!

References

[1] Chandra Khatri, Suman Voleti, Sathish Veeraraghavan, Nish Parikh, Atiq Islam, Shifa Mahmood, Neeraj Garg, and Vivek Singh, Algorithmic Content Generation for eBay Products

[2] Scott C. Hibbard, Big Data and Industry 4.0, keynote presentation, Special Session I: From Data to Insight: Big Data and Analytics for Advanced Manufacturing Systems

[3] Chandan Reddy, Transfer Learning and Survival Regression Methods for Patient Risk Prediction, keynote presentation, Deriving Value from Big Data in Healthcare Workshop at the 2015 IEEE Big Data Conference

[4] Kang Li, Vinay Deolalikar, and Neeraj Pradhan, Mining Lifestyle Personas at Scale in E-commerce

[5] Chandra Khatri, Suman Voleti, Sathish Veeraraghavan, Nish Parikh, Atiq Islam, Shifa Mahmood, Neeraj Garg, and Vivek Singh, Algorithmic Content Generation for eBay Products

[6] Jay Lee, Industrial Big Data and Predictive Analytics for Future Smart Systems, keynote presentation, Special Session I: From Data to Insight: Big Data and Analytics for Advanced Manufacturing Systems

[7] Mohamed F. Mokbel, Ahmed Eldawy, Tutorial 2: The Era of Big Spatial Data

[8] Mayank Daga, Mauricio Breternitz, Junli Gu, Tutorial 1: Optimization Big Data Analytics on Heterogeneous Processors

[9] Zhuang Wang, Tutorial 6: Tutorial on Predictive Maintenance

[10] Chandan K. Reddy, Tutorial 3: Platforms and Algorithms for Big Data Analytics

[11] Toyotaro Suzumura, Ching-Yung Lin, Yinglong Xia, Lifeng Nai, Tutorial 5: The World is Big and Linked: Whole Spectrum Industry Solutions towards Big Graphs

[12] Xiuqiang He, Wenyuan Dai, Guoxiang Cao, Huyang Sun, Mingxuan Yuan, and Qiang Yang, Mining Target Users for Online Marketing based on App Store Data

[13] Mayank Kejriwal, Qiaoling Liu, Ferosh Jacob, and Faizan Javed, A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases

[14] Jenny Williams, Paul Cuddihy, Justin McHugh, Kareem Aggour, and Arvind Menon, Semantics for Big Data Access & Integration: Improving Industrial Equipment Design through Increased Data Usability
