Third Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last month, EURA NOVA research centre organised the third workshop on real-time and stream analytics in big data, co-located with the 2018 IEEE International Conference on Big Data in Seattle. The workshop brought together leading actors in the field, including data Artisans, the University of Virginia and Télécom ParisTech, as well as 9 well-known speakers from 6 different countries. We received more than 30 submissions and we are proud to have hosted such interesting paper presentations in data architecture, stream mining, complex event processing and IoT.

The workshop was a real success, with captivating talks and plenty of interesting questions and comments. If you could not attend the event, our R&D engineer Syrine Ferjaoui has brought back the key takeaways from the keynotes and the presented papers.

 

First keynote speaker:

First of all, the workshop started with the keynote of Fabian Hueske, Apache Flink PMC member and co-founder of data Artisans. His talk “Unified Processing of Static and Streaming Data with SQL on Apache Flink” presented Flink’s features and its unified relational APIs for batch and streaming data. Fabian Hueske stressed the importance of unifying stream and batch processing for two main reasons: usability and portability. Flink includes a set of features such as materialised views to speed up analytical queries, dynamic tables, update propagation and processing, continuous queries, approaches to handling time in stream processing, watermarks and queries on infinite-sized tables. With all these features, Flink helps its users build data pipelines with low-latency ETL, run stream and batch analytics, and power live dashboards.
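To make the idea of dynamic tables and continuous queries a bit more concrete, here is a minimal, Flink-independent sketch in plain Python (this is not Flink or PyFlink code, just an illustration of the concept): the same aggregation logic is run once over a finite table and incrementally over a stream, where it keeps a materialised view up to date as each event arrives.

```python
from collections import defaultdict

# Minimal sketch of the "dynamic table / continuous query" idea:
# the same aggregation serves a finite (batch) table and an unbounded stream,
# and the streaming version keeps a materialised view up to date incrementally
# instead of recomputing it from scratch.

def batch_clicks_per_user(rows):
    """Batch version: one pass over a finite table."""
    view = defaultdict(int)
    for user, _url in rows:
        view[user] += 1
    return dict(view)

class ContinuousClicksPerUser:
    """Streaming version: the view is updated on every incoming event."""
    def __init__(self):
        self.view = defaultdict(int)

    def on_event(self, user, _url):
        self.view[user] += 1          # propagate the update to the view
        return dict(self.view)        # current state of the materialised view

if __name__ == "__main__":
    rows = [("alice", "/a"), ("bob", "/b"), ("alice", "/c")]
    print(batch_clicks_per_user(rows))     # {'alice': 2, 'bob': 1}

    cq = ContinuousClicksPerUser()
    for event in rows:                      # same logic, unbounded input
        snapshot = cq.on_event(*event)
    print(snapshot)                         # identical result on the same data
```

In Flink, both versions would simply be the same SQL query, executed once over a static table or as a continuous query over a dynamic table.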

Our research director Sabri Skhiri adds: “Apache Flink is currently working on a set of connectors. They already have the HDFS and JDBC sinks, and since they are pushing Flink as the standard technology for data pipelines and materialised views, they want to expand their connector set.”

 

Second keynote speaker:

Secondly, our research director Sabri Skhiri talked about data management, and stream and real-time analytics. His talk “The challenge of Data Management in the Big Data Era & its underlying Enterprise architecture shift” started by defining data architecture as a global plan describing how to collect, store, use and manage data, answering the 8 main challenging questions that are essential to building a solid and efficient solution. During his talk, our director presented deriving microservices from data streams as the new wave of architecture and discussed the Data Architecture Vision (DAV) built up throughout 10 years of research and development at EURA NOVA. The DAV later led to the development of digazu, a data engineering platform containing all the different components needed to collect, store, govern, transform, and analyse all the data in a company’s IT environment.

 

Workshop Invited Speakers:

After the keynotes, nine selected papers were presented, covering four main topics: (1) Data Streaming Architecture, (2) CEP/CER, (3) Stream Mining and (4) IoT Device Integration:

Isah and Zulkernine (Queen’s University, Kingston, Canada) propose a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform. Our R&D engineer Syrine Ferjaoui explains: “The ingestion layer, which integrates Apache NiFi and Kafka, decouples the acquisition, buffering, pre-processing and distribution of data streams from the streaming analytics layers. This NiFi-Kafka “NiFKaf” integration takes advantage of the high configurability of NiFi and of the support for many data consumers provided by Kafka. This way, it supports many data sources, languages and content formats, ensures high throughput and low latency, supports large numbers of data consumers, enables data buffering during temporary spikes in workload, employs a replay mechanism, and is scalable”.
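As an illustration of what this decoupling buys downstream consumers, here is a small, hypothetical sketch (not the authors' code) of an analytics process reading pre-processed records from a Kafka topic fed by NiFi, using the kafka-python client; the topic name, broker address and JSON payload format are assumptions made for the example.

```python
# Illustrative sketch only: an analytics layer consuming pre-processed records
# from a Kafka topic written by NiFi, fully decoupled from the data sources.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "nifi-preprocessed-events",          # hypothetical topic fed by NiFi
    bootstrap_servers=["localhost:9092"],
    group_id="stream-analytics",         # several consumer groups can read independently
    auto_offset_reset="earliest",        # allows replaying buffered data
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # ... downstream analytics go here; the ingestion layer has already
    # acquired, buffered and pre-processed the raw feeds.
    print(event)
```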

 

The paper by Trinks & Felden (TU Bergakademie Freiberg, Germany) presents edge computing, an approach that extends cloud computing towards the edge of the network. It describes an architecture scheme consisting of 3 layers: a node layer (gadgets, smartphones, embedded systems, sensors), an edge layer (routers, switches, small/macro base stations) and a cloud layer (data centres, servers, databases, storage). Edge computing is used to minimise energy consumption, bandwidth usage and latency, to increase safety and privacy, and it employs real-time analytics within its architecture.
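As a toy illustration of why the edge layer reduces bandwidth and energy, the following sketch (ours, not the paper's) has an edge node summarise raw node-layer readings locally and forward only compact aggregates to the cloud layer; all names and numbers are invented for the example.

```python
# Toy illustration: an edge node aggregates raw sensor readings from the node
# layer and forwards only a compact summary to the cloud layer, trading a little
# local computation for far less bandwidth.
from statistics import mean

def edge_summarise(readings, window=10):
    """Reduce each block of `window` raw readings to one (min, mean, max) summary."""
    summaries = []
    for i in range(0, len(readings), window):
        chunk = readings[i:i + window]
        summaries.append((min(chunk), mean(chunk), max(chunk)))
    return summaries

raw = [20.1, 20.3, 20.2, 20.5, 25.0, 20.4, 20.2, 20.3, 20.1, 20.6,
       20.2, 20.4, 20.3, 20.5, 20.2, 20.1, 20.3, 20.4, 20.6, 20.2]
to_cloud = edge_summarise(raw, window=10)
print(f"{len(raw)} raw values -> {len(to_cloud)} summaries sent upstream")
print(to_cloud)
```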

 

Link prediction refers to estimating the likelihood of a link appearing in the future based on the current state of a graph. Previous approaches to link prediction, such as sketch-based methods and work on dynamic attributed networks, do not give exact results and cannot handle deletions or modifications in the graph, nor large volumes of data. The goal of the authors (University of Louisiana, USA) is to design a graph-processing approach for link prediction that ensures real-time prediction and the extraction of accurate features from the graph with exact results. Syrine details: “Graph processing can be edge-centric, vertex-centric or neighbourhood-centric. This paper proposes two new graph-processing frameworks for handling graph streams: vertex-centric processing and neighbourhood-centric processing. These frameworks are able to predict 100% of the links with an average graph ingestion time between 149.3 and 242.7 ms”.
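To give a feel for what a vertex- or neighbourhood-centric view of a graph stream looks like, here is a generic sketch (not the authors' frameworks): the graph is maintained incrementally as edges arrive or are deleted, and candidate links are scored with an exact common-neighbours feature computed from the current neighbourhoods.

```python
# Generic sketch: maintain the graph incrementally as edges stream in, and score
# candidate links with an exact common-neighbours feature derived from each
# vertex's current neighbourhood.
from collections import defaultdict

class StreamingGraph:
    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):            # deletions are handled exactly
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def common_neighbours(self, u, v):
        """Exact link-prediction feature for the candidate pair (u, v)."""
        return len(self.adj[u] & self.adj[v])

g = StreamingGraph()
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]:
    g.add_edge(u, v)

# Higher score -> higher likelihood that the link (u, v) appears in the future.
print(g.common_neighbours("a", "d"))   # 1 (shared neighbour: c)
g.remove_edge("c", "d")                # deletions update the feature exactly
print(g.common_neighbours("a", "d"))   # 0
```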

 

Researchers from the University of New Mexico have developed a robust distributed matching system, called DisPatch. In a scenario where multiple data sources or producers publish data to the Kafka system, DisPatch is the data consumer that matches a pattern with a guaranteed maximum delay after the pattern appears in the stream. Syrine reacts: “Given a time series T of length n, and a query Q of length m, it normally takes O(nm) to calculate the Euclidean distance/correlation between Q and all subsequences of T, but this method calculates the results in O(log(n)) by exploiting the overlaps. As a result, DisPatch guarantees exactness and bounded delay at the same time”.
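The overlap trick can be illustrated with a standard convolution-based distance profile (a common baseline, not DisPatch's O(log(n)) incremental scheme): the squared Euclidean distance between Q and every length-m subsequence of T decomposes into sliding sums that are all computed in one pass over T.

```python
# Illustration only: computing the (non-normalised) Euclidean distance between a
# query Q and every length-m subsequence of T by exploiting overlaps, via the
# identity ||T[i:i+m] - Q||^2 = sum(T^2 over window) - 2 * (sliding dot product) + ||Q||^2.
import numpy as np

def distance_profile(T, Q):
    T, Q = np.asarray(T, float), np.asarray(Q, float)
    m = len(Q)
    # Sliding dot products of Q against every window of T (one convolution).
    dots = np.convolve(T, Q[::-1], mode="valid")
    # Sliding sums of squares of T over each window, via cumulative sums.
    csum = np.concatenate(([0.0], np.cumsum(T ** 2)))
    window_sq = csum[m:] - csum[:-m]
    dist_sq = window_sq - 2.0 * dots + np.sum(Q ** 2)
    return np.sqrt(np.maximum(dist_sq, 0.0))

T = np.sin(np.linspace(0, 10, 200))
Q = T[50:70]                          # the query occurs at offset 50
profile = distance_profile(T, Q)
print(int(np.argmin(profile)))        # -> 50, the best match
```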

 

In this paper, the authors (Adobe Research, California, USA) discuss Adobe’s Identity Graph, which provides a comprehensive solution to the challenge posed by the fragmentation of identities. Our R&D engineer details: “The identity graph helps connect data across channels, domains and devices to solve a fundamental problem in the Digital Marketing domain. The fragmented profiles of a consumer are linked together in order to provide a unified view across devices. This means that an identity graph connects all the known identifiers that correlate with an individual consumer. The researchers build identity relationships by using both online data traffic and offline CRM data logs from customers’ backend systems. To do that, they use two approaches: deterministic linking and probabilistic linking, combining them by taking the deterministic links as a base and expanding them with probabilistic clusters”.
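Here is a minimal sketch of the deterministic-linking step (our illustration, not Adobe's implementation; the probabilistic expansion is not shown): identifiers observed together in the same event are merged into one profile cluster with a union-find structure. The event contents and identifier names are invented.

```python
# Sketch of deterministic linking only: identifiers seen together in one event
# are merged with a union-find structure into the same consumer profile.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Hypothetical events: each one carries the identifiers observed together
# (cookie, device id, hashed e-mail from offline CRM logs, ...).
events = [
    {"cookie:123", "device:A"},
    {"device:A", "email:hash1"},          # offline CRM record links the e-mail
    {"cookie:999", "device:B"},
]

uf = UnionFind()
for ids in events:
    first, *rest = ids
    for other in rest:
        uf.union(first, other)

# Identifiers sharing a root belong to the same (deterministic) consumer profile.
print(uf.find("cookie:123") == uf.find("email:hash1"))   # True: same person
print(uf.find("cookie:123") == uf.find("cookie:999"))    # False: different clusters
```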

 

The authors (Purdue University, USA) propose a novel fitting algorithm for big-data logistic regression that combines Fisher scoring and iteratively reweighted least squares (IRWLS). Syrine details: “The revised IRWLS algorithm can break the memory barrier and is suitable for streamed computing. It is row-updatable and does not need to load the whole dataset into memory. The algorithm also converges quickly (usually in around 3 iterations). The limitation of this method is that it targets structured data with a large number of rows (n) and a small number of columns (p)”.
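The row-updatable property comes from the fact that each IRWLS/Fisher-scoring iteration only needs the p-by-p matrix X'WX and the p-vector X'Wz, which can be accumulated chunk by chunk. The sketch below shows the textbook version of this idea in Python with NumPy (a simplified illustration, not the paper's revised algorithm).

```python
# Streamed IRWLS for logistic regression: each iteration makes one pass over the
# data in chunks and only accumulates X'WX (p x p) and X'Wz (p), so the full
# dataset never has to fit in memory.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def streaming_irwls(chunks, p, n_iter=8):
    beta = np.zeros(p)
    for _ in range(n_iter):
        XtWX = np.zeros((p, p))
        XtWz = np.zeros(p)
        for X, y in chunks():                    # one streamed pass per iteration
            mu = sigmoid(X @ beta)
            w = mu * (1.0 - mu)                  # IRWLS weights
            z = X @ beta + (y - mu) / np.maximum(w, 1e-12)   # working response
            XtWX += (X * w[:, None]).T @ X
            XtWz += X.T @ (w * z)
        beta = np.linalg.solve(XtWX, XtWz)
    return beta

# Tiny synthetic example: data served in chunks of 100 rows.
rng = np.random.default_rng(0)
X_all = np.c_[np.ones(1000), rng.normal(size=(1000, 2))]
true_beta = np.array([0.5, 2.0, -1.0])
y_all = rng.binomial(1, sigmoid(X_all @ true_beta))

chunks = lambda: ((X_all[i:i + 100], y_all[i:i + 100]) for i in range(0, 1000, 100))
print(streaming_irwls(chunks, p=3))   # close to [0.5, 2.0, -1.0]
```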

 

Dynamic Time Warping (DTW) is able to match natural time series with similar shapes but different pattern lengths. The authors (Linnaeus University, Sweden) described enhancements to the DTW algorithm that allow it to be used efficiently in a streaming scenario. Syrine explains: “Their solution is composed of 3 parts: (1) a very fast implementation of DTW, (2) an append operation for DTW that works in linear or constant time, and (3) an approximation of a sliding window that allows DTW to forget old time steps, improving the processing of “never-ending” streams. In short, DTW encapsulates all the data behaviour information in a single value and enables the use of a tiny fraction of the data compared to the original sensed data, while still obtaining highly accurate results”.
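To illustrate why an append operation can be cheap, here is a simplified sketch (not the authors' implementation): for a fixed query, only the last column of the DTW dynamic-programming matrix needs to be kept, and a newly arrived stream value extends it in time linear in the query length.

```python
# Simplified sketch of the "append" idea behind streaming DTW: keep only the
# last column of the dynamic-programming matrix for a fixed query q, and extend
# it in O(len(q)) time when a new stream value arrives.
class StreamingDTW:
    def __init__(self, query):
        self.q = query
        self.col = None   # last DTW column: costs of q against the stream so far

    def append(self, x):
        """Ingest one new stream value and return the updated DTW distance."""
        new_col = [0.0] * len(self.q)
        for i, qi in enumerate(self.q):
            cost = abs(qi - x)
            if self.col is None:                  # first stream value: boundary column
                new_col[i] = cost + (new_col[i - 1] if i else 0.0)
            else:
                candidates = [self.col[i]]        # previous column, same query index
                if i:
                    candidates += [self.col[i - 1], new_col[i - 1]]
                new_col[i] = cost + min(candidates)
        self.col = new_col
        return self.col[-1]   # DTW distance between q and the whole stream so far

sdtw = StreamingDTW(query=[1.0, 2.0, 3.0, 2.0])
for value in [1.1, 1.9, 3.2, 2.1, 2.0]:
    print(round(sdtw.append(value), 3))
```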

 

There is a rapid emergence of new applications involving mobile wireless sensor networks (MWSN) in the field of the Internet of Things (IoT). Although useful, MWSNs still carry the restrictions of limited memory, energy, and computational capacity. At the same time, the amount of data collected in the IoT is increasing exponentially. The authors (Florida International University, USA) propose Behavior-Based Trend Prediction (BBTP), a data abstraction and trend prediction technique designed to address the limited-memory constraint in addition to providing future trend predictions. Predictions made by BBTP can be employed by real-time decision-making applications and data monitoring.
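As a generic illustration of the problem setting (this is not the BBTP algorithm itself), the sketch below keeps only a bounded abstraction of the stream, namely a fixed number of bucket averages, and fits a least-squares trend on that abstraction to predict upcoming values; the bucket sizes and toy signal are invented.

```python
# Generic illustration of bounded-memory data abstraction plus trend prediction
# (not BBTP): summarise the stream into a fixed number of bucket averages and
# extrapolate a least-squares trend from them.
from collections import deque

class BoundedTrendPredictor:
    def __init__(self, bucket_size=10, max_buckets=20):
        self.bucket_size = bucket_size
        self.current = []
        self.buckets = deque(maxlen=max_buckets)   # memory stays bounded

    def add(self, value):
        self.current.append(value)
        if len(self.current) == self.bucket_size:
            self.buckets.append(sum(self.current) / self.bucket_size)
            self.current = []

    def predict_next_bucket(self):
        """Least-squares slope over the bucket means, extrapolated one step ahead."""
        n = len(self.buckets)
        if n < 2:
            return self.buckets[-1] if self.buckets else None
        xs = range(n)
        x_mean = (n - 1) / 2
        y_mean = sum(self.buckets) / n
        slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, self.buckets))
                 / sum((x - x_mean) ** 2 for x in xs))
        return y_mean + slope * (n - x_mean)       # predicted mean of the next bucket

p = BoundedTrendPredictor(bucket_size=5, max_buckets=8)
for t in range(60):
    p.add(0.5 * t)                     # steadily increasing sensor reading
print(round(p.predict_next_bucket(), 2))   # -> 31.0, the mean of the next (unseen) bucket
```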

 

Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compressing data streams in energy-constrained systems such as connected objects. In this paper, Li, Sarbishei, Nourani and Glatard (Concordia University & Motsai Research, Canada) investigate the extension of LTC to higher dimensions. Syrine adds: “They described how multi-dimensional LTC compression saves substantial amounts of energy (up to 20%) and is feasible on connected objects. The implementation with the Euclidean norm is more intuitive for nD sensors than the one with the infinity norm, but it is also more CPU- and memory-intensive and leads to lower compression ratios”.
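For readers unfamiliar with LTC, here is a simplified one-dimensional sketch of the principle (the paper's contribution is precisely the extension of the tolerance test to n dimensions with a Euclidean or infinity norm, which is not shown here): a sample is emitted only when no straight line from the last emitted point can stay within ±eps of all the samples seen since then.

```python
# Simplified 1-D sketch of Lightweight Temporal Compression (LTC): keep a cone of
# admissible slopes from the last emitted point; when the cone becomes empty,
# emit an approximated point and restart the cone from it.
def ltc_compress(samples, eps):
    """samples: list of (t, v) pairs with strictly increasing t; returns kept points."""
    if len(samples) < 2:
        return list(samples)
    t0, v0 = samples[0]
    kept = [(t0, v0)]
    hi = lo = None                          # admissible slope cone from (t0, v0)
    last_t = t0
    for t, v in samples[1:]:
        up_s = (v + eps - v0) / (t - t0)
        lo_s = (v - eps - v0) / (t - t0)
        if hi is None:                      # first sample after a (re)start
            hi, lo = up_s, lo_s
        elif min(hi, up_s) < max(lo, lo_s): # cone is empty: emit an approximation
            v_mid = v0 + (hi + lo) / 2 * (last_t - t0)   # midpoint of the old cone
            kept.append((last_t, v_mid))
            t0, v0 = last_t, v_mid
            hi = (v + eps - v0) / (t - t0)
            lo = (v - eps - v0) / (t - t0)
        else:
            hi, lo = min(hi, up_s), max(lo, lo_s)
        last_t = t
    kept.append((last_t, v0 + (hi + lo) / 2 * (last_t - t0)))
    return kept

signal = ([(t, 0.1 * t) for t in range(50)]
          + [(50 + t, 5.0 - 0.2 * t) for t in range(50)])
compressed = ltc_compress(signal, eps=0.05)
print(f"{len(signal)} samples -> {len(compressed)} kept")   # only a few points survive
```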

 

Special thanks to our keynote speaker Fabian Hueske, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!