Skip to content

Fourth Workshop on Real-Time and Stream Analytics in Big Data: key takeaways

Last December, Eura Nova’s research center held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom Paris Tech as well as 8 renowned speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in stream mining, IoT, and industry 4.0.

The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.

 

First keynote speaker:

First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.

Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.

Pulsar has other interesting features such as durability, low latency, high throughput, high availability, unified messaging model, high scalability, native computing, … The roadmap includes working on Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. They are also working on higher-level messaging features.”

 

Second keynote speaker:

The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.

To design software, we, developers, used to separate the application logic from the database. To scale the database capacity, we then started to use a search index to do ETL jobs and query the database in a fast and optimal way. However, this created bugs in the software, added data consistency issues, and created more complexity in the system. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.

At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The KafkaStreams design is a set of multiple lambda stateful functions, which makes it a good fit for a microservices architecture.  With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.

Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.

John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.

 

Workshop Invited Speakers:

After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceeding of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at research@euranova.eu to discuss how you can leverage stream processing in your projects.

Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.

Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.

The paper is written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present the Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. For this, they identify the important requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.

  • Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)

Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm on time-series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.

 

  • Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)

Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.

 

Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.

 

Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoiding errors and low quality of printed prototypes in real time. This can result in saving resources and increasing efficiency when developing new products.

 

This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT networks monitoring. She was in charge of implementing the streaming engine of the dedicated platform of the competition. Dihia explained its components, the technologies used, and the challenges met to build the platform. At the end, the platform was able to provide multiple streams to multiple users, to receive multiple streams, to process them and to provide the leader board and live results.

 

Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!

 

Releated Posts

Privacy Enhancing Technologies 2024: A Summary

For Large Language Models (LLMs), Azure confidential computing offers TEEs to protect data integrity throughout various stages of the LLM lifecycle, including prompts, fine-tuning, and inference. This ensures that all
Read More

IEEE Big Data 2023 – A Summary

Our CTO, Sabri Skhiri, recently travelled to Sorrento for IEEE Big Data 2023. In this article, Sabri explores for you the various keynotes and talks that took place during the
Read More