In April 2022, our research director Sabri Skhiri travelled to Zurich to attend the Privacy Enhancing Technologies Summit 2022, dedicated to PETs and their uses: enhancing data security, facilitating compliance, and creating value.
The research community at its best!
This week, our research director Sabri Skhiri flew to Milan with other leaders in stream reasoning. They met to push forward efficient decision-making and context detection over rapidly changing data.
Congratulations to our research director Sabri Skhiri on his appointment as industry co-chair of the International Conference on Distributed and Event-Based Systems.
After the success of five international workshops co-located at IEEE Big Data, the MDPI Data Journal is dedicating a special issue to real-time stream analytics, stream mining, CER/CEP and stream data management in big data.
Reinforcement learning is one of the most active research areas in artificial intelligence and applies to a wide range of use cases in different sectors. To provide students with the skills needed in a transforming AI landscape, the ENSI school invited us to give a course on the subject.
Last Saturday, our Tunisian team Safa, Ichraf Hamza and Amine took part in the ENSI (Ecole Nationale des Sciences de l’Informatique) virtual forum to share their experience and meet the students! Our graph specialist Amine Ghrab talked to students about the power of graph analytics.
Last Thursday, our engineer Amine Ghrab presented the BI on Graph project during his PhD public defense. Amine did an amazing job at the intersection of industry and academia. Amine’s thesis was done in collaboration with the CODE/WIT Lab of the Université Libre de Bruxelles and the Universitat Politècnica de Catalunya, with the support of Prof. Oscar Romero & Prof. Esteban Zimanyi!
In his PhD thesis, Amine defined how BI environments can be enriched with graph data structures. Over the past decade, business and social environments have become increasingly complex and interconnected. As a result, graphs have emerged as a widespread abstraction tool at the core of the information infrastructure that supports these environments. In particular, the integration of graphs into data warehouse systems has appeared as a way to extend current information systems with graph management and analysis capabilities. Going further, Amine redefined the concept of a multidimensional cube on graphs and showed how it can open new doors for data analysts. Finally, he showed how a graph data warehouse architecture can be defined.
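To give a flavour of what a multidimensional cube on a graph looks like, here is a minimal, hypothetical sketch (my own illustration, not Amine's actual framework): edges of a property graph carry dimension attributes and a measure, and rolling the graph up along chosen dimensions yields OLAP-style cuboids.

```python
from collections import defaultdict

# Toy property graph: edges between customer and company nodes, each edge
# carrying dimension attributes (region, year) and a measure (amount).
edges = [
    ("alice", "acme", {"region": "EU", "year": 2019, "amount": 100}),
    ("bob",   "acme", {"region": "EU", "year": 2019, "amount": 50}),
    ("carol", "init", {"region": "US", "year": 2020, "amount": 70}),
]

def graph_cuboid(edges, dims):
    """Roll the graph up along the given dimensions: edges sharing the
    same dimension values are merged and their measures summed, OLAP-style."""
    cuboid = defaultdict(int)
    for _, _, attrs in edges:
        key = tuple(attrs[d] for d in dims)
        cuboid[key] += attrs["amount"]
    return dict(cuboid)

print(graph_cuboid(edges, ["region"]))          # coarse cuboid, by region
print(graph_cuboid(edges, ["region", "year"]))  # finer-grained cuboid
```

Each choice of dimensions produces one cuboid of the cube; a real graph cube additionally keeps the aggregated topology, not just the measures.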
Congratulations on your achievements!
You can find below a list of related publications:
- TopoGraph: an End-To-End Framework to Build and Analyze Graph Cubes
- GraphOpt: a Framework for Automatic Parameters Tuning of Graph Processing Frameworks
- Graph BI & Analytics: Current State and Future Challenges
- Discovering interesting patterns in large graph cubes
- A Framework for Building OLAP Cubes on Graphs
Two weeks ago, our young research engineers Hounaida Zemzem and Rania Saidi were in New York for the Thirty-Fourth AAAI Conference on Artificial Intelligence. The conference promotes research in artificial intelligence and fosters scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. Rania and Hounaida attended dozens of technical paper presentations, workshops, and tutorials on their favourite research areas: reinforcement learning for Hounaida and graph theory for Rania. What were the big trends and their favourite talks? Let’s find out with them!
The Big Trends:
Rania says: “The conference focused mostly on advanced AI topics such as graph theory, NLP, Online Learning, Neural Nets Theory and Knowledge Representation. It also looked into real-world applications such as online advertising, email marketing, health care, recommender systems, etc.”
Hounaida adds: “I thought it was very successful given the large number of attendees as well as the quality of the accepted papers (7,737 submissions were reviewed and 1,591 accepted). The talks showed the power of AI to tackle problems or improve situations in various domains.”
Favourite talks and tutorials
Hounaida explains: “Several of the sessions I attended were very insightful. My favourite talk was given by Mohammad Ghavamzadeh, an AI researcher at Facebook. He gave a tutorial on Exploration-Exploitation in Reinforcement Learning. The tutorial by William Yeoh, assistant professor at Washington University in St. Louis, was also amazing. He talked about Multi-Agent Distributed Constrained Optimization. Both their talks were clear and funny.”
Rania’s feedback? “One of my favourite talks was given by Yolanda Gil, the president of the Association for the Advancement of Artificial Intelligence (AAAI). She gave a personal perspective on AI and its watershed moments, demonstrated the utility of AI in addressing future challenges, and insisted on the fact that AI is now necessary to science. I also learned a lot about the state of the art in graph theory. The tutorial given by Yao Ma, Wei Jin, Lingfei Wu and Tengfei Ma was really interesting. They explained Graph Neural Networks: Models and Applications. Finally, the tutorial presented by Chengxi Zang and Fei Wang about Differential Deep Learning on Graphs and its Applications was excellent. Both were really inspiring and generated a lot of ideas about how to continue to expand my research in the field!”
A personal selection by Rania & Hounaida of interesting papers to check out:
- Generalizable Resource Allocation in Stream Processing via DRL, by Xiang Ni, Jing Li, Mo Yu, Wang Zhou, and Kun-Lung Wu. This paper considers the problem of resource allocation in stream processing, where continuous data flows must be processed in real-time in a large distributed system.
- Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks, by Fabio Pardo, Vitaly Levdik, and Petar Kormushev. The authors propose using convolutional network outputs (Q-values) to generate several sub-goals at once, in order to better guide the agents.
- From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning, by George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. The paper tackles the problem of constructing abstract representations for planning in high-dimensional, continuous environments.
- Optimizing Reachability Sets in Temporal Graphs by Delaying, by Argyrios Deligkas and Igor Potapov.
- Learning Hierarchy-Aware Knowledge Graph Embeddings for Link Prediction, by Zhanqiu Zhang, Jianyu Cai, Yongdong Zhang, and Jie Wang. The authors propose a novel knowledge graph embedding model that maps entities into a polar coordinate system to reflect the hierarchy.
- Multi-View Multiple Clustering using Deep Matrix Factorization, by Shaowei Wei, Jun Wang, Guoxian Yu, Carlotta Domeniconi, and Xiangliang Zhang. The paper introduces a solution to discover multiple clusterings. It gradually factorizes multi-view data matrices into representational subspaces layer-by-layer and generates one clustering in each layer.
After attending their first conference as Euranovians, what will Rania & Hounaida remember? Hounaida concludes: “Going to New York for the AAAI-20 Conference as one of the ENX data scientists was an amazing experience. I met many brilliant and sharp international experts in various fields. I enjoyed the one-week talks with so many special events, offline discussions, and the night strolls!”
Which direction are stream and complex event processing going to take? Last week, the world’s best-known international researchers met in Schloss Dagstuhl, Germany, to present and discuss their research. Among the participants were Avigdor Gal, Professor at the Israel Institute of Technology; Alessandro Margara, Assistant Professor at the Polytechnic University of Milan; and Till Rohrmann, engineering lead at Ververica.
Invited to talk about the requirements and needs from the industry, our R&D director Sabri Skhiri explains: “The seminar brought together world-class computer scientists and practitioners working on complex event recognition, distributed systems, databases, stream reasoning and artificial intelligence. Our objective was to disseminate the recent foundational results in each of these isolated fields among all participants, to identify the open problems that need to be resolved, and to establish new research collaborations among these fields”.
What were the big trends and takeaways gathered by those brilliant minds? Let’s find out with Sabri!
The Big Trends
This seminar is a bit particular, as it does not show any trends but rather gives a picture of all the communities working on CER in one way or another. I was fascinated by the diversity of researchers. I did not expect to see such a rich variety of fields: knowledge representation, spatial reasoning, logic-based reasoning, data management, learning-based approaches, event-driven processing, process mining, database theory, stream mining, and more. In my view, the composite event recognition models that are best at recognising complex events include:
- Data flow model
- Ontology-based and reasoning model
- Symbolic reasoning model
- Automata-based model
We also identified common challenges across these models and communities. The three priority topics areas we identified are:
- Expressivity: composability & hierarchies
- Evaluation strategy, parallelization and distribution
- Uncertainty management
Kurt Rothermel from TU Stuttgart – Time-sensitive Complex Event Processing
My first reaction to load shedding was: “It is useless, since customers do not want to lose any event; that is why so much effort is spent today on exactly-once semantics…“. However, there is a trend today in stream processing towards trading off cost, latency, and correctness. Tyler Akidau described this challenge as a choice between one of three propositions: fast and correct, cheap and correct, or fast and cheap. Tyler was talking about streaming, but the same rule applies in a CEP context. The load shedding strategy directly falls under the third proposition. In this perspective, Kurt’s work is highly relevant.
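To make the trade-off concrete, here is a toy load-shedding sketch (my own illustration, not Kurt's actual technique): arrivals beyond a bounded queue's capacity are simply dropped instead of queued, which keeps latency bounded at the price of lost events.

```python
from collections import deque

def process_with_shedding(stream, capacity, consume_rate):
    """Toy load shedder: a consumer drains `consume_rate` events per tick;
    arrivals that find the queue full are shed (dropped) rather than
    buffered, trading correctness for bounded latency."""
    queue = deque()
    processed, dropped = 0, 0
    for tick_events in stream:          # stream = iterable of per-tick batches
        for e in tick_events:
            if len(queue) < capacity:
                queue.append(e)
            else:
                dropped += 1            # shed: queue is full
        for _ in range(min(consume_rate, len(queue))):
            queue.popleft()
            processed += 1
    return processed, dropped

# Two ticks of 5 arrivals each, but the operator only drains 2 per tick.
processed, dropped = process_with_shedding([[1] * 5, [1] * 5],
                                           capacity=3, consume_rate=2)
print(processed, dropped)  # 4 processed, 5 shed
```

A real shedder would drop selectively (e.g. by event utility or QoS class) rather than blindly, which is exactly where time-sensitive CEP research comes in.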
Jacopo Urbani & Fredrik Heintz – Stream Reasoning
Concretely, stream reasoning is incremental reasoning over rapidly changing information. The tutorial opened new perspectives on stream processing for me. It tried to answer a very interesting question: how can you provide reasoning about context from streams of data? I come from the database and event-based systems communities, and I did not know at all that stream reasoning was so mature. This community has evolved from a continuous version of SPARQL to a complete distributed stream reasoning semantics. It is interesting to see that the work we have done on the LEAD algebra and semantics is deeply inspired by this community. However, we have never used any reasoning logic on top of LEAD. After a few hours of the tutorial, I realised that (1) reasoning can be used for query rewriting and optimisation, and (2) it is worth evaluating at least BigSR, the LARS implementation on Flink.
Avigdor Gal & Ruben Mayer – Distributed and Event-Based Systems
Avigdor is a kind of pop star for the stream processing and distributed systems community, or at least for me! The papers he published about a probabilistic CEP engine with late arrival and event uncertainty were visionary.
The speakers started by explaining the basics of stream processing, then went deeper into the event recognition language and architecture. They detailed pub/sub applied to event recognition and explained the data flow model, which consists of a single unified data processing model where the stream and batch paradigms are the same. This last part was based on Tyler Akidau’s paper.
A second part of the talk focused on elasticity on streams. Stream fission splits processing among different categories of operators:
- Firstly, key-based operators, i.e. a group-by operation (as in SQL)
- Secondly, window-based operators make it possible to split processing that needs multiple event types correlated with different keys within the same operator
- Finally, pane-based operators enable a split-merge strategy where you distribute and merge the result.
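A minimal sketch of the first category, key-based fission, assuming a simple hash router (the names are mine, not an actual framework API): every event is routed to a parallel operator instance by hashing its key, so all events sharing a key land on the same instance.

```python
import zlib

def key_based_fission(events, key_fn, n_partitions):
    """Route each event to one of n parallel operator instances by
    hashing its key; events with the same key always go to the same
    instance, which is what makes a distributed group-by correct."""
    partitions = [[] for _ in range(n_partitions)]
    for event in events:
        # crc32 gives a stable hash across runs, unlike Python's hash()
        idx = zlib.crc32(str(key_fn(event)).encode()) % n_partitions
        partitions[idx].append(event)
    return partitions

events = [("sensor-1", 10), ("sensor-2", 7), ("sensor-1", 12)]
parts = key_based_fission(events, key_fn=lambda e: e[0], n_partitions=4)
```

Window- and pane-based fission generalise this by also splitting along time, at the cost of a merge step to reassemble per-window results.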
Interestingly, Avigdor presented his work about late-arrival processing from a probabilistic viewpoint and not from the watermark perspective. Usually, modern stream processing frameworks use watermarks in order to take into account events that arrive later. Avigdor presented a probabilistic approach to this issue.
What are late-arrival events?
Imagine we want to count the number of cars entering a road segment every three minutes: we have a “tumbling window” of 3 minutes. If an event (i.e. a car) arrives at 2’55” within the window but is stuck somewhere in the network for 6 seconds, it is called a late-arrival event. The processing time (the time at which the CEP engine processes the event) is delayed compared to the event time (the time at which the event really occurred).
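The car-counting example can be sketched in a few lines of Python (the timestamps are invented for illustration): the same late event lands in different windows depending on whether we bucket by event time or by processing time.

```python
from collections import Counter

def count_cars(events, window_sec=180, use_event_time=True):
    """Count cars per 3-minute tumbling window. Each event carries both
    an event time (when the car passed) and a processing time (when the
    engine saw it); picking the wrong one misplaces late arrivals."""
    counts = Counter()
    for event_time, processing_time in events:
        t = event_time if use_event_time else processing_time
        counts[t // window_sec] += 1   # integer division = window index
    return dict(counts)

# A car passes at t=175s (window 0) but is stuck in the network for 6s,
# reaching the engine at t=181s, which falls into window 1.
events = [(10, 10), (175, 181)]
print(count_cars(events, use_event_time=True))   # {0: 2}
print(count_cars(events, use_event_time=False))  # {0: 1, 1: 1}
```

Watermarks (or Avigdor's probabilistic approach) are precisely mechanisms for deciding how long window 0 should stay open waiting for such stragglers.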
Note that for CEP, there is clearly a trade-off between timeliness and accuracy: slack time increases the delay before delivering your result but improves your accuracy. There is always a trade-off between cost, latency, and correctness, and usually you can only pick two of the three.
Fun fact: if you need to explain event time and processing time to your mother (yeah, don’t underestimate the power of this kind of discussion at Christmas dinner), the best way is the Star Wars analogy. From an event time perspective (the time at which the story really happened), you should follow episodes 1, 2, 3, 4, 5, 6, 7, 8, 9. But if you take the processing time (the time at which we received the episodes), it is 4, 5, 6, 1, 2, 3, 7, 8, 9. Isn’t it great?!
CER has been explored from many viewpoints. However, never before in research history had there been a meeting gathering representatives of all these communities. This was the objective of this seminar. Having all these people in a castle in the middle of nowhere was a blast! I had very passionate discussions during meals, but also at night in the library, with the most brilliant brains in stream processing and CEP. I also had some fun discussions comparing Star Trek Discovery and Picard! Finally, the most important things I will remember after this seminar… are the endless ping-pong games with Till Rohrmann and Alessandro Margara :-).
Last December, Eura Nova’s research centre held the fourth workshop on real-time and stream analytics in big data at the 2019 IEEE Conference on Big Data in Los Angeles. The workshop brought together leading players including Confluent, Apache Pulsar, the University of Virginia and Télécom ParisTech, as well as 8 renowned speakers from 6 different countries. We received more than 30 applications, and we are proud to have hosted such interesting paper presentations in stream mining, IoT, and industry 4.0.
The workshop was a real success with many interesting questions and comments. If you could not attend, our R&D engineer Syrine Ferjaoui brought back important elements from the presentations for you.
First keynote speaker:
First of all, the workshop started with the keynote of Matteo Merli, PMC member at Apache Pulsar. His talk “Messaging and Streaming” explained how Pulsar can be a unified infrastructure that supports messaging and streaming.
Matteo introduced messaging as events that are being created and streaming as analysing events that just happened. These are two different processing concepts but they need a single infrastructure. He then explained the architecture view of Pulsar, which has separate layers between the brokers and the bookies (BookKeeper instances that handle persistent storage of messages). This means that brokers and bookies can be added independently, traffic can be shifted very quickly across brokers, and new bookies will ramp up on traffic quickly. This segmented distribution makes the architecture of Pulsar more flexible and dynamic.
Pulsar has other interesting features such as durability, low latency, high throughput, high availability, a unified messaging model, high scalability, and native computing. The roadmap includes working on the Pulsar storage API to allow direct access to data stored in Pulsar and to retrieve and process data more efficiently. The team is also working on higher-level messaging features.
Second keynote speaker:
The second keynote was given by John Roesler, a Kafka committer at Confluent. He talked about Kafka Streams and the evolution of streaming paradigms.
To design software, we developers used to separate the application logic from the database. To scale database capacity, we then started to use a search index to run ETL jobs and query the database quickly and efficiently. However, this created bugs in the software, added data consistency issues, and made the system more complex. Later, we started to use HDFS for a more flexible design. While enabling replication and distributed storage, this solution added more latency and supported batch processing only. It did not meet the needs of real-time processing use cases.
At this point, streaming helped a lot. The next step was to add a streaming platform that reads from sources, does some computation, and sinks the result somewhere else. The Kafka Streams design is a set of multiple stateful lambda functions, which makes it a good fit for a microservices architecture. With Kafka Streams’ new updates, the app logic is linked to a relational database with ACID guarantees.
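Kafka Streams itself is a Java library, but the core idea of a stateful stream function can be sketched in a few lines of Python (a didactic analogue, not the actual API): consume a stream of (key, value) records, update a local state store, and emit the updated aggregate downstream.

```python
def stateful_count(stream):
    """Sketch of a stateful streaming operator in the Kafka Streams
    spirit: fold each (key, value) record into a local state store
    (a KTable analogue) and emit the updated aggregate as a changelog."""
    state = {}                        # local state store, one entry per key
    for key, value in stream:
        state[key] = state.get(key, 0) + value
        yield key, state[key]         # downstream sees every state update

updates = list(stateful_count([("a", 1), ("b", 2), ("a", 3)]))
print(updates)  # [('a', 1), ('b', 2), ('a', 4)]
```

In the real system the state store is backed by a changelog topic, which is what gives the "app logic linked to a database" feel John described.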
Finally, John Roesler considers that “software is a fractal”, a never-ending pattern: a software architecture is complex and even when we zoom into a single component, it is still complex. But for the Kafka Streams’ design, when we zoom out, it looks like a set of services interacting and connected to each other and this simplifies the aforementioned designs.
John concluded by mentioning open problems that can be dealt with in stream processing, including semantics, observability, operability, and maintainability.
Workshop Invited Speakers:
After the keynotes, 8 selected papers were presented, covering mainly these 6 topics: (1) Stream Processing for IoT, (2) Serverless and HPC (High Performance Computing), (3) Collaborative Streaming, (4) Stream Mining, (5) Image Mining and (6) Real-time Machine Learning. Some papers are not yet available, as they will be published in the proceedings of the IEEE Big Data Conference. In the meantime, do not hesitate to contact our R&D department at email@example.com to discuss how you can leverage stream processing in your projects.
- Scalable and Reliable Multi-Dimensional Aggregation of Sensor Data Streams (Sören Henning, Wilhelm Hasselbring)
Sören and Wilhelm are engineers in the Software Engineering Group from Kiel University. They propose a stream processing architecture which allows for aggregating sensors in hierarchical groups, supports multiple hierarchies in parallel, provides reconfiguration at runtime, and preserves the scalability and reliability qualities of streaming.
- Performance Characterization and Modeling of Serverless and HPC Streaming Applications (Andre Luckow, Shantenu Jha)
Andre Luckow, head of Blockchain and Emerging Technologies at BMW Group, and Shantenu Jha, associate professor at Rutgers University, presented StreamInsight, which provides insight into the performance of streaming applications and infrastructure, their selection, configuration, and scaling behaviour.
- Collaborative Streaming: Trust Requirements for Price Sharing (Tobias Grubenmann, Daniele Dell’Aglio, Abraham Bernstein)
The paper was written by Tobias Grubenmann, researcher at The University of Hong Kong, in collaboration with Daniele Dell’Aglio and Abraham Bernstein, researchers at the University of Zurich. They present Collaborative Stream Processing (CSP), a model where the costs, which are set exogenously by providers, are shared between multiple consumers, the collaborators. To this end, they identify the key requirements for CSP to establish trust between the collaborators and propose a CSP algorithm adhering to these requirements.
- Kennard-Stone Balance Algorithm for Time-series Big Data Stream Mining (Tengyue Li, Simon Fong, and Raymond Wong)
Tengyue Li and Simon Fong (researcher and associate professor at the University of Macau, China) and Raymond Wong (associate professor at UNSW Sydney) worked on the Kennard-Stone Balance algorithm used as a new data conversion method. Training a prediction model effectively using big data streams poses certain challenges in machine learning. In this paper, the authors apply the Kennard-Stone algorithm to time series to extract a meaningful representation of big data streams, which improves the performance of a machine learning model.
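For flavour, here is a minimal sketch of the classic Kennard-Stone selection step (a generic illustration, not the authors' time-series balancing variant): seed with the two most distant points, then repeatedly add the point whose nearest already-selected neighbour is farthest away, so the chosen subset spans the data evenly.

```python
def kennard_stone(points, k):
    """Select k representative indices from `points` (tuples of floats)
    using the Kennard-Stone max-min distance criterion."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Seed with the most distant pair of points.
    i, j = max(
        ((i, j) for i in range(len(points)) for j in range(i + 1, len(points))),
        key=lambda p: dist(points[p[0]], points[p[1]]),
    )
    selected = [i, j]
    while len(selected) < k:
        rest = [p for p in range(len(points)) if p not in selected]
        # Add the point farthest from its nearest selected neighbour.
        nxt = max(rest, key=lambda p: min(dist(points[p], points[s])
                                          for s in selected))
        selected.append(nxt)
    return selected

pts = [(0, 0), (10, 0), (5, 5), (1, 1)]
print(kennard_stone(pts, 3))  # [0, 1, 2]
```

The point (1, 1) is skipped because it sits close to an already-selected point, which is exactly the balancing behaviour the paper exploits on data streams.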
- Assessing the Effects of TV Ad Events on Digital Search: On the Selection of Outcome Measures (Shawndra Hill, Anthony Colas, H. Andrew Schwartz, and Gordon Burtch)
Shawndra Hill (Microsoft), Anthony Colas (University of Florida), H. Andrew Schwartz (State University of New York at Stony Brook) and Gordon Burtch (University of Minnesota) explained their work on the interactions between TV content and online behaviours such as response to digital advertising. They developed AdMiner, a tool that can track online activity around a brand and provide actionable insights into ad campaigns.
- MLK Smart Corridor: An Urban Testbed for Smart City Applications (Austin Harris, Jose Stovall, and Mina Sartipi)
Austin Harris, Jose Stovall, and Mina Sartipi (researchers and CUIP director at the University of Tennessee at Chattanooga) have helped to create Chattanooga’s smart corridor, used to test new technologies and generate data-driven outcomes. In their talk, they presented the corridor, used as a test bed for research in smart city developments in a real-world environment. The wireless communication infrastructure and network of sensors in combination with data analytics provide a means of monitoring and controlling city resources and infrastructure in real time.
- Image Mining for Real Time Quality Assurance in Rapid Prototyping (Sebastian Trinks and Carsten Felde)
Sebastian Trinks and Carsten Felde (TU Bergakademie Freiberg) presented how image mining can help avoid errors and low-quality printed prototypes in real time. This can save resources and increase efficiency when developing new products.
This year, IEEE Big Data held the Real-time Machine Learning Competition on Data Streams. As the competition is focused on streaming, its online platform required a specific infrastructure that meets data stream mining requirements. Dihia Boulegane is a Ph.D. student at Télécom ParisTech working in collaboration with Orange Labs on machine learning for IoT network monitoring. She was in charge of implementing the streaming engine of the competition’s dedicated platform. Dihia explained its components, the technologies used, and the challenges met in building the platform. In the end, the platform was able to serve multiple streams to multiple users, receive and process the streams they sent back, and provide the leaderboard and live results.
Special thanks to our keynote guests, Matteo Merli and John Roesler, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!