Third Workshop on Real-Time and Stream Analytics in Big Data: key takeaways
Last month, EURA NOVA research centre organised the third workshop on real-time and stream analytics in big data, collocated with the 2018 IEEE conference on big data in Seattle. The workshop brought together leading actors in the field, including data Artisans, the University of Virginia and Télécom ParisTech, as well as 9 well-known speakers from 6 different countries. We received more than 30 applications and we are proud to have hosted such interesting presentations of papers in data architecture, stream mining, complex event processing and IoT.
The workshop was a real success, with captivating talks and a lot of interesting questions and comments. If you could not attend the event, our R&D engineer Syrine Ferjaoui has brought back for you the important elements from the keynotes and the presented papers.
First keynote speaker:
First of all, the workshop started with the keynote of Fabian Hueske, PMC member at Apache Flink & co-founder of data Artisans. His talk “Unified Processing of Static and Streaming Data with SQL on Apache Flink” presented Flink’s features and its unified relational APIs for batch and streaming data. Fabian Hueske insisted on the importance of unifying stream and batch processing for 2 major reasons: usability and portability. Flink includes a set of features such as materialised views to speed up analytical queries, dynamic tables, update propagation and processing, continuous queries, approaches to handle time in stream processing, watermarks and queries on infinite-sized tables. With all these features, Flink helps its users to build data pipelines with low-latency ETL and stream & batch analytics, and to power live dashboards.
Our research director Sabri Skhiri adds: “Apache Flink is currently working on a set of connectors. They already have the HDFS sink and the JDBC sink, and since they are pushing Flink as the standard technology for data pipelines and materialised views, they want to expand their set of connectors.”
Second keynote speaker:
Secondly, our research director Sabri Skhiri talked about data management, and stream and real-time analytics. His talk “The challenge of Data Management in the Big Data Era & its underlying Enterprise architecture shift” started with defining data architecture as a global plan depicting how to collect, store, use and manage data to answer the 8 main challenging questions that are essential to building a solid and efficient solution. During his talk, our director considered deriving microservices from data streams as the new wave of architecture and he discussed the Data Architecture Vision (DAV) set throughout 10 years of research and development at EURA NOVA. The DAV later led to the development of digazu, a data engineering platform containing all the different components needed to collect, store, govern, transform, and analyse all the data in the company’s IT environment.
Workshop Invited Speakers:
After the keynotes, 9 selected papers were presented, mainly covering these 4 topics: (1) Data Streaming Architecture, (2) CEP/CER, (3) Stream Mining & (4) IoT Device Integration:
- A Scalable and Robust Framework for Data Stream Ingestion (Isah and Zulkernine):
Isah and Zulkernine (Queen’s University, Kingston, Canada) propose a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform. Our R&D engineer Syrine Ferjaoui explains: “The ingestion layer (that integrates Apache NiFi and Kafka) is used to decouple streaming analytics layers (acquire, buffer, pre-process, distribute data streams). This NiFi-Kafka “NiFKaf” integration takes advantage of the high configurability of NiFi and of the ability to add several data consumers provided by Kafka. This way, it supports many data sources, languages and content formats, ensures high throughput and low latency, supports large numbers of data consumers, enables data buffering during temporary spikes in workload, employs a replay mechanism, and is scalable”.
- Edge Computing Architecture to Support Real Time Analytic Applications (Trinks & Felden):
The paper by Trinks & Felden (TU Bergakademie Freiberg, Germany) presents Edge Computing, an extension of the cloud computing approach. It describes an architecture scheme that consists of 3 layers: a node layer (gadgets, smartphones, embedded systems, sensors), an edge layer (routers, switches, small/macro base stations) and a cloud layer (datacenters, servers, databases, storage). Edge Computing is used to minimise energy consumption, bandwidth usage and latency, and to increase the level of safety and privacy; it employs real-time analytics within its architecture.
- Distributed Real Time Link Prediction on Graph Streams (Katragadda, Gottumukkala, Pusala, Raghavan, Wojtkiewicz):
Link prediction refers to the likelihood of a link appearing in the future based on the current status of a graph. Previous approaches to link prediction, such as sketch-based approaches and dynamic attributed networks, do not give exact results and cannot handle deletions or modifications in the graph, nor large volumes of data. The goal of the authors (University of Louisiana, USA) is to design a graph-processing approach for link prediction that ensures real-time prediction and extraction of accurate features from the graph with exact results. Syrine details: “Graph processing can be edge-centric, vertex-centric or neighbourhood-centric. This paper proposed two new graph processing frameworks for handling graph streams: vertex-centric processing & neighbourhood-centric processing. These frameworks are able to predict 100% of the links with an average graph ingestion time between 149.3 and 242.7 ms”.
- DisPatch: Distributed Pattern Matching over Streaming Time Series (Hamooni and Mueen):
Researchers from the University of New Mexico have developed a robust distributed matching system, called DisPatch. In a scenario where multiple data sources or producers publish data to the Kafka system, DisPatch is the data consumer that matches a pattern with a guaranteed maximum delay after the pattern appears in the stream. Syrine reacts: “Given a time series T of length n, and a query Q of length m, it normally takes O(nm) to calculate the Euclidean distance/correlation between Q and all subsequences of T, but this method calculates the results in O(log(n)) by exploiting the overlaps. As a result, DisPatch guarantees exactness and bounded delay at the same time”.
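For reference, the O(nm) baseline mentioned in the quote can be sketched as follows. This is only the naive computation that the quote uses as a point of comparison, not the DisPatch algorithm itself; the function name and the sample data are illustrative assumptions.

```python
import math


def sliding_euclidean(series, query):
    """Naive baseline: Euclidean distance between `query` (length m) and
    every length-m subsequence of `series` (length n), in O(n * m) time."""
    n, m = len(series), len(query)
    distances = []
    for start in range(n - m + 1):
        window = series[start:start + m]
        squared = sum((w - q) ** 2 for w, q in zip(window, query))
        distances.append(math.sqrt(squared))
    return distances


# Illustrative data (not from the paper).
series = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0]
query = [1.0, 2.0, 1.0]
dists = sliding_euclidean(series, query)

# Index of the best-matching subsequence.
best = min(range(len(dists)), key=dists.__getitem__)
```

Consecutive windows share all but one point, which is the overlap that faster methods exploit instead of recomputing each distance from scratch.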
- Using Information in Access Logs for Large Scale User Identity Linkage (Jalali, Krishnamoorthy, and Biswas):
In this paper, the authors (Adobe Research, California, USA) discuss Adobe’s Identity Graph that provides a comprehensive solution to the challenge posed by fragmentation of identities. Our R&D engineer details: “Identity graph helps in connecting data across channels, domains and devices to solve a fundamental problem in the Digital Marketing domain. The fragmented profiles of a consumer are linked together in order to provide a unified view across devices. This means that an identity graph connects all the known identifiers that correlate with the individual consumer. The researchers built identity relationships by using both online data traffic and offline CRM data logs from customer’s backend systems. To do that, they are using two approaches: deterministic linking and probabilistic linking. They combined them using deterministic as a base and expanding using probabilistic clusters”.
- Streaming Algorithm for Big Data Logistic Regression (Yang, Wang, Xu & Zhang):
The authors (Purdue University, USA) propose a novel fitting algorithm for big data logistic regression by combining Fisher Scoring and IRWLS. Syrine details: “The revised IRWLS algorithm can break the memory barrier and is suitable for streamed computing. It is updatable per row and does not need to load the whole dataset into memory. This algorithm has a fast convergence speed (usually around 3). The limitation of this method is that it targets structured data with large n (rows) and small p (columns)”.
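To illustrate why a row-wise update keeps memory usage independent of n, here is a minimal sketch of textbook Fisher scoring / IRWLS for logistic regression in which each pass consumes one row at a time and only keeps a p x p matrix and a length-p vector. It is a generic sketch under stated assumptions, not the authors' revised algorithm; `read_rows`, `solve`, the iteration count and the toy data are all hypothetical.

```python
import math


def solve(a, b):
    """Solve the small p x p system a @ x = b by Gaussian elimination
    with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for c in range(col, p + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (m[r][p] - s) / m[r][r]
    return x


def fit_logistic_streaming(read_rows, p, n_iter=8):
    """Fisher scoring / IRWLS. `read_rows()` must return a fresh iterator
    over (features, label) pairs; rows are never held in memory together."""
    beta = [0.0] * p
    for _ in range(n_iter):
        xtwx = [[0.0] * p for _ in range(p)]  # accumulates X^T W X
        score = [0.0] * p                     # accumulates X^T (y - mu)
        for x, y in read_rows():
            eta = sum(xj * bj for xj, bj in zip(x, beta))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                score[j] += x[j] * (y - mu)
                for k in range(p):
                    xtwx[j][k] += w * x[j] * x[k]
        step = solve(xtwx, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta


# Toy data (illustrative): first feature is the intercept term.
rows = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 0.0], 1), ([1.0, 0.0], 0),
        ([1.0, 1.0], 1), ([1.0, 2.0], 1), ([1.0, -1.0], 1), ([1.0, 1.0], 0)]


def read_rows():
    # Stands in for re-reading a file or replaying a stream.
    return iter(rows)


beta = fit_logistic_streaming(read_rows, p=2)
```

The accumulators grow with p only, which is consistent with the remark that the approach suits data with many rows and few columns.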
- Efficient Dynamic Time Warping for Big Data Streams (Martins & Kerren):
Dynamic Time Warping (DTW) is able to match natural time series with similar shapes, but a different length of patterns. The authors (Linnaeus University, Sweden) described enhancements to the DTW algorithm that allow it to be used efficiently in a streaming scenario. Syrine explains: “Their solution is composed of 3 parts: (1) a very fast implementation of the DTW (2) an append operation for the DTW which works in linear or constant time and (3) an approximation of a sliding window that allows DTW to forget old time steps, improving the processing of “never-ending” streams. In short, DTW encapsulates all data behaviour information in a single value and enables the use of a tiny fraction of data compared to the original sensed data while still obtaining highly accurate results”.
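As background, the classic batch DTW distance can be sketched in a few lines. This is the textbook dynamic-programming formulation, not the authors' streaming variant (it has no append operation or sliding window); the function name and the sample series are illustrative assumptions.

```python
def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences, using
    absolute difference as the local cost and O(len(b)) memory."""
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning the rows seen
    # so far with the first j elements of b.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


# Same shape, different lengths: the second series is a stretched copy.
short = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 1.0]
same_shape = dtw_distance(short, stretched)

# Two flat series at different levels.
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

The first comparison shows the property described above: series with the same shape but different lengths can be matched at zero cost.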
- Using Candlestick Charting and Dynamic Time Warping for Data Behavior Modeling and Trend Prediction for MWSN in IoT (Concepcion Aleman, Pissinou, Alemany, Kamhoua):
There is a rapid emergence of new applications involving mobile wireless sensor networks (MWSN) in the field of Internet of Things (IoT). Although useful, MWSN still carry the restrictions of having limited memory, energy, and computational capacity. At the same time, the amount of data collected in the IoT is exponentially increasing. The authors (Florida International University, USA) propose Behavior-Based Trend Prediction (BBTP), a data abstraction and trend prediction technique designed to address the limited memory constraint in addition to providing future trend predictions. Predictions made by BBTP can be employed by real-time decision-making applications and data monitoring.
- A multi-dimensional extension of the Lightweight Temporal Compression method (Li, Sarbishei, Nourani, and Glatard):
Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. In this paper, Li, Sarbishei, Nourani and Glatard (Concordia University & Motsai Research, Canada) investigate the extension of LTC to higher dimensions. Syrine adds: “They described how multi-dimensional LTC compression saves substantial amounts of energy (up to 20%) and is feasible on connected objects. The implementation with the Euclidean norm is more intuitive than the one with the infinity norm for nD sensors, but it is more CPU & memory intensive and leads to lower compression ratios”.
Special thanks to our keynote speaker Fabian Hueske, and all the attendees and speakers! We are looking forward to an even more successful workshop in the coming edition of the IEEE Big Data Conference. Stay tuned for paper submission dates!
7 Publications in 2018
At EURA NOVA, we believe investing in research allows us to continuously become more proficient, to maintain our know-how at the cutting edge of IT, and to share its benefits with our customers. As we look back on the year 2018, we are both proud and happy to announce that our R&D department has published 7 papers this year:
Firstly, our paper “Pairwise Image Ranking with Deep Comparative Network” was published at the 26th European Symposium on Artificial Neural Networks. The paper, written by our Lead R&D engineer Aymen Cherif and Salim Jouili, discusses how using the pair-wise ranking model can provide better results for instance-level image retrieval.
Aymen Cherif, Salim Jouili, Pairwise Image Ranking with Deep Comparative Network. ESANN 2018: ES2018-200
Secondly, our R&D engineer Cécile Pereira co-authored a paper published in Bioinformatics in May 2018. The authors propose a novel end-to-end deep learning approach for biomedical NER tasks that leverages local contexts based on n-gram character and word embeddings via a Convolutional Neural Network.
Qile Zhu, Xiaolin Li, Ana Conesa, Cécile Pereira, GRAM-CNN: A deep learning approach with local context for named entity recognition in biomedical text, Bioinformatics – May 2018
In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference. She presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics.
Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2018
Moreover, our paper “Graph BI & Analytics: Current State and Future Challenges” was accepted for publication and presented at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Germany in September. The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing.
Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri, Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018: 3-18
In September as well, our paper Data Mining and Machine Learning Techniques supporting Time-based Separation Concept Deployment, co-written with Eurocontrol and WaPT, was presented at the 37th Digital Avionics Systems Conference (DASC) in London, U.K. The paper presents two methods to allow air traffic controllers to deliver separation minima accurately and safely, on the basis of time intervals instead of distances.
De Visscher, I.; Stempfel, G.; Rooseleer, F. & Treve, V.; Data mining and Machine Learning techniques supporting Time-Based Separation concept deployment, in 37th Digital Avionics Systems Conference (DASC), pp 594-603, London, UK, September 23-27, 2018
Finally, in October, our engineer Katsiaryna Krasnashchok presented her poster on the Hierarchical Attention-Based Neural Topic Model at the 6th International Conference on Statistical Language and Speech Processing. Our Lead R&D engineer Aymen Cherif and our bootcamper Luca De Petris also presented their poster on an LSTM Siamese Network.
Katsiaryna Krasnashchok, Salim Jouili, Hierarchical Attention-Based Neural Topic Model, SLSP 2018
Luca De Petris, Aymen Cherif, LSTM Siamese Network for Question Answering System, SLSP 2018
IEEE Big Data 2018: a summary
At the beginning of the month, our R&D director Sabri Skhiri and our R&D engineer Syrine Ferjaoui travelled to Seattle to attend IEEE Big Data. The conference is one of the most influential in this domain, gathering more than 1100 attendees, 5 keynotes, 9 tutorials, and 8 daily tracks in parallel. Back in Belgium, our R&D director gives you his opinion on the conference itself and the important elements from the keynotes, the tutorials, the workshops and the interesting papers.
Favourite Talks
Keynote 1: Decentralized Machine Learning – Google AI
The IEEE Big Data conference started with the inspiring keynote of Blaise Agüera y Arcas, a distinguished researcher at Google AI. Our director details: “The straightforward thesis of the talk is that we can, and we must, use the mobile device for local deep neural network computing. Blaise Agüera explained that since the launch of TensorFlow, Google Brain has built specialised hardware servers to efficiently run deep neural network computing jobs. Nowadays, we find on the market specialised chips that are smaller than a 1-cent coin and that cost less than a cappuccino. Using them, you can very efficiently run deep neural net computing jobs on mobile at low frequency, low energy and even continuously. For example, the Google camera embeds deep neural nets and does not need to send data to the server side for face or situation detection. But Dr Agüera y Arcas is going further. He works on reusing the existing techniques of distributed neural network training: the learned gradients are shared with a parameter server, which in turn shares them with all devices. This is what we call federated learning, and it has impacted many research areas, such as edge computing. The idea of edge computing is to execute light tasks on the edge of the network in order to offload the server/cloud. But here, this is changing the game since the nature of the job is not light anymore. In addition, the concept of federated learning does not try to offload the server but changes the role of the server into that of a coordinator between edge devices. Secondly, it has impacted neural net compression. The question is then: do we still need to compress networks when we can either distribute the neural net on the server side or have specialised chips on the device side?”
Keynote 2: Big Data for Speech and Language Processing – Microsoft Research
The second keynote speaker, Xuedong Huang, is a Microsoft Technical Fellow of Microsoft Cloud and AI. He presented the latest advances in speech recognition and Text To Speech (TTS). The key papers behind this technology can be found here and on the research group page. Our director explains: “The first part of the keynote was about the Microsoft live captioning that will soon be integrated natively in PowerPoint. That is just impressive. Everything that the speaker is saying is captured by the tool. I personally tested the Translator Android application and it works just fine! The second part of the keynote was focused on Text To Speech (TTS). The speaker showed a set of very interesting examples of how voice can be modelled. For instance, if the system learns a model out of hours of discussions, it can apply my voice in Chinese or Arabic, or it can learn from a group of people in order to get a better accent and expression”.
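The coordination pattern described in the quote can be sketched on a toy problem: each device computes a gradient on its own data, and the server only averages those gradients and sends the updated parameter back. This is a deliberately simplified illustration, not Google's implementation; the one-parameter model, the learning rate and the device data are all assumptions made for the example.

```python
def local_gradient(w, data):
    """Gradient of the mean squared error for the model y ~ w * x,
    computed on the device from its private (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def federated_round(w, devices, lr=0.05):
    """One round: the server receives gradients only, never raw data,
    averages them and returns the updated shared parameter."""
    grads = [local_gradient(w, data) for data in devices]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


# Hypothetical private datasets, all generated from y = 3 * x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]

w = 0.0
for _ in range(200):
    w = federated_round(w, devices)
```

In this sketch the server acts purely as a coordinator, which matches the change of role described in the quote; production systems typically add several local training steps per round and weight devices by their amount of data.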
The Tutorials
This year, IEEE Big Data organised 9 tutorials. Our R&D director explains: “This is probably what I like the most at an academic conference. A research group presents a complete state-of-the-art review in their domain and usually positions their own work in the story. My favourite was Progress in Zeroth Order Optimization and Its Applications to Adversarial Robustness in Deep Learning. It was one of the coolest research topics I have seen so far. They discussed how you can fool a deep neural network in order to get a wrong classification. The idea is great: finding the minimal noise you can add to a picture in order to increase the probability of a wrong classification. In this setting, you don’t know anything about the classifier, but you can submit images and you will get a label. Indeed, that looks like a black box optimisation setting. That is precisely why they use Zeroth order optimisation. The research topic is so cool, you can manage to fool the classifier to make it recognize a piano in an image picturing a bagel! Can you imagine the impact, in the era of the electronic passport, where image recognition starts to be used in the signature process? What if I can find how to fool an algorithm to be classified as someone else with just a few grey pixels on my picture?”
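To make the black-box setting concrete, here is a minimal sketch of zeroth-order optimisation: the gradient is never computed analytically, it is estimated from function evaluations only. The quadratic `black_box` function is a hypothetical stand-in for a classifier's loss, and the step sizes are arbitrary assumptions; the tutorial's actual attack methods are more elaborate.

```python
def zeroth_order_gradient(f, x, h=1e-4):
    """Estimate the gradient of a black-box function f at point x using
    only function values (central finite differences per coordinate)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def minimise(f, x, lr=0.1, steps=200):
    """Plain gradient descent driven by the estimated gradient."""
    for _ in range(steps):
        g = zeroth_order_gradient(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def black_box(x):
    # Hypothetical objective: we may query it, but not inspect it.
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


x_min = minimise(black_box, [0.0, 0.0])
```

Each gradient estimate costs two queries per coordinate, which is why query efficiency matters when the input is an image with many pixels.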
The Workshops
EURA NOVA research centre organised the third workshop on Real-time and Stream analytics in Big Data, collocated with the 2018 IEEE conference on Big Data. Our Research Director Sabri Skhiri talked about data management, and stream and real-time analytics. Thank you to our keynote speaker Fabian Hueske, and all the attendees and speakers! They had a great time, with captivating talks and a lot of interesting questions and comments. The summary of the event is available on our website. The slides of the opening session and the slides of the second keynote are available here.
Final Feelings
In the early age of the conference, IEEE Big Data was mainly focused on the big data infrastructure. In the following years, the conference became data science oriented, with a significant increase in the number and the complexity of data science use cases. When we asked how he felt about the event, Sabri explained: “I have been attending this conference since the first occurrence. The most important shift I have seen is really about the content. This year, the infrastructure papers have almost disappeared. On the other hand, the vast majority of the publications are on data science. We can really see that it is becoming a conference for ML practitioners. The side effect is the complexification of the discussed topics. Machine learning notions are supposed to be known, deep neural networks are becoming the norm. Going further, the authors are also good at using distributed frameworks, especially Spark. For them, the infrastructure is not a problem anymore, this is part of the daily job”.
The Papers
A personal selection of interesting papers:
- Learning Effective Embeddings for Machine-Generated Emails with Applications to Email Category Prediction: a nice paper from Google, Twitter and Facebook. The main intuition is that the B2C emails in your mailbox, and especially the sequences of such emails, can tell a lot about you, your interests and even your next actions. The idea is then to find a relevant embedding for emails. Interesting, especially in the context of B2C emails.
- Learning-based Automatic Parameter Tuning for Big Data Analytics Frameworks: the authors do not consider job features, so they need to re-train the performance model for each job. But the optimisation approach is interesting, especially the intuition behind space exploration and space exploitation, which has its roots in Reinforcement Learning.
- A Reinforcement Learning Based Resource Management Approach for Time-critical Workloads in Distributed Computing Environment: although the paper is not rocket science and even a bit naive, it represents a complete Reinforcement Learning modelling. They give a good illustration of the state, action and reward modelling.
- An Integrated Knowledge Graph to Automate GDPR and PCI DSS Compliance: the idea is to parse a law text such as the GDPR regulation and to create a knowledge graph (using an RDF triple representation) and then to be able to ask questions. In future work, the authors plan to check the compliance of your data privacy policy against the knowledge graph.
Improving Topic Quality by Promoting Named Entities in Topic Modeling
In July, our R&D engineer Katsiaryna Krasnashchok was in Melbourne, Australia to attend the ACL conference. She presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics.
News-related content has been extensively studied in both topic modeling research and named entity recognition. However, expressive power of named entities and their potential for improving the quality of discovered topics has not received much attention. In this paper, we use named entities as domain-specific terms for news-centric content and present a new weighting model for Latent Dirichlet Allocation. Our experimental results indicate that involving more named entities in topic descriptors positively influences the overall quality of topics, improving their interpretability, specificity and diversity.
Katsiaryna Krasnashchok, Salim Jouili, Improving Topic Quality by Promoting Named Entities in Topic Modeling, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2018.
Spark+AI Summit: a summary
A few weeks ago, Sabri Skhiri and Florian Demesmaeker were in London to attend the Spark+AI summit. They came back with a lot to say about the new features of Spark and the presented use cases! In this article, they will give you their opinion about Databricks’ main announcement, the takeaways from their favourite talks and training, and what they thought of the new name of the conference.
A new name
This year, Spark expanded the summit’s scope and renamed it “Spark + AI Summit”. The goal of Databricks, announced by its co-founder Ali Ghodsi, is to incorporate unified aspects of data and AI.
Florian Demesmaeker, our R&D engineer, explains: “In some of the keynote talks, the speakers talked about use cases where the job of the data engineer is strongly reduced. The data scientists can easily experiment with data, travelling back and forth in time. This means more focus on AI, rather than on the data engineering part that makes all data accessible to the data scientists”.
Main announcement
In line with this change of name, Databricks announced the release of a complete data science lifecycle on the cloud.
Sabri Skhiri, our R&D Director, explains: “It is interesting to see that the change in the event name is actually very visible in the change of Databricks’ strategy. Their tools are now completely dedicated to stream ETL, and there is a huge focus on integrated data management”.
Databricks’ new features include Databricks Delta, which creates data pipelines and provides data views and exploration features. Secondly, the Databricks Runtime ML is a ready-to-use environment providing a set of pre-loaded ML frameworks where the data scientist can play with data. Finally, the MLflow tool simplifies the development of ML models at enterprise scale.
Our R&D Director adds: “Together, these features provide a complete and unified approach to machine learning lifecycle and pipeline automation. This looks like a very competitive SaaS offer for integrated data management, available on AWS and Azure. However, metadata management and security are still the missing pieces”.
The training day
The first day of the conference was dedicated to training workshops that include a mix of instruction and hands-on exercises to help attendees improve their Apache Spark skills.
Florian gives insights into his favourite training Tuning and Best Practices. He explains: “The aim of the training was to make programmers aware of how Spark works internally, in order to be able to write optimised applications. They presented a few situations, each one showing one relatively slow process. Then they presented a step-by-step procedure to debug the situation and to find the points that could be improved in the current situation. In summary, tips and tricks to adapt to different situations”.
Favourite talks
The sessions at the conference covered data engineering and data science contents along with best practices for productionising AI. The talks were divided into roughly two categories: Spark programming and deployment, and applications on top of Spark (AI applications).
Florian Demesmaeker explains: “I attended 28 talks. The keynotes from Databricks were quite interesting, they presented Delta and MLflow. I also enjoyed the talks about tools to optimise the internals of Spark, these provided good technical details. Other talks were about use cases on top of Spark, it was interesting to see what challenges other companies face and how they address them”.
Sabri Skhiri adds: “The talk Learning to Rank Datasets for Search was very inspiring. Oscar Castañeda-Villagrán, a data scientist working at Xoom (a PayPal service), talked about learning to rank R data sets. The idea is that we can extract metadata when the data pipeline delivers data to the lake. Going further, you can not only extract metadata but also calculate a kind of relevance judgment score that will be used for bootstrapping the learning to rank process. In this way, a user can search and retrieve the relevant R data set in the lake. A very good idea for metadata-driven exploration”.
Flink Forward 2018: What You Want to Know and What You (Will) Need to Know.
In early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.
They came back with a lot to say about the hot topics in stream processing and the presented use cases! In this article, they will give you their opinion about data Artisans’ main announcement, the takeaways from their favourite talks, and what they think makes Flink Forward different from other conferences.
First keynote announcement:
During the keynote speech, data Artisans announced that they now bring ACID transactions directly on streaming data with data Artisans Streaming Ledger.
Charles Bonneau, our software architect, says: “This feature allows ACID transactions between multiple operators’ event-processing operations and internal states. This means that streaming applications can now update multiple states in one transaction. For example, an application that transfers money from one bank account to another can finally be implemented using Flink with strong consistency guarantees. Both bank accounts will have their balance updated at the same time as if there was a master data-management state”.
For Sabri Skhiri, our R&D director, this opens the doors to a brand new range of applications, especially in data-driven real-time services but also in streaming data management. He explains: “They are pushing forward the concept of streaming. Now, you could imagine a master data-management state that can be updated by operational streaming applications in real time. This will allow even more complex and advanced use cases of stream processing!”.
Favourite talks:
In 2 days, each Euranovian attended about 18 talks and use case presentations, with speakers from tech giants such as IBM, Netflix, Alibaba, and Uber as well as speakers from smaller companies.
Charles explains: “The conclusions are reassuring: most of them face the same issues that we see at our clients’ and our solutions are all valuable. They include a stream-first data architecture, a stream-first data pipeline product, and Flink developer skills. Even though a number of companies are at the very edge of the technology and their issues do not yet require continuous flows of a considerable amount of events, we are ready”.
For our R&D Director Sabri Skhiri, the keynote speech from Lightbend was one of the most interesting ones. He explains: “Viktor Klang, Lightbend deputy CTO, talked about the convergence between microservices and stream processing. At EURA NOVA, we have been advocating for this convergence for more than a year in our architecture practice. The idea is simple: asynchronous microservices can be designed as stream processing stages. This is fantastic because it makes modern stateful stream processing frameworks the perfect target for implementing reactive microservices. With stateful deployment, exactly once semantics, high availability and ACID access to states, microservices can become stateful streaming apps.”
Vision-oriented Flink Conference:
Our colleagues came back with sparkles in their eyes. When we asked them how they felt about the event, Sabri Skhiri explained:
“Very often, this type of conference tends to be business oriented. They are focused on how to make the framework easy to use and available to as many people as possible. By contrast, this year’s Flink Forward conference was all about innovation and vision. data Artisans shared their vision of what the Flink framework will be within 3 to 5 years and talked about what role stream processing and big data have within this vision. In fact, almost all the talks were very technical. They were testimonies of big names in the industry, such as Alibaba, Netflix, and ING, about problems encountered in the field and how they have been solved, which is often out of the box. The Flink-Alibaba partnership is a sharing one. Alibaba are way ahead with their technology. They keep their lead for 1 year and then they share their work and make their code open source. data Artisans have a great long-term vision of stream processing. I can see a lot of very interesting architecture discussions in the coming months!”
Stream Processing Technology:
While most frameworks cannot process large streams of live data and deliver results in real time, Flink provides a single, highly scalable runtime for both streaming and batch processing.
Cyrille Duverne, our Lead Data Architect, confirms: “Flink is definitely a real-time processor! We’re speaking about true real time, not only mini batches etc… Plus, the introduction of ACID transaction management in the new version of data Artisans’ Flink distribution creates a good marketing edge”.
Sabri Skhiri and our R&D engineer Florian Demesmaeker were at the Spark Summit this week. Stay tuned for part 2 with their feedback!
Data Mining and ML Techniques Supporting TBS Concept Deployment
Our paper “Data Mining and Machine Learning Techniques supporting Time-based Separation Concept Deployment”, co-written with Eurocontrol and WaPT, has been accepted by the 37th Digital Avionics Systems Conference (DASC) in London, U.K.
The paper presents two methods to allow air traffic controllers to deliver separation minima accurately and safely, on the basis of time intervals instead of distances.
Importantly, in strong headwind conditions, the aircraft’s groundspeed during approach decreases, meaning that keeping the distance-based separation method results in lower landing rates. At a time of intensified air traffic, this situation leads to considerable delays at airports with significant costs to operators and travellers.
With the new methods presented in the paper, capacity can increase by up to 14% in strong wind conditions, and by up to 8% in moderate wind conditions.
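The headwind effect described above can be made concrete with a small worked example. It is a deliberately simplified sketch: the airspeed, the distance minimum and the wind values are assumptions chosen for illustration, not figures from the paper, and a constant groundspeed stands in for the real aircraft speed profile. It does not reproduce either of the paper's two methods, nor the buffers they rely on.

```python
def time_spacing_s(distance_nm: float, groundspeed_kt: float) -> float:
    """Seconds the follower needs to cover the separation distance."""
    return distance_nm / groundspeed_kt * 3600.0


def landing_rate_per_hour(spacing_s: float) -> float:
    """Landings per hour if every pair is spaced by spacing_s seconds."""
    return 3600.0 / spacing_s


def tbs_distance_nm(time_minimum_s: float, groundspeed_kt: float) -> float:
    """Distance equivalent to a fixed time minimum at a given groundspeed."""
    return time_minimum_s * groundspeed_kt / 3600.0


# All values below are illustrative assumptions.
AIRSPEED_KT = 140.0         # constant approach airspeed
DBS_MINIMUM_NM = 4.0        # distance-based separation minimum
LIGHT_HEADWIND_KT = 5.0
STRONG_HEADWIND_KT = 30.0

light_gs = AIRSPEED_KT - LIGHT_HEADWIND_KT
strong_gs = AIRSPEED_KT - STRONG_HEADWIND_KT

# Time spacing that the distance minimum yields in light wind;
# used here as the time minimum to preserve.
reference_time_s = time_spacing_s(DBS_MINIMUM_NM, light_gs)

# Distance-based: same distance, lower groundspeed, so more time per pair.
dbs_strong_time_s = time_spacing_s(DBS_MINIMUM_NM, strong_gs)

# Time-based: same time, so the distance shrinks with the groundspeed.
tbs_strong_distance_nm = tbs_distance_nm(reference_time_s, strong_gs)

print("DBS rate, light wind :", landing_rate_per_hour(reference_time_s))
print("DBS rate, strong wind:", landing_rate_per_hour(dbs_strong_time_s))
print("TBS distance, strong wind (NM):", tbs_strong_distance_nm)
```

Under these assumptions, keeping the distance fixed stretches the time between landings as the headwind grows, whereas keeping the time fixed shortens the required distance and preserves the landing rate.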
The paper was presented in September at DASC 2018. If you wish to go deeper into the subject, do not hesitate to contact our research department at research@euranova.eu.
The abstract
The Time-Based Separation (TBS) concept consists in the definition of separation minima for aircraft on the final approach to a runway based on time intervals instead of distances, as applied in Distance-Based Separation (DBS) operations.
TBS allows for dynamic distance separation reductions in strong headwind conditions so as to preserve time spacing across all wind conditions. However, TBS application entails the use of a support tool providing separation distance indicators that depend on the applicable time separation minimum and on the aircraft speed profile, which itself depends on the headwind conditions.
This paper details two methodologies allowing a system to compute those TBS indicators so as to allow Air Traffic Controllers to accurately and safely deliver the TBS minima using a separation delivery support tool. The first approach is based on “analytical” data mining and modelling whereas the second one is based on a Machine Learning (M/L) procedure.
In the framework of the deployment of the TBS concept in Vienna airport (LOWW), those approaches are developed and tested using a database covering one year of traffic and corresponding local meteorological data.
The operation of TBS with indicators computed using either approach leads to a substantial reduction of time separations compared to a DBS strategy. However, given the large uncertainties related both to leader and follower aircraft speed profiles, the buffers could be designed only for the most frequent pairs. With the M/L approach (resp. the “analytical” approach), the capacity benefits related to the application of TBS with a separation support tool are of the order of 8% (resp. 2%) in moderate wind conditions, and up to 14% (resp. 10%) in strong wind conditions.
De Visscher, I.; Stempfel, G.; Rooseleer, F. & Treve, V.; Data mining and Machine Learning techniques supporting Time-Based Separation concept deployment, in 37th Digital Avionics Systems Conference (DASC), pp 594-603, London, UK, September 23-27, 2018
Third Workshop on Real-time & Stream Analytics in Big Data
EURA NOVA Research center is proud and excited to organize the third workshop on Real-time and Stream analytics in Big Data, collocated with the 2018 IEEE conference on Big Data. The workshop will take place in December in Seattle, USA.
As the world becomes more connected, a flood of digital data is being generated at high volume and high velocity. For industries such as financial markets, telecommunications, smart cities, manufacturing, or healthcare, there is an increasing need to process and analyze these data streams in real time.
Over the past two years, another use of stream and complex event processing has emerged: data management. New architecture patterns have been proposed to address data pipelines and data management within the enterprise.
After the success of the first two editions, this workshop is an excellent opportunity to engage in discussions with experts and researchers, and to refine new opportunities and use cases required by the industry.
Authors are invited to contribute to the conference by submitting articles in areas including, among others: Scalable real-time decision algorithms, IoT analytics & stream mining, Data pipelines & Data management with Streams, and Stream ETL and Real-Time Data Warehouse.
Want to submit a paper? Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and the industrial worlds.
Graph BI & Analytics: Current State and Future Challenges
Our paper “Graph BI & Analytics: Current State and Future Challenges” has been accepted for publication at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Regensburg, Germany.
The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing. We survey the topics of graph modelling, management, querying, and processing in graph warehouses. Then we conclude by discussing future research directions for solving complex graph problems, building native graph components and intelligent techniques to assist end-users in building and analysing the graph.
More importantly, the paper calls for the development of intelligent, efficient and industry-grade graph data warehousing systems to support the structure-driven management and analytics of data. While adopting a template similar to that of traditional BI systems, the graph BI presented here extends current systems with graph analytics capabilities that deliver graph-derived insights.
The paper was presented in September at DaWaK 2018; you can now find the full version here. If you wish to go deeper into the subject, don’t hesitate to contact our research department at research@euranova.eu.
Abstract. In an increasingly competitive market, making well-informed decisions requires the analysis of a wide range of heterogeneous, large and complex data. This paper focuses on the emerging field of graph warehousing. Graphs are widespread structures that yield a great expressive power. They are used for modeling highly complex and interconnected domains, and efficiently solving emerging big data applications. This paper presents the current status and open challenges of graph BI and analytics, and motivates the need for new warehousing frameworks aware of the topological nature of graphs. We survey the topics of graph modeling, management, processing and analysis in graph warehouses. Then we conclude by discussing future research directions and positioning them within a unified architecture of a graph BI & analytics framework.
Amine Ghrab, Oscar Romero, Salim Jouili, Sabri Skhiri, Graph BI & Analytics: Current State and Future Challenges. DaWaK 2018, 3-18