In July, our R&D engineer Katherine Krasnoschok was in Melbourne, Australia to attend the ACL conference. She presented her poster on topic modelling. Her paper, co-written with Salim Jouili, indicates that involving more named entities positively influences the overall quality of topics.
News-related content has been extensively studied in both topic modeling research and named entity recognition. However, expressive power of named entities and their potential for improving the quality of discovered topics has not received much attention. In this paper, we use named entities as domain-specific terms for news-centric content and present a new weighting model for Latent Dirichlet Allocation. Our experimental results indicate that involving more named entities in topic descriptors positively influences the overall quality of topics, improving their interpretability, specificity and diversity.
A few weeks ago, Sabri Skhiri and Florian Demesmaeker were in London to attend the Spark+AI summit. They came back with a lot to say about the new features of Spark and the presented use cases! In this article, they will give you their opinion about Databricks’ main announcement, the intakes of their favourite talks and training, and what they thought of the new name of the conference.
A new name
This year, Spark expanded the summit’s scope and renamed it “Spark + AI Summit”. The goal of Databricks, announced by its co-founder Ali Ghodsi, is to incorporate unified aspects of data and AI.
Florian Demesmaeker, our R&D engineer, explains: “In some of the keynote talks, the speakers talked about use cases where the job of the data engineer is strongly reduced. The data scientists can easily experiment with data, travelling back and forth in time. This means more focus on AI, rather than on the data engineering part that makes all data accessible to the data scientists”.
In line with this change of name, Databricks announced the release of a complete data science lifecycle on the cloud.
Sabri Skhiri, our R&D Director, explains “It is interesting to see that the change in the event name is actually very visible in the change of Databricks’ strategy. Their tools are now completely dedicated to stream ETL, and there is a huge focus on integrated data management”.
Databricks’ new features include Databricks Delta which creates data pipeline and provides data views and exploration features. Secondly, the Databricks Runtime ML is a ready-to-use environment providing a set of pre-loaded ML frameworks where the data scientist can play with data. Finally, the MLflow tool allows to simplify the ML models development at enterprise scale.
Our R&D Director precises: “Together, these features provide a complete and unified approach to machine learning lifecycle and pipeline automation. This looks like a very competitive SaaS offer for integrated data management, available on AWS and Azure. However, the metadata management and the security aspect is still the missing piece”.
The training day
The first day of the conference was dedicated to training workshops that include a mix of instruction and hands-on exercises to help attendants improve their Apache Spark skills.
Florian gives insights into his favourite training Tuning and Best Practices. He explains: “The aim of the training was to make programmers aware of how Spark works internally, in order to be able to write optimised applications. They presented a few situations, each one showing one relatively slow process. Then they presented a step-by-step procedure to debug the situation and to find the points that could be improved in the current situation. In summary, tips and tricks to adapt to different situations”.
The sessions at the conference covered data engineering and data science contents along with best practices for productionising AI. The talks were divided into roughly two categories: Spark programming and deployment, and applications on top of Spark (AI applications).
Florian Demesmaeker explains: “I attended 28 talks. The keynotes from Databricks were quite interesting, they presented Delta and MLflow. I also enjoyed the talks about tools to optimise the internals of Spark, these provided good technical details. Other talks were about use cases on top of Spark, it was interesting to see what challenges other companies face and how they address them”.
Sabri Skhiri adds: “The talk Learning to Rank Datasets for Search was very inspiring. Oscar Castañeda-Villagrán, a data scientist working at Xoom (a Paypal service) talked about learning to rank R data set. The idea is that we can extract metadata when the data pipeline is arriving in the lake. Going further, you can not only extract metadata but also calculate a kind of judgment relevance score that will be used for bootstrapping the learning to rank process. In this way, a user can search and retrieve the relevant R data set in the lake. A very good idea for the metadata-driven exploration”.
Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities. You can read their feedback here.
Early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.
They came back with a lot to say about the hot topics in stream processing and the presented use cases! In this article, they will give you their opinion about data Artisans’ main announcement, the intakes of their favourite talks, and what they thought makes Flink Forward different from other conferences.
First keynote announcement:
During the keynote speech, data Artisans announced that they now bring ACID transactions directly on streaming data with data Artisans Streaming Ledger.
Charles Bonneau, our software architect, says: “This feature allows ACID transactions between multiple operators’ event-processing operations and internal states. This means that streaming applications can now update multiple states in one transaction. For example, an application that transfers money from one bank account to another can finally be implemented using Flink with strong consistency guarantees. Both bank accounts will have their balance updated at the same time as if there was a master data-management state”.
For Sabri Skhiri, our R&D director, this opens the doors to a brand new range of applications, especially in data-driven real-time services but also in streaming data management. He explains: “They are pushing forward the concept of streaming. Now, you could imagine a master data-management state that can be updated by operational streaming applications in real time. This will allow even more complex and advanced use cases of stream processing!”.
In 2 days, each Euranovian attended about 18 talks and use case presentations, with speakers from tech giants such as IBM, Netflix, Alibaba, and Uber as well as speakers from smaller companies.
Charles explains: “The conclusions are reassuring: most of them face the same issues that we see at our clients’ and our solutions are all valuable. They include a stream-first data architecture, a stream-first data pipeline product, and Flink developers skills. Even though a number of companies are at the very edge of the technology and their issues do not yet require continuous flows of a considerable amount of events, we are ready”.
For our R&D Director Sabri Skhiri, the keynote speech from Lightbend was one of the most interesting ones. He explains: “Viktor Klang, Lightbend deputy CTO, talked about the convergence between microservices and stream processing. At EURA NOVA, we have been advocating for this convergence for more than a year in our architecture practice. The idea is simple: asynchronous microservices can be designed as stream processing stages. This is fantastic because it makes modern stateful stream processing frameworks the perfect target for implementing reactive microservices. With stateful deployment, exactly once semantics, high availability and ACID access to states, microservices can become stateful streaming apps.”
Vision-oriented Flink Conference:
Our colleagues came back with sparkles in their eyes. When we asked them how they felt about the event, Sabri Skhiri explained:
“Very often, this type of conferences tend to be business oriented. They are focused on how to make the framework easy to use and available to as many people as possible. By contrast, this year’s Flink Forward conference was all about innovation and vision. data Artisans shared their vision of what the Flink framework will be within 3 to 5 years and talked about what role stream processing and big data have within this vision. In fact, almost all the talks were very technical. They were testimonies of big names in the industry, such as Alibaba, Netflix, and ING about problems encountered on the field and how they have been solved, which is often out of the box. The Flink-Alibaba partnership is a sharing one. Alibaba are way ahead with their technology. They keep their lead for 1 year and then they share their work and make their code open source. data Artisans have a great long-term vision of stream processing. I can see a lot of very interesting architecture discussions in the coming months!”
Stream Processing Technology:
When most frameworks cannot process considerable streams of live data and provide results in real time, Flink provides a single runtime for the streaming and batch processing while being highly scalable.
Cyrille Duverne, our Lead Data Architect, confirms: “Flink is definitely a real-time processor! We’re speaking about true real time, not only mini batches etc… Plus, the introduction of ACID transaction management in the new version of data Artisans’ Flink distribution creates a good marketing edge”.
Sabri Skirhi and our R&D engineer Florian Demesmaeker were at the Spark Summit this week. Stay tuned for part 2 with their feedback!
Our paper “Data Mining and Machine Learning Techniques supporting Time-based Separation Concept Deployment”, co-written with Eurocontrol and WaPT, has been accepted by the 37th Digital Avionics Systems Conference (DASC) in London, U.K.
The paper presents two methods to allow air traffic controllers to deliver separation minima accurately and safely, on the basis of time intervals instead of distances.
Importantly, in strong headwind conditions, the aircraft’s groundspeed during approach decreases, meaning that keeping the distance-based separation method results in lower landing rates. At a time of intensified air traffic, this situation leads to considerable delays at airports with significant costs to operators and travellers.
With the new methods presented in the paper, capacity can increase by up to 14% in strong wind conditions, and by up to 8% in moderate wind conditions.
The paper will be presented in September at DASC 2018, but you can already read the abstract below. If you wish to go deeper into the subject, do not hesitate to contact our research department at firstname.lastname@example.org.
The Time-Based Separation (TBS) concept consists in the definition of separation minima for aircraft on the final approach to a runway based on time intervals instead of distances, as applied in Distance-Based Separation (DBS) operations.
TBS allows for dynamic distance separation reductions in strong headwind conditions so as to preserve time spacing across all wind conditions. However, TBS application entails the use of a support tool providing separation distance indicators depending on the applicable time separation minimum, the aircraft speed profile which also depends on the headwind conditions.
This paper details two methodologies allowing a system to compute those TBS indicators so as to allow Air Traffic Controllers to accurately and safely deliver the TBS minima using a separation delivery support tool. The first approach is based on “analytical” data mining and modelling whereas the second one is based on a Machine Learning (M/L) procedure.
In the framework of the deployment of the TBS concept in Vienna airport (LOWW), those approaches are developed and tested using a database covering one year of traffic and corresponding local meteorological data.
The operation of TBS with indicators computed using either approaches leads to substantial diminution of time separations compared to a DBS strategy. However, given the large uncertainties related both to leader and follower aircraft speed profiles, the buffers could be designed only for the most frequent pairs. With the M/L approach (resp. the “analytical” approach), the capacity benefits related to the application of TBS with a separation support tool are of the order of 8% (resp. 2%) in moderate wind conditions, and up to 14% (resp. 10%) in strong wind conditions.
EURA NOVA Research center is proud and excited to organize the third workshop on Real-time and Stream analytics in Big Data, collocated with the 2018 IEEE conference on Big Data. The workshop will take place in December in Seattle, USA.
As the world become more connected, flood of digital data is getting generated, in high volume, and in a high velocity. For industries such as financial markets, telecommunications, Smart Cities, manufacturing, or healthcare, there is an increasing need to process, and analyze, these data streams in real time.
These past two years, we have seen arriving another usage of Stream & complex event processing: the data management. New architecture patterns have been proposed to resolve data pipeline and data management within enterprise.
After the success of the two first edition, this is an excellent opportunity to engage in discussions with experts and researchers, to refine new opportunities and use cases required by the industry.
Authors are invited to contribute to the conference by submitting articles in the (among others) following areas: Scalable real-time decision algorithms, IoT analytics & stream mining, Data pipelines & Data management with Streams and Stream ETL and Real-Time Data Warehouse.
Want to submit a paper? Check out the workshop website to find all the information you will need. Your paper will be reviewed by a prestigious panel of international experts from both the academic and the industrial worlds.
Our paper “Graph BI & Analytics: Current State and Future Challenges” has been accepted for publication at the 20th International Conference on Big Data Analytics and Knowledge Discovery, taking place in Regensburg, Germany.
The paper presents the state of the art of graph BI & analytics, with a focus on graph warehousing. We survey the topics of graph modelling, management, querying, and processing in graph warehouses. Then we conclude by discussing future research directions for solving complex graph problems, building native graph components and intelligent techniques to assist end-users in building and analysing the graph.
More importantly, the paper calls for the development of intelligent, efficient and industry-grade graph data warehousing systems to support the structure-driven management and analytics of data efficiently. While adopting a template that is similar to the traditional BI systems, the graph BI that is presented here extends current systems with graph analytics capabilities that deliver graph-derived insights. The paper will be presented in September at DaWak 2018, but you can already read the abstract bellow. If you wish to go deeper into the subject, don’t hesitate to contact our research department at email@example.com.
Abstract. In an increasingly competitive market, making well-informed decisions requires the analysis of a wide range of heterogeneous, large and complex data. This paper focuses on the emerging field of graph warehousing. Graphs are widespread structures that yield a great expressive power. They are used for modeling highly complex and interconnected domains, and efficiently solving emerging big data application. This paper presents the current status and open challenges of graph BI and analytics, and motivates the need for new warehousing frameworks aware of the topological nature of graphs. We survey the topics of graph modeling, management, processing and analysis in graph warehouses. Then we conclude by discussing future research directions and positioning them within a unified architecture of a graph BI & analytics framework.
The French branch of EURA NOVA will take part in two great tech events in the following days and weeks.
On the 22nd of February, data scientist Thomas Peel will give a talk titled “Machine Learning à l’ère du RGPD” (Machine learning and the General Data Protection Regulation) on the opening day of the Colloquium intelligence artificielle, machine learning, data science to be held at the grand amphitheatre of the Saint-Charles campus in Marseille. Other great speakers from INRIA, Google, Provence Innovation, and Criteo will be featured. The event is free but registration is mandatory.
What? Colloquium intelligence artificielle, machine learning, data science
When? Thursday 22nd of February
Where? Grand amphithéâtre, campus Saint-Charles, – 3, place Victor Hugo – case 39 – 13331 MARSEILLE Cedex 03
On the 12th of March, the French branch of EURA NOVA is organising the Marseille Community Event, supported by the Neo4j GraphTour. Two speakers are already announced: R&D project manager Cécile Péreaira will present a text-mining use case with Neo4j in biology, and data scientist Antoine Bonnefoy will sum up the Parisian Neo4j conference, from technology and business viewpoints. After the talks, all attendees will be offered a casual dinner to pursue the discussion.
What? Marseille Community Event – Neo4j GraphTour
When? Monday the 12th of March, from 6:30 PM to 8:30 PM
Where? Le Wagon, 167 Rue Paradis, Marseille