Spark+AI Summit: a summary

A few weeks ago, Sabri Skhiri and Florian Demesmaeker were in London to attend the Spark+AI Summit. They came back with a lot to say about the new features of Spark and the presented use cases! In this article, they give you their opinion about Databricks’ main announcement, the takeaways from their favourite talks and training, and what they thought of the new name of the conference.

A new name

This year, Databricks expanded the summit’s scope and renamed it “Spark + AI Summit”. The goal, announced by Databricks co-founder Ali Ghodsi, is to unify data and AI.

Florian Demesmaeker, our R&D engineer, explains: “In some of the keynote talks, the speakers talked about use cases where the job of the data engineer is strongly reduced. The data scientists can easily experiment with data, travelling back and forth in time. This means more focus on AI, rather than on the data engineering part that makes all data accessible to the data scientists”.


Main announcement

In line with this change of name, Databricks announced the release of a complete data science lifecycle platform in the cloud.

Sabri Skhiri, our R&D Director, explains: “It is interesting to see that the change in the event name is actually very visible in the change of Databricks’ strategy. Their tools are now completely dedicated to stream ETL, and there is a huge focus on integrated data management”.

Databricks’ new features include Databricks Delta, which creates data pipelines and provides data views and exploration features. Secondly, Databricks Runtime ML is a ready-to-use environment providing a set of pre-loaded ML frameworks where data scientists can play with data. Finally, MLflow is a tool that simplifies ML model development at enterprise scale.

Our R&D Director adds: “Together, these features provide a complete and unified approach to the machine learning lifecycle and pipeline automation. This looks like a very competitive SaaS offer for integrated data management, available on AWS and Azure. However, metadata management and security are still the missing pieces”.
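
To make the Delta part of this concrete, here is a minimal sketch of a pipeline step, assuming a cluster where the “delta” data source is available; the paths are hypothetical, and the versionAsOf read illustrates the “travelling back and forth in time” mentioned above.

```scala
// Minimal Databricks Delta sketch: write a table, read it back, and "time
// travel" to an earlier version. Assumes an environment where the "delta"
// data source is available; all paths are hypothetical.
import org.apache.spark.sql.SparkSession

object DeltaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("delta-sketch").getOrCreate()

    // Ingest raw events and persist them as a Delta table.
    spark.read.json("/mnt/raw/events")
      .write.format("delta").mode("append").save("/mnt/delta/events")

    // Data scientists get a consistent, queryable view of the table...
    val events = spark.read.format("delta").load("/mnt/delta/events")
    events.groupBy("eventType").count().show()

    // ...and can travel back to an earlier version of the same table.
    val firstVersion = spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/mnt/delta/events")
    firstVersion.show()
  }
}
```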


The training day

The first day of the conference was dedicated to training workshops that combined instruction with hands-on exercises to help attendees improve their Apache Spark skills.

Florian gives insights into his favourite training, Tuning and Best Practices. He explains: “The aim of the training was to make programmers aware of how Spark works internally, in order to be able to write optimised applications. The trainers presented a few scenarios, each showing a relatively slow job, then walked through a step-by-step procedure to debug it and find the points that could be improved. In summary: tips and tricks to adapt to different situations”.
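
As a taste of the material, here is a small sketch, with hypothetical table paths, of one classic trick in this vein: broadcasting a small dimension table so that Spark avoids shuffling the large fact table during a join.

```scala
// One trick of the kind covered in the training: hinting a broadcast join so
// Spark ships the small table to every executor instead of shuffling the
// large one. Table paths are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-tuning").getOrCreate()

    val transactions = spark.read.parquet("/data/transactions") // large fact table
    val countries    = spark.read.parquet("/data/countries")    // small dimension table

    // Without the hint, Spark may pick a sort-merge join and shuffle both
    // sides; broadcasting the small side keeps the large table in place.
    val joined = transactions.join(broadcast(countries), Seq("countryCode"))

    joined.explain() // check the physical plan for a BroadcastHashJoin
  }
}
```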


Favourite talks

The sessions at the conference covered data engineering and data science content, along with best practices for productionising AI. The talks fell into roughly two categories: Spark programming and deployment, and applications on top of Spark (AI applications).

Florian Demesmaeker explains: “I attended 28 talks. The keynotes from Databricks were quite interesting; they presented Delta and MLflow. I also enjoyed the talks about tools to optimise the internals of Spark, which provided good technical details. Other talks were about use cases on top of Spark; it was interesting to see what challenges other companies face and how they address them”.

Sabri Skhiri adds: “The talk Learning to Rank Datasets for Search was very inspiring. Oscar Castañeda-Villagrán, a data scientist working at Xoom (a PayPal service), talked about learning to rank R data sets. The idea is that we can extract metadata when the data pipeline arrives in the lake. Going further, you can not only extract metadata but also calculate a kind of judgment relevance score that can be used to bootstrap the learning-to-rank process. In this way, a user can search and retrieve the relevant R data set in the lake. A very good idea for metadata-driven exploration”.
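
As a toy illustration of the idea (the dataset, its path, and the scoring heuristic are entirely hypothetical), the following sketch profiles a dataset landing in the lake and derives a bootstrap relevance score of the kind that could seed learning to rank.

```scala
// Toy sketch: extract simple metadata from a dataset arriving in the lake and
// turn it into a bootstrap relevance score. The heuristic is hypothetical;
// real judgment scores would come from richer signals and user feedback.
import org.apache.spark.sql.SparkSession

object DatasetRelevance {
  final case class DatasetMeta(name: String, rows: Long, columns: Int, accessCount: Long)

  // Hypothetical heuristic: frequently accessed, wider datasets are assumed
  // more relevant until real user feedback replaces the bootstrap scores.
  def bootstrapScore(m: DatasetMeta): Double =
    math.log1p(m.accessCount.toDouble) + 0.1 * m.columns

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-relevance").getOrCreate()

    // Profile a dataset as it lands in the lake (path is hypothetical).
    val df = spark.read.parquet("/lake/incoming/sales")
    val meta = DatasetMeta("sales", df.count(), df.columns.length, accessCount = 42L)

    println(f"${meta.name}: bootstrap relevance = ${bootstrapScore(meta)}%.2f")
  }
}
```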


Flink Forward 2018: What You Want to Know and What You (Will) Need to Know

In early September 2018, 8 EURA NOVA engineers travelled to Berlin to attend the Flink Forward Conference, dedicated to Apache Flink users and stream processing communities.

They came back with a lot to say about the hot topics in stream processing and the presented use cases! In this article, they give you their opinion about data Artisans’ main announcement, the takeaways from their favourite talks, and what they thought makes Flink Forward different from other conferences.


First keynote announcement:

During the keynote speech, data Artisans announced that they now bring ACID transactions directly to streaming data with data Artisans Streaming Ledger.

Charles Bonneau, our software architect, says: “This feature brings ACID transactions to event-processing operations that span multiple operators and their internal states. This means that streaming applications can now update multiple states in one transaction. For example, an application that transfers money from one bank account to another can finally be implemented using Flink with strong consistency guarantees: both bank accounts have their balances updated at the same time, as if there were a master data-management state”.

For Sabri Skhiri, our R&D director, this opens the door to a brand-new range of applications, especially in data-driven real-time services but also in streaming data management. He explains: “They are pushing forward the concept of streaming. Now, you could imagine a master data-management state that can be updated by operational streaming applications in real time. This will allow even more complex and advanced use cases of stream processing!”.
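
The sketch below illustrates the semantics of the bank-transfer example. It is a conceptual stand-in, not the actual Streaming Ledger API: a transaction function reads and updates two keyed states as one atomic unit, so either both balances change or neither does.

```scala
// Conceptual sketch only: NOT the actual data Artisans Streaming Ledger API.
// It illustrates the semantics of one transaction reading and updating two
// keyed states (account balances) atomically, committing both or neither.
case class Transfer(from: String, to: String, amount: Long)

object LedgerConcept {
  // In-memory stand-in for the keyed state rows a transaction would touch.
  private val balances =
    scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)

  def applyTransfer(t: Transfer): Boolean = balances.synchronized {
    if (balances(t.from) < t.amount) {
      false // abort: insufficient funds, neither account is modified
    } else {
      balances(t.from) -= t.amount // the debit...
      balances(t.to) += t.amount   // ...and the credit happen as one unit
      true // commit
    }
  }

  def main(args: Array[String]): Unit = {
    balances("alice") = 100L
    println(applyTransfer(Transfer("alice", "bob", 30L)))  // true
    println(balances("alice") -> balances("bob"))          // (70,30)
  }
}
```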


Favourite talks:

Over two days, each Euranovian attended about 18 talks and use-case presentations, with speakers from tech giants such as IBM, Netflix, Alibaba, and Uber, as well as speakers from smaller companies.

Charles explains: “The conclusions are reassuring: most of them face the same issues we see at our clients’, and our solutions are all valuable. These include a stream-first data architecture, a stream-first data pipeline product, and Flink development skills. Even though a number of companies are at the very edge of the technology and their issues do not yet involve continuous flows of considerable amounts of events, we are ready”.

For our R&D Director Sabri Skhiri, the keynote speech from Lightbend was one of the most interesting. He explains: “Viktor Klang, Lightbend’s deputy CTO, talked about the convergence between microservices and stream processing. At EURA NOVA, we have been advocating this convergence for more than a year in our architecture practice. The idea is simple: asynchronous microservices can be designed as stream processing stages. This is fantastic because it makes modern stateful stream processing frameworks the perfect target for implementing reactive microservices. With stateful deployment, exactly-once semantics, high availability, and ACID access to states, microservices can become stateful streaming apps.”
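
Here is a minimal sketch of that idea using Flink’s standard keyed-state API (the event type and the service logic are hypothetical): each stage keeps its own state per key, much like a small stateful service would keep a database row.

```scala
// Minimal sketch of a "microservice as a stream stage": a Flink
// KeyedProcessFunction that holds per-key state, playing the role of a small
// stateful service. The OrderEvent type and counting logic are hypothetical.
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class OrderEvent(customerId: String, amount: Long)

class OrderCounter extends KeyedProcessFunction[String, OrderEvent, String] {
  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit =
    count = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("orders", classOf[java.lang.Long]))

  override def processElement(event: OrderEvent,
                              ctx: KeyedProcessFunction[String, OrderEvent, String]#Context,
                              out: Collector[String]): Unit = {
    // Read-modify-write on this key's state: the "service" logic.
    val n: Long = if (count.value() == null) 1L else count.value() + 1L
    count.update(n)
    out.collect(s"${event.customerId} has placed $n orders")
  }
}

object MicroserviceAsStream {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(OrderEvent("c1", 10), OrderEvent("c1", 5), OrderEvent("c2", 7))
      .keyBy(_.customerId)
      .process(new OrderCounter)
      .print()
    env.execute("microservice-as-stream")
  }
}
```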


Vision-oriented Flink Conference:

Our colleagues came back with stars in their eyes. When we asked them how they felt about the event, Sabri Skhiri explained:

“Very often, this type of conference tends to be business oriented, focused on making the framework easy to use and available to as many people as possible. By contrast, this year’s Flink Forward conference was all about innovation and vision. data Artisans shared their vision of what the Flink framework will be within 3 to 5 years and talked about the role stream processing and big data play within this vision. In fact, almost all the talks were very technical: testimonies from big names in the industry, such as Alibaba, Netflix, and ING, about problems encountered in the field and how they were solved, often in out-of-the-box ways. The Flink-Alibaba partnership is a sharing one: Alibaba is way ahead with its technology; it keeps its lead for a year, then shares its work and open-sources its code. data Artisans have a great long-term vision of stream processing. I can see a lot of very interesting architecture discussions in the coming months!”


Stream Processing Technology:

While most frameworks cannot process considerable streams of live data and provide results in real time, Flink offers a single, highly scalable runtime for both streaming and batch processing.

Cyrille Duverne, our Lead Data Architect, confirms: “Flink is definitely a real-time processor! We’re talking about true real time, not just micro-batches. Plus, the introduction of ACID transaction management in the new version of data Artisans’ Flink distribution gives it a good marketing edge”.

Sabri Skhiri and our R&D engineer Florian Demesmaeker were at the Spark Summit this week. Stay tuned for part 2 with their feedback!

Installing TensorFlow with distributed GPU support

Today, I wrote my first “Hello World” script using the freshly open-sourced version of TensorFlow with distributed GPU support. At the time of this writing, the binary releases of TensorFlow don’t come with distributed GPU support, so I had to build TensorFlow from source. All the documentation to do this already exists but is scattered across multiple websites. Here is a condensed version of the install process (on a Linux Ubuntu 14.04 platform).

Continue reading

Graph Data Management: Status and Trends

Today’s social environments are getting more interconnected, and the business market is becoming increasingly open and competitive. Organisations require a better awareness of their state and an accurate prediction of their evolution. To cope with this surging demand, new models and tools need to be developed. In my opinion, graph models are of crucial interest for addressing these challenges.

Continue reading

High Availability in RoQ

Over the last year, we have worked with Benjamin Van Melle on implementing High Availability in RoQ, our proof-of-concept distributed pub-sub messaging system. As a consequence, we needed to expand our JUnit tests to cover individual component failure scenarios and prove they were handled as expected. This piece will show how we used Docker to achieve this.

Elastic Messaging for the Cloud

Continue reading

EURA NOVA Master Theses, 2013-2014 Season

As in previous years, EURA NOVA R&D supervised Master students for their internships or Master theses during the 2013-2014 academic year. This year, 5 students had the opportunity to work in the fields of machine learning, GPU computing, distributed processing, metabolic pathways, and social graphs. This blog post summarises their breakthroughs.


Continue reading

Projects Portfolio Management (PPM)


To meet its strategic objectives, a company has to move from situation A to situation B. To achieve this move, it uses Projects as vehicles for performing Changes.

As you know, vehicles have specific purposes and therefore different sizes, styles, performance levels, costs, lifespans, and sometimes also priorities.

As you also know, vehicles sometimes don’t arrive at destination B as initially foreseen: some may run out of fuel, others may take unexpected routes or need more time than planned to reach their destination. In the worst cases, they may get damaged or even collide with each other.

Let’s see how Projects Portfolio Management can select the right mix of vehicles, control them, and increase the odds that all vehicles arrive at the desired place, on time, within cost, and without bad surprises.

Continue reading