Last month, four EURA NOVA engineers travelled to Barcelona to attend the DataWorks Summit. The conference is organised by Hortonworks, now part of Cloudera, and it focuses on how to apply open source big data technology to accelerate digital transformation initiatives. They came back with a lot to say about the hot topics in AI, machine learning, architecture, the cloud, and the use cases! In this article, they share with us what they learned there and what struck them as particularly useful.
Big Trends
Data architecture
This year, one of the most important trends at the conference was data management and data architecture. Our R&D director Sabri Skhiri says: “There was a real focus on taking data lakes to their next stage and on making them actionable for AI and machine learning. The notion of data hubs was often mentioned, notably during the keynote speeches by Cloudera, IBM, and Pure Storage. However, most platform vendors have not yet been able to provide a fully-fledged ecosystem that allows the exploration, governance, and industrialisation of big data”.
AI industrialisation
This brings us to the second motto of the conference: AI industrialisation is a must. Our data engineer Khalil Amdouni explains: “The conference has been migrating towards AI topics. In the past, the conference used to focus mostly on data ingestion and data processing. It has been moving towards data science. Everyone is talking about AI and machine learning and how to put data science models into production. It’s looking into how to move from data exploration to industrialisation; we heard a lot about Cloudera’s Data Science Workbench etc.”
Production environment
The third trend of the conference was the separation between data processing tools and AI frameworks. Khalil explains: “Spark, Cloudera, and Kubernetes are now all providing production environments (data science management platforms such as Cloudera Data Science Workbench, the Databricks Runtime ML, Kubeflow…) to integrate with machine learning frameworks such as TensorFlow or Python”. Sabri adds: “This is interesting but we should first speak about ‘productisation’, data science model lifecycles, continuous integration and delivery. There are still a lot of shortcomings, like the fact that you need to centralise all your data in one partition before starting your favourite AI framework”.
Data governance
Another hot topic of the conference was data governance and compliance with regulations. Our R&D director goes on to say: “Everybody is speaking about the importance of being GDPR compliant and is proposing tools like Atlas, Egeria, IBM InfoSphere, … but no one says how to actually comply with the GDPR during model deployments or how to deal with access policy management.”
Favourite Talks
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Itai Yaffe presented the journey made by Nielsen’s Marketing Cloud division to provide its customers with real-time analytics tools to profile their target audiences. To achieve its goal, NMC needed to continuously transform its data infrastructure to ingest billions of events per day in a scalable and yet cost-efficient manner.
Sabri says: “The first version of NMC’s architecture included CSV files and standalone Java applications with an OLAP database to expose the results. To reach their goal, NMC’s teams had to scale the process up to handle 10 times as much data”.
Their first step was to change the architecture: they moved to Kafka to ingest data, they leveraged Spark to stream and to aggregate data, and they used HDFS to store data.
Sabri explains: “The issue here was that they had to manage the statefulness of the Spark applications on HDFS by themselves. In addition, the system was error-prone in case of failure. They tried again and looked into Spark Structured Streaming, then tried to combine Spark Streaming with batch ETLs and finally decided to use Kafka to imitate streaming over their data lake. This evolution made the situation really interesting from a business and architectural point of view. Their business goal is to support decision making with machine learning to deliver reports on campaigns. Over the years, they adapted their architecture to go further and reach that objective”.
Our architect Cyrille Duverne adds: “Their story showed how much effort is required to build a long-term architecture. Tools are not enough; you first need the use cases that lead to an architectural vision. Only then can you choose the tools that will support the vision. To build this architecture, you need time and people with the right skills”.
To know more about NMC’s journey, you can find the slides of the presentation here.
Federated Learning
Chris Wallace is a data scientist at Cloudera Fast Forward Labs. He presented how his team leveraged federated learning for predictive maintenance. In their scenario, a manufacturer's customers are not willing to share the details of how their components failed, but they still want the manufacturer to provide them with a strategy to maintain the faulty parts.
Our architect Cyrille Duverne explains: “In this case, federated learning is a kind of distributed deep learning where you train the model on decentralised data. The main idea is that a network of nodes shares models rather than training data with the server. Each node receives the untrained model and trains it on the data it holds locally. Each node then sends a copy of its trained model back to the central server, which averages the models and sends the new model to the different nodes. The process is repeated until the final version of the model is reached.”
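The loop Cyrille describes can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation shown in the talk: it uses a small linear model instead of a deep network, a plain unweighted average on the server, and synthetic data. All function names (local_train, average_models, federated_training, make_node_data) are hypothetical.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Train a copy of the shared model on one node's private data.
    Assumption: a linear model fitted by stochastic gradient descent."""
    w = list(weights)  # copy, so the server's model is not modified in place
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def average_models(models):
    """Server side: element-wise, unweighted average of the node models."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def federated_training(node_datasets, n_features, rounds=20):
    """Repeat: send the model out, train locally, average the results."""
    global_w = [0.0] * n_features  # the untrained model
    for _ in range(rounds):
        # Only model weights leave each node; the raw data stays local.
        local_models = [local_train(global_w, data) for data in node_datasets]
        global_w = average_models(local_models)
    return global_w

def make_node_data(rng, n_samples, true_w):
    """Synthetic, noise-free data for one node (illustration only)."""
    data = []
    for _ in range(n_samples):
        x = [rng.uniform(-1, 1) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, y))
    return data

rng = random.Random(0)
true_w = [2.0, -1.0]
nodes = [make_node_data(rng, 50, true_w) for _ in range(3)]
final_w = federated_training(nodes, n_features=2)
print(final_w)
```

In this toy setup the averaged model ends up close to the weights used to generate the data, even though the server never sees a single training example. Real deployments often weight each node's model by the amount of data it was trained on rather than taking a plain mean.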
Our data scientist Malian De Ron explains: “I find federated learning very interesting. As data scientists, we can work directly on updating models, but we don’t have access to all the training data. Federated learning can be useful for use cases where the customers want to keep their data private. For example, we work for a financial company that works with a bank. Neither of them is willing to share their data. By using federated learning, the training data could remain in its original location, which could satisfy our customer’s privacy concerns.”
To know more about federated learning, you can find the slides of the presentation here.
Data governance with Egeria: The industry’s first open metadata standard
John Mertic is the director of program management for ODPi, the Linux Foundation’s Open Data Platform initiative. He talked about their new open metadata standard Egeria, introduced in September. John Mertic explained how the standard supports the free flow of standardised metadata between different technologies and vendor platforms, enabling organisations to locate, manage, and use their data resources more effectively.
Sabri says: “Companies have 40 years of evolution embedded in their IT systems, resulting in high complexity of data lineage and data silos. In the complex new world of big data and real time, security models have to track data throughout the organisation. This is why data governance and metadata management are hot topics in conferences. Everybody is talking about it and proposing tools such as Egeria, IBM InfoSphere, or Atlas. I talked with IBM InfoSphere people and I had an overview of the Egeria tool. It can be used to federate the IBM InfoSphere Information Governance Catalog, Apache Atlas and even other Egeria cohorts. The IBM Governance Catalog can pull information directly from Egeria and integrate the metadata, the lineage, and even tags from Atlas”.
To know more about Egeria, please find the slides of the presentation here.
Final Thoughts
When working with clients as they make their journey to the new digital world, we noticed recurrent problems in the areas of data access, usage, and governance. In many conferences, we hear stories of companies facing these challenges and making a lot of ad hoc choices but lacking a long-term architectural vision. To crack the challenges, our R&D director Sabri Skhiri designed the Data Architecture Vision (DAV), which later led to digazu.
The Dataworks conference highlighted the need to take data lakes to their next stage. The digazu platform, with its integrated and managed data lake, meets that need. It is a true data hub that integrates real-time and batch dataflows, that collects data from multiple sources, stores it, and distributes it to applications and users across the whole organisation.
Another need mentioned at the conference was that of providing companies with production environments to deploy models. Leveraging ever-increasing amounts of data to provide new services or solve problems requires increasing resources in terms of expertise, time and money. digazu offers a scalable way to keep data pipelines open for business in real time or batches without an army of data experts, lines of code, or complex training.
A third need highlighted at the conference is for companies to achieve good data governance. There are already excellent governance tools such as Atlas, Egeria, and IBM InfoSphere to support the free flow of standardised metadata. digazu opens the door to automated regulatory compliance by providing ready-to-use connectors to data management and governance tools.
To learn more about digazu, visit digazu.com
