Our CTO, Sabri Skhiri, recently travelled to Sorrento for IEEE Big Data 2023, a notable yearly gathering where experts delve into the intricacies of data science and AI. In this article, Sabri explores for you the various keynotes and talks that took place during the conference, highlighting the noteworthy insights and the practical applications shared by industry leaders:
The Big Trends
With over 1000 attendees, the conference underscored its shift towards data science, reflected in the keynote speeches and tutorials. Keynote speaker Jeffrey Ullman from Stanford University highlighted this trend by discussing fundamental distributed computing algorithms like LSH, approximate counting, and triangle counts. These topics, typically foundational in big data, seemed novel to a data science-oriented audience, indicating the conference’s evolving focus. At least novel enough to make it a keynote speech! A tutorial on scalable machine learning with massive data on the cloud using Kubernetes had a surprisingly low attendance., which further underscores this shift.
The conference spotlighted key areas in AI – deep learning, machine learning, and their applications across various sectors like healthcare, mobility, and industry. Sustainability also emerged as a key theme, with discussions on frugal AI and green IT.
Of particular interest was the discussion on robust AI models. For businesses, this translates into developing AI systems that can reliably handle real-world data complexities, like noisy labels and concept drift.
The growing focus on ethical AI, including fairness and uncertainty estimation, reflects a broader industry move towards responsible AI practices, a crucial aspect for maintaining public trust and regulatory compliance. At Euranova, we actively engage in similar discussions, notably through our sustainable AI and responsible AI research track. For those interested in delving deeper into the topic of responsible AI, I recommend our recent discussion with UAntwerpen expert Toon Calders, hosted at Vlerick Business School.
Another key trend was the rise in papers on online learning, particularly in stream processing, highlighting the industry’s response to managing real-time data. Finally, the industry players such as Meta, IBM, Microsoft, X and Yahoo! presented their works.
The Keynotes
Keynote 1/ Big-Data Algorithms That Are Not Machine Learning
In an enlightening keynote, Jeffrey Ullman (from Stanford University) delved into the complexities of managing billion-record datasets, emphasising distributed algorithms for machine learning. The talk highlighted two pivotal methods: Locality-Sensitive Hashing (LSH) and the Flajolet-Martin (FM) algorithm.
LSH, crucial for indexing large datasets, operates on the principle of hashing similar items into the same buckets, thus maximising collisions for similar items. This technique significantly reduces search space and is instrumental in applications like near-duplicate detection and similarity searches.
The FM algorithm, a probabilistic approach, estimates the cardinality of large datasets. It analyses the trailing zeros in hashed values of dataset elements, offering an efficient, memory-conserving method for estimating the number of unique elements in massive datasets.
A deep dive into the algorithms: LSH
Locality-Sensitive Hashing (LSH) is a method used in computer science for efficiently searching and indexing large datasets, particularly for tasks like near-duplicate detection, similarity search, and dimensionality reduction. The core principle of LSH is to hash input items in such a way that similar items map to the same “buckets” with high probability, while dissimilar items are less likely to do so.
How LSH Works:
- Hash Functions: LSH uses hash functions, but unlike traditional hash functions that aim to minimise collisions (different items mapping to the same hash value), LSH hash functions maximise collisions for similar items.
- Bucketing: Each item in the dataset is hashed several times using different hash functions, resulting in it being placed into several buckets. The intuition is that similar items will end up in the same buckets more often than dissimilar ones.
- Indexing: By creating an index of these buckets, the algorithm reduces the search space significantly. When querying, instead of comparing the query item with the entire dataset, you only compare with items in the same buckets.
- Tuning: The accuracy and efficiency of LSH depend on the choice of hash functions and the number of hash tables used. More hash tables increase the probability of detecting similar items but also increase the storage and computational cost.
Example of LSH in Practice:
Imagine you have a large database of images, and you want to find images that are similar to a given query image. Doing a pairwise comparison with every image in the database would be computationally expensive.
With LSH, you can hash each image using several hash functions designed to capture various aspects of the images (e.g., colour distribution, texture patterns, edge orientations). These hash functions will place each image into several buckets.
When a query image is received, it is also hashed using the same functions and placed into corresponding buckets. Instead of searching the entire database, you only search within these buckets to find similar images.
For example, if your query image is a beach scene, LSH might place it in buckets that contain images with similar colour distributions (like blues and tans) and textures (like waves or sand patterns). By only searching within these buckets, you can quickly find images with similar features.
Applications:
- Near-Duplicate Detection: In large datasets, LSH can help identify near-duplicate documents, images, or videos.
- Recommendation Systems: LSH can be used to find similar items or users in recommendation systems.
- Data Compression: By grouping similar items, LSH can be used to compress data or reduce dimensionality.
In summary, LSH provides an efficient way to index and search through large datasets by ensuring that similar items are more likely to be found quickly, reducing the computational burden of searching through the entire dataset.
A deep dive into the algorithms: approximate counting
The Flajolet-Martin (FM) algorithm is a probabilistic algorithm used to estimate the number of distinct elements (cardinality) in a large dataset, with a single pass and using limited memory. This algorithm is particularly useful when dealing with massive datasets where a precise count is either impossible or impractical due to memory constraints.
How the Flajolet-Martin Algorithm Works:
Hashing Elements: Each element in the dataset is hashed to a binary string using a uniform hash function.
- Observing Trailing Zeros: For each hashed value, the algorithm observes the number of trailing zeros in its binary representation. For instance, if the binary hash of an element is `1100100`, the number of trailing zeros is `2`.
- Maintaining Maximum: The algorithm maintains the maximum number of trailing zeros (`R`) seen so far across all elements.
- Estimating Cardinality*: The estimate of the number of distinct elements is then given by \(2^R\). This estimate is based on the probabilistic assumption that with a uniform hash function, a binary string with `R` trailing zeros appears with a probability of \(1/2^{R+1}\).
Example Use-Case:
Imagine you’re managing a website, and you want to estimate the number of unique visitors per day. Tracking each unique visitor exactly would require substantial memory if the number of visitors is very large.
Using the Flajolet-Martin algorithm, each visitor’s unique identifier (like an IP address) is hashed. The algorithm observes the number of trailing zeros in each hash value and keeps track of the maximum number found (`R`). At the end of the day, the estimated number of unique visitors is calculated as 2^R.
Benefits and Limitations:
- Efficiency: The FM algorithm is highly memory-efficient, making it suitable for streaming data or extremely large datasets.
- One Pass: It requires only a single pass through the data.
- Approximation: The trade-off for efficiency is precision. The FM algorithm provides an estimate, not an exact count. The accuracy can be improved by averaging the results of multiple FM estimators (using different hash functions).
In summary, the Flajolet-Martin algorithm provides an efficient probabilistic method to estimate the number of distinct elements in a dataset, especially useful in situations where exact counts are impractical due to memory or computational constraints.
Keynote 2/ Have Your Data and Share It Too: Addressing Privacy in Healthcare Data Sharing
Garofalakis Minos, the second keynote speaker, focused on the challenges and solutions in sharing healthcare data while preserving privacy. He emphasised the growing volume of healthcare data, which is increasing at a compound annual growth rate of 16%. With this growth comes heightened sensitivity and privacy concerns, highlighted by rising numbers of data breaches and fines.
Privacy Challenges
A key issue identified is the lack of clarity around privacy in IT requirements. Understanding and legally sharing private healthcare data remains a complex challenge. Traditional anonymisation methods, like removing personally identifiable information (PII), are proving insufficient, as demonstrated by incidents like the Massachusetts Governor’s breach.
Note: The Massachusetts Governor’s breach refers to a data security incident involving UMass Chan Medical School, affecting over 134,000 individuals enrolled in certain state programs. The breach was part of a global security incident involving the MOVEit file-transfer software program, impacting various organisations including government agencies and financial services firms. This incident did not compromise UMass Chan or state systems but involved the potential unauthorised acquisition of personal information due to a MOVEit security flaw. The affected individuals, primarily participants in various state health and elder care programs, were notified and offered free credit monitoring and identity theft protection services.
Anonymisation in Healthcare
Sharing healthcare data, especially for rare diseases, requires federating data across hospitals while adhering to legal and ethical guidelines. The talk critiqued naive anonymisation approaches and stressed the importance of considering quasi-identifiers.
Privacy-Enhancing Technologies (PETs)
Garofalakis Minos delved into PETs, including:
- Federated Learning: Involves training models across decentralised devices without the need for data exchange.
- Secure Multi-Party Computation (SMPC): Enables joint computation of functions without revealing individual inputs.
- Differential Privacy (DP): Ensures the privacy of an algorithm’s output, rather than the data itself, even in contexts such as deep learning where models might memorise training data details. Solutions like differentially private Stochastic Gradient Descent (SGD) add noise to gradients to maintain privacy.
- Local Differential Privacy: Enhances federated learning by adding local randomisation to updates before their aggregation.
- Distributed Ledger: Leveraging the blockchain to guarantee privacy.
Federated Learning:
Federated Learning is an approach in data privacy, particularly relevant in healthcare. It involves training machine learning models across multiple decentralised devices or servers containing local data, without data exchange. This enables collaborative learning while ensuring data privacy. In healthcare, this approach is instrumental in enhancing predictive models across hospitals, utilising shared models trained on data from all hospitals, but without sharing the sensitive patient data. This not only safeguards privacy but also complies with stringent regulations like HIPAA.
Secure Multi-Party Computation (SMPC):
SMPC is a cryptographic method enabling multiple parties to jointly compute a function over their data without revealing the actual data. In healthcare, this could mean hospitals collaboratively analysing patient data for research without exposing individual patient records. For example, hospitals could jointly compute the average treatment effectiveness across various demographics without sharing sensitive patient data. SMPC’s distributed computation ensures no single entity possesses the complete dataset, maintaining data confidentiality.
Differential Privacy (DP):
Differential Privacy ensures privacy preservation in data analysis. In the context of healthcare, DP can be applied to algorithms analysing patient data, ensuring that the output of these algorithms doesn’t compromise individual privacy. This is particularly crucial when dealing with deep learning models that have the capacity to memorise fine details of training data, leading to potential privacy breaches. Techniques like differentially private Stochastic Gradient Descent (SGD) add Gaussian noise to gradients during training, providing a balance between data utility and privacy.
Local Differential Privacy in Federated Learning:
Enhancing federated learning, local differential privacy involves adding noise to data or model updates locally before they are sent to a central aggregator. In a healthcare scenario, this would mean that individual patient data or insights derived from this data are ‘noised’ before being sent for aggregation in a central model, ensuring that patient privacy is maintained even if the central server is compromised.
Synthetic Data:
The keynote addressed the potential of synthetic data as a solution for sharing while preserving privacy. The technique PrivBayes, which uses Bayesian networks to generate privacy-preserving synthetic data, was highlighted. It maintains the statistical properties of original data, balancing utility and privacy.
For further reading, the following resources were cited:
– [Defeating Image Obfuscation with Deep Learning](https://www.researchgate.net/publication/307604221_Defeating_Image_Obfuscation_with_Deep_Learning)
– [PrivBayes: Private Data Release via Bayesian Networks](http://dimacs.rutgers.edu/~graham/pubs/papers/privbayes-tods.pdf)
Favourite Talks
Audience Prospecting for Dynamic-Product-Ads in Native Advertising (Yahoo!)
The paper presents two novel audience prospecting methods for Dynamic-Product-Ads (DPA) in native advertising, specifically within the Yahoo Gemini marketplace.
Problem statement: the paper focuses on the challenge of finding and expanding the right audience for each Dynamic-Product-Ad (DPA) in the Yahoo Gemini marketplace. Traditional approaches like targeting users who have visited the advertisers’ websites (Retargeting), users who searched for specific products (Search-Prospecting), or users in preferred locations (Location-Prospecting) are seen as having limited capabilities for audience expansion.
- Conversion-Prospecting: This method predicts DPA conversion rates using logged data to calculate expected cost-per-action (CPA), which is then used to optimise DPA bids in auctions. The goal is to maintain performance goals while expanding the audience.
- Trending-Prospecting: This approach focuses on matching trending products to users by analysing their engagement with products on advertisers’ sites. It helps in identifying popular products and user preferences, particularly beneficial for new advertisers and products.
The paper demonstrates that these approaches resulted in significant lifts in DPA delivery and revenue while maintaining CPA within target ranges. The methods are now in production, serving all Gemini native traffic and accounting for a substantial portion of DPA prospecting traffic. The paper’s contributions include providing a comprehensive overview of the Yahoo Gemini DPA system and introducing these new prospecting types, which have shown effectiveness in web-scale testing.
An Adaptive Hierarchical Method for Anytime Set-wise Clustering of Variable and High-Speed Data Streams BigD525
Problem Addressed:
The existing set-wise clustering methods struggle with variable and high-speed data streams, leading to accuracy issues. Set-wise clustering is important in various contexts like retail chain clustering, community clustering based on text, and restaurant categorisation, where it’s more meaningful to group sets of objects based on distribution patterns rather than individual objects.
Key Contributions
- Handling Variable Inter-Arrival Rates: ANY SET CLUS can effectively manage the variable inter-arrival rates of stream objects, which is a challenge for existing methods.
- AnySetClusTree Indexing Structure: It introduces a new indexing structure called AnySetClusTree. This structure stores a hierarchy of micro-clusters of multi-set entities at varying granularity, adapting to the stream’s flow.
- Incremental Model Updates: The method supports incremental updates to the clustering model, making it highly adaptive to changes in the data stream.
- Outlier Segregation and Transition: It can segregate outliers and enable their transition to concepts, capturing concept drift effectively.
- Anytime Offline Clustering: ANY SET CLUS enables anytime offline clustering, where it can generate clusterings of varying granularity and purity based on the available time for clustering. This feature is particularly useful for managing high-speed data streams.
Effectiveness:
The experimental results demonstrate the superior efficacy of ANY SET CLUS in handling variable and high-speed streams compared to existing methods. It achieves higher micro-cluster purity for both low and high-speed streams and proves effective in any time offline clustering, a capability not present in the state-of-the-art method.
In conclusion, ANY SET CLUS addresses the limitations of previous set-wise clustering methods, particularly in handling variable and high-speed data streams, by introducing an indexing structure and supporting features like incremental model updates and anytime clustering. This makes it a robust solution for real-world applications where data streams vary in speed and require flexible, adaptive clustering approaches.
Note: the fingerprint vectors are initially computed on stream or in batch?
The ANYSETCLUS method in the paper involves an initial setup phase where the fingerprint vectors for the entities are created. This setup is done using an initial sample of data from the dataset, where k-means clustering is applied to this sample to determine anchor points. The fingerprint vectors are then constructed based on the distribution patterns of data objects around these anchor points. The initial setup of the fingerprint vectors seems to be done in an online manner with this initial sample, and once set up, the model can then be updated incrementally as new stream data arrives. This initial phase is important for establishing a baseline clustering model, which is then refined and updated incrementally with incoming stream data.
Leveraging Large Language Models for Structure Learning in Prompted Weak Supervision BigD476
Problem Addressed:
Scarcity of Labelled Data in Supervised Learning: Many machine learning applications are hindered by a lack of large, labelled datasets, which are costly and time-consuming to produce.
Challenges in Weak Supervision: Weak supervision, a method that uses multiple noisy labelling functions (LFs) to generate probabilistic labels, face difficulties in managing the dependency structures among these functions. This issue is exacerbated in Prompted Weak Supervision (PromptedWS), where pre-trained large language models (LLMs) are used to generate labels. The LFs in this setup can be strongly correlated due to model biases, impacting labelling performance.
Difficulties in Structure Learning: Learning the dependency structure among LFs in a weak supervision context, especially when these functions are derived from LLMs, presents significant challenges. Existing methods struggle with this, especially in the PromptedWS setup.
Key Idea of the Contribution:
Structure Refining Module: The paper introduces a Structure Refining Module as a novel approach to address the dependency structure challenges in PromptedWS. This module uses the intrinsic structure in the embedding space of the LLMs to refine the structure of labelling functions.
- Two Core Components – LaRe and CosGen: The module consists of two main components: Labelling Function Removal (LaRe) and Correlation Structure Generation (CosGen). LaRe detects and removes redundant prompted LFs to reduce biases and computational costs. CosGen efficiently learns the dependency structures among prompted LFs using their vector embeddings, without relying on labels or unlabelled data.
- Improvements in PromptedWS Pipeline: The Structure Refining Module enhances the PromptedWS pipeline significantly in terms of accuracy and computational efficiency. The paper demonstrates improvements of up to 12.7 points on benchmark tasks over baseline methods.
- Comprehensive Analysis: The paper also includes extensive ablation studies and analysis, validating the assumptions behind the proposed method and analysing the contributions of each component.
- In essence, this paper presents an innovative approach to refining the structure of labelling functions in Prompted Weak Supervision by leveraging the similarity in embeddings of LFs, thereby addressing key challenges in weak supervision and improving both the efficiency and effectiveness of the labelling process.
https://github.com/BatsResearch/su-bigdata23-code
LaRe
Labelling Function Removal operates by identifying and removing redundant prompted Labelling Functions (LFs) in a Prompted Weak Supervision framework. The process can be summarised as follows:
Analysing the Similarity Matrix: LaRe begins by examining a similarity matrix among the prompted LFs set. This matrix is crucial for identifying potential redundancies among LFs. A higher concentration of similarities in this matrix suggests greater correlations or the possibility of redundant LFs.
Deciding Number of LFs to Remove: The user decides how many LFs to remove based on the analysis of the similarity matrix. A larger set of LFs often indicates the potential for more redundancies, thus necessitating the removal of more LFs.
Removing Redundant LFs: After determining the number of LFs to be removed, the process identifies potentially redundant LFs based on their similarities. The pairwise similarities for each LF pair are ranked, and in each step, the LF pair with the highest similarity is identified. Then, the LF with the larger index in this pair is removed. This approach helps to keep the algorithm deterministic, ensuring consistent results.
The goal of LaRe is to reduce computational needs while maintaining the accuracy of labelling. By removing redundant LFs, it seeks to streamline the labelling process, ensuring that the remaining LFs are diverse and contribute effectively to the labelling task.
What is the weak supervision ?
Weak supervision is an approach in machine learning where the training data is labelled using noisy, imprecise, or less reliable sources, as opposed to traditional supervised learning that relies on a large amount of accurately labelled data. This method is particularly useful when obtaining a large set of perfectly labelled data is too costly, time-consuming, or difficult.
Characteristics of Weak Supervision:
- Noisy Labels: Labels generated under weak supervision are often noisy or imprecise, meaning they are not always correct.
- Multiple Sources: It often involves aggregating labels from multiple sources, each of which might be unreliable or biased in its own way.
- Efficiency: The key advantage is efficiency, as it allows for the use of unlabelled data or easily obtainable but imperfect labels.
Example of Weak Supervision:
Imagine you’re trying to build a machine learning model to classify news articles into categories like “Sports,” “Politics,” and “Technology.” However, manually labelling a large dataset of news articles is time-consuming and expensive. Instead, you decide to use weak supervision. Here’s how you might proceed:
- Use of Keywords: You write simple rules that label an article based on the presence of certain keywords. For instance, any article containing words like “election” or “parliament” is labelled as “Politics,” and those with “football” or “Olympics” are labelled as “Sports.” However, these rules are not perfect. An article about a politician attending a football game might be incorrectly labelled as “Politics” instead of “Sports.”
- Aggregating from Multiple Sources: You also scrape labels from various websites that have some form of categorisation but are known to be inaccurate occasionally.
- Combining Labels with a Model: You then use a model to combine these noisy labels, trying to estimate the most likely true label for each article. This model accounts for the known inaccuracies and biases of your labelling functions and sources.
The goal is to train a classifier with this weakly supervised data, accepting that the labels are not perfect. While the model might not be as accurate as it would have been with a fully labelled dataset, weak supervision provides a practical way to train models when perfect labels are not available.
BigD555 Striking a Balance in Fairness for Dynamic Systems Through Reinforcement Learning" (Regular)
The paper presents a new approach to integrating fairness into sequential decision-making systems using Reinforcement Learning (RL). The key contributions include:
- Proposing a Framework: The paper introduces an algorithmic framework that addresses both short-term and long-term fairness in dynamic systems. This framework is based on Markov Decision Process (MDP) and utilises a combination of pre-processing and in-processing approaches for fairness integration.
- Methodology: The authors adopt a method combining action massaging for short-term fairness and advantage regularisation for long-term fairness.
- Case Studies: Three case studies are provided to demonstrate the effectiveness of the proposed method. These studies focus on diverse scenarios like bank loans, attention allocation, and epidemic control, offering practical insights into the implementation of the framework.
- Findings: The results indicate that the proposed method can successfully balance short-term fairness, long-term fairness, and utility in sequential decision-making systems.
In the context of the paper, Reinforcement Learning (RL) is interesting because it addresses the dynamic nature of fairness in decision-making systems. Fairness in real-world scenarios is not static; it evolves with time and context. RL allows the system to learn and adapt its fairness strategies, ensuring that the system remains fair and relevant over time. This adaptability is key in maintaining both short-term and long-term fairness in dynamic systems, which is the primary focus of the paper.
The paper includes a case study on bank loans to explain decision-making in a sequential context. In this scenario, a simulated bank lending environment is created where an agent assumes the role of a bank. The agent must make decisions about whether to grant loans to a stream of applicants. Each applicant’s qualification is described by a discrete credit score, which is subject to change based on the loan decisions made by the bank. At each time step, the bank observes a loan applicant, who is selected from a pool of applicants. This case study serves as a practical example of how sequential decision-making can be applied in a real-world scenario, illustrating the dynamic nature of decision-making and the potential impacts of these decisions on applicant credit scores.
Editor’s question: Ethical question: if the minority groups are correlated with failure of paying credit (which can be the case for poorer groups), how this kind of model should behave? Answer: No IDEA ! The path to ethical AI and transparency is still long…
Favourite Tutorials
This tutorial, with all slides available here, delved into the requirements and challenges of ensuring Trustworthy AI. Key challenges included the ability to explain decisions and predict accurately on novel data, such as anomalies and new distributions (in computer vision contexts)
Addressing novel data integration poses difficulties since this data may not be available during training or might represent an entirely new distribution.
The full notes are available here.
Part 1: Solutions at Inference Stage
The tutorial outlined steps to utilise information from novel data, focusing on:
- Assessing the novel data in relation to the training distribution.
- Extracting information from this novel data.
- Techniques utilising this new information.
The Fisher Information metric was proposed for understanding the variation in likelihood functions, indicating how well a new input fits the distribution. Techniques like GradCAM were discussed for feature attribution in network classifications, leveraging gradients to gather local distribution information.
Part 2: Visual and Gradient Explanations
The tutorial highlighted the importance of visual explanations in understanding decision-making, categorising them into observed correlations, counterfactual observations, and contrastive observations. These methods help assess the network’s knowledge by exploring different scenarios.
Gradient explanations, using tools like GradCAM, provide pixel-level importance scores, aiding in understanding the model’s decision-making process. The tutorial referenced papers that offer deeper insights into contrastive explanations and their significance in neural networks.
Part 3: Uncertainty and Detection Techniques
The concept of uncertainty was explored, particularly epistemic uncertainty, which stems from a lack of knowledge. The tutorial proposed methods to quantify prediction uncertainty within a single network pass.
Two techniques were highlighted:
- Back-propagation confounding labels for adversarial detection: A method for enhancing neural networks’ ability to detect anomalies, using gradient information to distinguish between normal and unusual inputs.
- Back-propagation confounding labels for robust prediction: This involves a two-stage neural network architecture, consisting of an initial decision stage and a reflection stage that reassesses decisions based on gradient information.
The tutorial also discussed the importance of model calibration, ensuring that a model’s confidence in its predictions aligns with actual outcomes.
Editor’s Note: The techniques presented, such as GradCAM for explanations and the novel approaches for uncertainty and adversarial detection, align well with metrics for responsible AI. These include providing various forms of explanations (‘Why P?’, ‘What if not P?’, ‘Why P, rather than Q?’), expected calibration error, and creating classifiers for detecting new information and monitoring model behaviour. These approaches are crucial for advancing the field of Trustworthy AI and ensuring responsible AI implementations.