
Anomaly Detection: How to Artificially Increase your F1-Score with a Biased Evaluation Protocol

Anomaly detection is a widely explored domain in machine learning. Many models are proposed in the literature, and compared through different metrics measured on various datasets.
The most popular metrics used to compare performance are the F1-score, AUC, and AVPR.
In this paper, we show that F1-score and AVPR are highly sensitive to the contamination rate.
One consequence is that it is possible to artificially increase their values by modifying the train-test split procedure.
This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not well detailed.
Moreover, we show that the F1-score and the AVPR cannot be used to compare performance across different datasets, as they do not reflect the intrinsic difficulty of modeling such data.
Based on these observations, we claim that F1-score and AVPR should not be used as metrics for anomaly detection. We recommend a generic evaluation procedure for unsupervised anomaly detection, including the use of other metrics such as the AUC, which are more robust to arbitrary choices in the evaluation protocol.
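The contamination-rate sensitivity described above can be illustrated with a small simulation. The sketch below is not from the paper; it assumes a hypothetical detector whose anomaly scores follow fixed distributions (normals around 0, anomalies around 2), and shows that changing only the test set's contamination rate moves the F1-score dramatically while the AUC stays essentially unchanged:

```python
import random

random.seed(0)

def sample_scores(n_normal, n_anomaly):
    # Hypothetical detector: anomalies tend to score higher than normals,
    # and the score distributions do not depend on the contamination rate.
    normal = [random.gauss(0.0, 1.0) for _ in range(n_normal)]
    anomaly = [random.gauss(2.0, 1.0) for _ in range(n_anomaly)]
    return normal, anomaly

def f1_score(normal, anomaly, threshold):
    # F1 at a fixed decision threshold (anomaly = positive class).
    tp = sum(s > threshold for s in anomaly)
    fp = sum(s > threshold for s in normal)
    fn = len(anomaly) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def auc(normal, anomaly):
    # AUC = probability that a random anomaly outscores a random normal
    # (ties counted as 0.5); independent of the class ratio.
    wins = sum((a > n) + 0.5 * (a == n) for a in anomaly for n in normal)
    return wins / (len(anomaly) * len(normal))

# Same detector, same score distributions -- only the test-set
# contamination rate differs between the two splits.
low_n, low_a = sample_scores(950, 50)     # 5% contamination
high_n, high_a = sample_scores(500, 500)  # 50% contamination

t = 1.0
print("F1  @  5% contamination:", round(f1_score(low_n, low_a, t), 3))
print("F1  @ 50% contamination:", round(f1_score(high_n, high_a, t), 3))
print("AUC @  5% contamination:", round(auc(low_n, low_a), 3))
print("AUC @ 50% contamination:", round(auc(high_n, high_a), 3))
```

With identical scores, the F1-score roughly doubles when the test split is enriched with anomalies (precision improves mechanically as false positives are diluted), whereas the AUC is stable. This is the kind of arbitrary evaluation choice the abstract warns can artificially inflate reported results.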

Damien Fourure*, Muhammad Usama Javaid*, Nicolas Posocco*, Simon Tihon*, Anomaly Detection: How to Artificially Increase your F1-Score with a Biased Evaluation Protocol, In Proc. of The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2021.

* equal contributions

