Data storage elasticity – quick view on master thesis work (part 1)

Data science

March 29, 2011

Master Theses

In this post I would like to talk about two master's theses that EURA NOVA is supervising with the Faculty of Science Engineering of the Université Libre de Bruxelles (ULB) and with the Université Catholique de Louvain (UCL). The two students have been working on the same topic: the elasticity of data storage in the cloud. The first thing worth noticing is that they address two different aspects of elasticity by taking different directions; taken together, their two contributions draw a complete picture of NoSQL benchmarking in the cloud. In this post I will give you a preview of their work, which should be published in June 2011.

The subject is the study of elasticity in the cloud for data storage. As you may have guessed, their investigations quickly turned towards NoSQL stores. The objectives of the work are (1) studying the data storage options in the cloud, (2) defining KPIs that describe the elastic nature of storage, (3) testing and benchmarking them, and finally (4) defining a taxonomy of storage in the cloud. In this post, I will focus on the second objective for NoSQL.

Two directions

Both students have to measure three main KPIs:

Raw performance: CPU usage, memory footprint and latency.
Scalability: the ability to keep CPU usage, memory footprint and latency acceptable while the load increases.
Elasticity: the most complex KPI to define.

The first student, Nicolas Degroodt [1], has defined elasticity as the response of the system when it adapts itself to the load by adding or removing instances. He then described a criterion to measure this response and proposed to evaluate elasticity according to the way the system behaves. Indeed, adding or removing instances can trigger several reconfigurations in a NoSQL store, which can increase the response time or even make the storage unavailable. Nicolas calls this the “absorption” of the system when instances are added or removed. So great, we have the criterion; now we need a test. Nicolas chose to look at the existing literature on storage benchmarking, such as the TPC-C benchmark, and adapted it to NoSQL. As a result, Nicolas’s work has leaned more towards the benchmark side: he took the time to adapt a benchmarking method and a framework, and to carefully define measures and KPIs.

The second student, Thibault Dory [2], has defined elasticity as the time needed for the system to stabilize when instances are removed or added. To quantify this time, Thibault computes the standard deviation of the response time. More precisely, he measures the standard deviation on a stable cluster, adds a node, and then waits for the standard deviation to return to its previous value; at that point he considers the system stable. The step-by-step methodology can be found here. For the test plan he chose the whole English Wikipedia content (28GB) as the data set on which to run a word search with Map/Reduce. The test results and methodology can be found here. Thibault’s work has clearly been oriented towards comparing existing NoSQL systems, and he has done a great job benchmarking HBase, MongoDB, Cassandra, Voldemort and Riak.
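To make the stabilization criterion more concrete, here is a minimal Python sketch of the idea, not Thibault's actual implementation. It assumes the response times are available as (timestamp, response time) pairs; the function name `stabilization_time`, the sliding `window` and the `tolerance` margin are hypothetical choices made for illustration, since in practice the standard deviation will rarely come back to exactly the same value.

```python
from statistics import pstdev


def stabilization_time(samples, t_add, window=5, tolerance=0.10):
    """Estimate how long the system needs to stabilize after a node is added.

    samples   -- list of (timestamp, response_time) pairs, sorted by timestamp
    t_add     -- timestamp at which the node was added
    window    -- number of consecutive measures used for each check (assumption)
    tolerance -- accepted relative margin above the baseline (assumption)

    Returns the time elapsed between t_add and the end of the first window
    whose standard deviation is back at the baseline level, or None if the
    measures never get back to it.
    """
    before = [rt for t, rt in samples if t < t_add]
    after = [(t, rt) for t, rt in samples if t >= t_add]
    if len(before) < 2:
        raise ValueError("need at least two measures on the stable cluster")

    # Baseline: standard deviation of the response time on the stable cluster.
    threshold = pstdev(before) * (1 + tolerance)

    # Slide a window over the measures taken after the node was added.
    for start in range(len(after) - window + 1):
        chunk = after[start:start + window]
        if pstdev([rt for _, rt in chunk]) <= threshold:
            return chunk[-1][0] - t_add
    return None


if __name__ == "__main__":
    # Made-up measures: a stable cluster, then a node is added at t = 10.
    stable = [(t, 10 + t % 2) for t in range(10)]
    disturbed = [(10 + i, rt) for i, rt in
                 enumerate([30, 12, 45, 15, 38, 20, 12, 9, 8, 9, 8, 9, 8])]
    print(stabilization_time(stable + disturbed, t_add=10, window=4))
```

Note that a short window could also look "stable" in the middle of a plateau of bad response times, so the window size has to be chosen with the sampling rate in mind.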

Figure 1: The first benchmark results from [3], obtained by running a M/R request that searches for a text within the Wikipedia content.

Figure 2 represents the usual behavior of an elastic storage. At t1, we add new nodes: this triggers a reaction of the system, which has to integrate them and reconfigure the index, the routing tables, etc. At t2, the system becomes stable again, in the sense Thibault has defined. We can see that the time needed by the system to stabilize, the delta t (from t1 to t2), is not enough to characterize the storage. We can also consider:

the impact on response time, given by the maximal value of the response time (rtmax).
the alpha 1: the slope of the linear regression line fitted to the measures taken between (t1, rt1) and (tmax, rtmax). This coefficient quantifies how sharply the response time rises towards its maximum.
the alpha 2: the slope of the linear regression line fitted to the measures taken between (tmax, rtmax) and (t2, rt2).
the delta response time, which corresponds to the response time gained by the operation.
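The indicators listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measures are (timestamp, response time) pairs, t1 and t2 are already known, the slopes come from an ordinary least-squares fit, and the delta response time is taken as rt1 minus rt2 (positive when the operation improved the response time). The names `slope` and `elasticity_indicators` are hypothetical.

```python
def slope(points):
    """Least-squares slope of the regression line through (t, rt) points."""
    if len(points) < 2:
        raise ValueError("need at least two measures to fit a line")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_rt = sum(rt for _, rt in points) / n
    num = sum((t - mean_t) * (rt - mean_rt) for t, rt in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den


def elasticity_indicators(samples, t1, t2):
    """Compute the indicators for the measures taken between t1 and t2.

    samples -- list of (timestamp, response_time) pairs, sorted by timestamp
    t1      -- timestamp at which the nodes are added
    t2      -- timestamp at which the system is considered stable again
    """
    window = [(t, rt) for t, rt in samples if t1 <= t <= t2]
    t_max, rt_max = max(window, key=lambda p: p[1])

    # Measures on each side of the response time peak.
    rising = [p for p in window if p[0] <= t_max]
    falling = [p for p in window if p[0] >= t_max]

    rt1 = window[0][1]
    rt2 = window[-1][1]
    return {
        "delta_t": t2 - t1,          # time needed to stabilize
        "rt_max": rt_max,            # worst response time observed
        "alpha_1": slope(rising),    # how fast the response time degrades
        "alpha_2": slope(falling),   # how fast the system recovers
        "delta_rt": rt1 - rt2,       # response time gained (assumed rt1 - rt2)
    }


if __name__ == "__main__":
    # Made-up measures: nodes added at t = 0, system stable again at t = 4.
    measures = [(0, 10), (1, 20), (2, 30), (3, 18), (4, 6), (5, 6)]
    print(elasticity_indicators(measures, t1=0, t2=4))
```

Two storages with the same delta t can then still be told apart: a large alpha 1 with a high rt_max points to a brutal reconfiguration, while a steep negative alpha 2 shows a quick recovery.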

Figure 2: This figure shows the effect of adding new nodes on the system.

Most of the NoSQL benchmarks run so far show a similar behavior.

In the next post, we will describe the adaptation of the TPC-C benchmark to the Yahoo! Cloud Serving Benchmark.

References

[1] Nicolas Degroodt, http://www.linkedin.com/in/nicolasdegroodt

[2] Thibault Dory, http://www.linkedin.com/pub/thibault-dory/7/b0b/991

[3] Benchmark results, http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2/
