
Data storage elasticity – quick view on master thesis work (part 1)

Master Theses

In this post I would like to present two master theses that EURA NOVA is supervising with the Faculty of Science Engineering of the Université Libre de Bruxelles (ULB) and with the Université Catholique de Louvain (UCL). The two students have been working on the same topic: the elasticity of data storage in the cloud. The first interesting point is that they approach elasticity from two different directions, yet at the end of the day their two contributions together draw a complete picture of NoSQL benchmarking in the cloud. In this post I will give you a preview of their work, which should be published in June 2011.

The subject is the study of elasticity for data storage in the cloud. As you may have guessed, their investigations quickly turned towards NoSQL storage. The objectives of the work are (1) studying the data storage options available in the cloud, (2) defining KPIs that describe the elastic nature of storage, (3) testing and benchmarking them, and finally (4) defining a taxonomy of cloud storage. In this post, I will focus on the second objective for NoSQL.

Two directions

Both students had to measure three main KPIs:

  • Raw performance, in terms of CPU usage, memory footprint and latency.
  • Scalability: the ability to keep CPU usage, memory footprint and latency acceptable while increasing the load.
  • Elasticity, the most complex KPI to define.
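As a rough illustration, the first two KPIs can be derived from simple load-test samples. The sketch below is a hypothetical harness (none of these helper names come from the theses): raw performance is summarised per run, and scalability is the latency curve across increasing load levels.

```python
# Minimal sketch: summarising raw performance and scalability from
# load-test samples. Helper names are illustrative assumptions only.
import statistics


def measure_raw_performance(samples):
    """Summarise raw performance from (cpu, mem, latency) tuples."""
    cpus, mems, lats = zip(*samples)
    return {
        "cpu_avg": statistics.mean(cpus),
        "mem_avg": statistics.mean(mems),
        # 95th-percentile latency via nearest-rank on the sorted samples
        "latency_p95": sorted(lats)[int(0.95 * (len(lats) - 1))],
    }


def scalability_curve(results_by_load):
    """Map each load level to its p95 latency; scalability is good when
    the curve stays flat while the load grows."""
    return {load: measure_raw_performance(s)["latency_p95"]
            for load, s in results_by_load.items()}
```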


The first student, Nicolas Degroodt [1], defined elasticity as the response of the system as it adapts itself to the load by adding or removing instances. He then described a criterion to measure this response and proposed to evaluate elasticity according to the way the system behaves during that adaptation. Indeed, adding or removing instances can trigger several re-configurations in a NoSQL store, which can lead to increased response times or even temporary unavailability of the storage. Nicolas captured this with the concept of the "absorption" of the system when instances are added or removed. So we have the criterion; now we need a test. Nicolas preferred to start from the existing literature on storage benchmarking, such as the TPC-C benchmark, and to adapt it to the NoSQL world. As a result, Nicolas's work is oriented more towards the benchmark side: he took the time to adapt a benchmarking method and framework, and to carefully define measures and KPIs.
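One way to turn the absorption idea into a number (a sketch of ours, not Nicolas's actual metric) is to count, during the reconfiguration window, the share of requests still served within a latency budget:

```python
# Hypothetical absorption proxy: the share of requests served successfully
# within an assumed SLA while instances are being added or removed.
def absorption_ratio(request_outcomes):
    """request_outcomes: list of (latency_ms, ok) tuples collected during
    the reconfiguration window. Returns the share of requests that were
    served successfully within the latency budget."""
    SLA_MS = 500  # assumed latency budget per request
    if not request_outcomes:
        return 1.0
    served = sum(1 for latency, ok in request_outcomes
                 if ok and latency <= SLA_MS)
    return served / len(request_outcomes)
```

A ratio close to 1 means the system fully absorbed the reconfiguration; a drop towards 0 reveals slowdowns or unavailability.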

The second student, Thibault Dory [2], defined elasticity as the time needed for the system to stabilize when instances are removed or added. To quantify this time, Thibault computes the standard deviation of the response time: he measures the standard deviation on a stable cluster, adds a node, and then waits for the standard deviation to return to its previous value. At that point, he considers the system stable. The step-by-step methodology can be found here. In the test plan he chose the whole English Wikipedia content (28 GB) as the base data set for a word search executed with Map/Reduce. The test results and methodology can be found here. Thibault's work is clearly oriented towards comparing existing NoSQL systems, and he did a great job benchmarking HBase, MongoDB, Cassandra, Voldemort and Riak.
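Thibault's stabilisation criterion can be sketched as follows; the window size and tolerance are our assumptions, not values taken from his methodology:

```python
# Sketch of the stabilisation criterion: the cluster is considered stable
# again when the standard deviation of the response times drops back to
# its pre-change baseline. Window size and tolerance are assumptions.
import statistics


def is_stable(window, baseline_stddev, tolerance=1.1):
    """A window of response times is 'stable' when its standard deviation
    is within `tolerance` times the baseline measured before the change."""
    return statistics.stdev(window) <= baseline_stddev * tolerance


def time_to_stabilize(response_times, baseline_stddev, window_size=10):
    """Return the index of the first measurement window that is stable
    again, i.e. an estimate of the stabilisation time; None if the
    series never stabilises."""
    for i in range(len(response_times) - window_size + 1):
        if is_stable(response_times[i:i + window_size], baseline_stddev):
            return i
    return None
```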

Figure 1: The first benchmark results from [3], obtained by applying a M/R request to search for a text within the Wikipedia content.

Figure 2 represents the usual behavior of an elastic storage system. At t1, we add new nodes: this action triggers a system reaction to integrate them and to re-configure the index, the routing tables, etc. At t2, the system has stabilized, as Thibault defined it. We can see that the time needed by the system to stabilize, delta t, is not enough to characterize the storage. We can also consider:

  • the maximal value of the response time, rtmax, which measures the impact of the operation on response time.
  • alpha 1: the angular coefficient (slope) of the linear regression line fitted to the measures taken between (t1, rt1) and (tmax, rtmax). This coefficient quantifies how quickly the response time degrades towards its maximum.
  • alpha 2: the angular coefficient of the linear regression line fitted to the measures taken between (tmax, rtmax) and (t2, rt2), i.e. how quickly the system recovers.
  • the delta response time, which corresponds to the response time gained by the operation.
Figure 2: This figure shows the effect of adding new nodes on the system.
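The metrics listed above can be computed from a series of (time, response time) samples taken between t1 and t2. Here is a minimal sketch, with a plain least-squares slope standing in for the regression lines:

```python
# Sketch of the Figure 2 metrics: peak impact, the two regression slopes
# alpha 1 and alpha 2, the response-time gain, and the stabilisation time.
def slope(points):
    """Least-squares slope of a list of (t, rt) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_rt = sum(rt for _, rt in points) / n
    num = sum((t - mean_t) * (rt - mean_rt) for t, rt in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den


def elasticity_metrics(points):
    """points: (t, rt) samples ordered in time, from t1 to t2."""
    tmax, rtmax = max(points, key=lambda p: p[1])
    t1, rt1 = points[0]
    t2, rt2 = points[-1]
    return {
        "rt_max": rtmax,
        "alpha1": slope([p for p in points if p[0] <= tmax]),  # build-up
        "alpha2": slope([p for p in points if p[0] >= tmax]),  # recovery
        "delta_rt": rt1 - rt2,  # response time gained by the operation
        "delta_t": t2 - t1,     # stabilisation time
    }
```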

Most of the NoSQL benchmarks run so far exhibit a similar behavior.

In the next post, we will describe the adaptation of the TPC-C benchmark to the Yahoo! Cloud Serving Benchmark.

References

[1] Nicolas Degroodt, http://www.linkedin.com/in/nicolasdegroodt

[2] Thibault Dory, http://www.linkedin.com/pub/thibault-dory/7/b0b/991

[3] Benchmark results, http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2/
