
High Availability in RoQ

Over the past year, we have worked with Benjamin Van Melle on implementing high availability in RoQ, our proof-of-concept distributed pub-sub messaging system. As part of this work, we needed to expand our JUnit tests to cover individual component failure scenarios and prove that they were handled as expected. This post shows how we used Docker to achieve that.

Elastic Messaging for the Cloud

About RoQ

RoQ is a distributed publish-subscribe messaging system that we started developing in 2011. It is built around the concept of Exchanges, which receive messages from publishers and forward them to the clients that have subscribed to the corresponding topic for a given queue. Taking a simplified view of the system:

  • the slaves on which Exchanges are hosted each have their own Host Configuration Manager (HCM) process to manage those Exchanges.
  • one or more masters on the network each run their own Global Configuration Manager (GCM) process. The active master manages the system-wide configuration: the slaves present, the Exchanges they contain and the queues that they handle. Backup masters patiently wait for the active one to die before electing a new leader to take over its duties.
  • ZooKeeper (via Netflix’s Curator library) is used to store shared state and to support the leader election mechanism.
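
To make the leader election mechanism concrete, here is a minimal sketch of how a GCM process could take part in an election using Curator's LeaderLatch recipe. The ZooKeeper address and the "/roq/gcm-leader" election path are illustrative assumptions; RoQ's actual implementation may use a different Curator recipe or configuration.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class GcmElectionSketch {
    public static void main(String[] args) throws Exception {
        // Connect to ZooKeeper (the address is an assumption for this sketch)
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every GCM registers a latch on the same path; ZooKeeper
        // guarantees that exactly one participant holds leadership
        LeaderLatch latch = new LeaderLatch(client, "/roq/gcm-leader");
        latch.start();

        // Blocks until this process becomes the active GCM. When the
        // current leader dies, its ephemeral node disappears and one
        // of the backups is promoted automatically.
        latch.await();
        System.out.println("This GCM is now the active master");
    }
}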


Docker-based testing

Part of our work involved using Ansible and Docker to deploy RoQ in different environments: a local machine (for running a demonstration or for development) or Amazon EC2 (for production). Once we were able to run RoQ's processes in individual containers, we could use Spotify's docker-client library to start, stop, restart, pause and remove them at will. This was essential for testing our new features.
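
As an illustration, here is a minimal sketch of how docker-client drives a container's life cycle. The calls shown are the library's standard API; the "roq/gcm" image name is a placeholder for this sketch.

import com.spotify.docker.client.DefaultDockerClient;
import com.spotify.docker.client.DockerClient;
import com.spotify.docker.client.messages.ContainerConfig;
import com.spotify.docker.client.messages.ContainerCreation;

public class DockerControlSketch {
    public static void main(String[] args) throws Exception {
        // Reads DOCKER_HOST and related settings from the environment
        DockerClient docker = DefaultDockerClient.fromEnv().build();

        // "roq/gcm" is a placeholder image name
        ContainerConfig config = ContainerConfig.builder()
                .image("roq/gcm")
                .build();
        ContainerCreation creation = docker.createContainer(config);
        String id = creation.id();

        docker.startContainer(id);     // bring the process up
        docker.pauseContainer(id);     // freeze it, e.g. to simulate a hang
        docker.unpauseContainer(id);
        docker.stopContainer(id, 10);  // graceful stop, 10 s timeout
        docker.removeContainer(id);    // clean up

        docker.close();
    }
}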

The hardest part of building distributed systems is ensuring that they work as expected, which means testing with several nodes running at the same time. While Chaos Monkey is very useful for forcing random failures in a live deployment, it is important to catch issues as early as possible. We use JUnit for our testing, and we wanted to extend our existing tests to cover failover scenarios.

As an example of how simple this all becomes when using the right tools, here is the test code that checks that GCM failover works and that queue information is still present in the system after a failover:

@Test
// Note: test setUp provides an environment with:
// - one ZooKeeper node
// - one GCM (master) node
// - one HCM (slave) node
public void testDataConsistency() {
    try {
        // Create a queue called "testQ0"
        initQueue("testQ0");

        // Get the current (active) GCM ID.
        // The launcher is an instance of RoQDockerLauncher,
        // which wraps docker-client to manage our Docker
        // instances.
        String id = launcher.getGCMList().get(0);

        // Create a backup GCM
        launcher.launchGCM();

        // Stop the active one
        launcher.stopGCMByID(id);

        // queueExists() asks the active GCM whether the
        // queue that we've created still exists. Since
        // we've just killed the active one, this call
        // blocks until the backup GCM has become active.
        assertTrue(queueExists("testQ0"));
    } catch (Exception e) {
        // Print and fail: swallowing the exception would let
        // the test pass even when something went wrong.
        e.printStackTrace();
        fail(e.getMessage());
    }
}
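
The internals of RoQDockerLauncher are not shown here, but a wrapper along these lines would be enough for the test above. This is a hypothetical sketch: the image name and the ID bookkeeping are assumptions, and stopGCMByID uses killContainer so that the GCM dies abruptly, as in a real crash.

import java.util.ArrayList;
import java.util.List;

import com.spotify.docker.client.DefaultDockerClient;
import com.spotify.docker.client.DockerClient;
import com.spotify.docker.client.messages.ContainerConfig;
import com.spotify.docker.client.messages.ContainerCreation;

// Hypothetical sketch of a launcher like the one used in the test
public class RoQDockerLauncherSketch {
    private final DockerClient docker = DefaultDockerClient.fromEnv().build();
    private final List<String> gcmIds = new ArrayList<>();

    public List<String> getGCMList() {
        return gcmIds;
    }

    public void launchGCM() throws Exception {
        // "roq/gcm" is a placeholder image name
        ContainerCreation c = docker.createContainer(
                ContainerConfig.builder().image("roq/gcm").build());
        docker.startContainer(c.id());
        gcmIds.add(c.id());
    }

    public void stopGCMByID(String id) throws Exception {
        // Kill rather than stop, so the GCM has no chance to hand
        // over cleanly; this is what a real crash looks like
        docker.killContainer(id);
        gcmIds.remove(id);
    }
}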


Conclusion

In the end, the tests that we’ve written show that:

  • master failover works: when we kill the active GCM, a backup GCM takes over.
  • complete slave failover works: when we kill a slave while it is actively receiving and forwarding messages, another one takes over its load, its publishers and its subscribers.

Lastly, although making RoQ highly available was interesting from a technical standpoint, the most important aspect of this work is the knowledge we've gained. It will help us ensure that our clients get reliable, easily tested solutions, even when they involve complex distributed systems.
