Skip to content

High Availability in RoQ

In the last year, we have worked with Benjamin Van Melle on implementing High Availability in RoQ, our proof-of-concept distributed pub-sub messaging system. As a consequence, we needed to expand our JUnit tests to cover individual component failure scenarios and prove they were handled as expected. This piece will show how we used Docker to achieve this.

Elastic Messaging for the Cloud
Elastic Messaging for the Cloud

About RoQ

RoQ is a distributed publish-subcribe messaging system that we started developing in 2011. It is based around the concept of Exchanges, which receive messages from publishers and forward them to the clients that have subscribed to the corresponding topic for a given queue. If we take a simplified view of the system:

  • the slaves on which Exchanges are hosted each have their own Host Configuration Manager (HCM) process to manage the Exchanges
  • one or more masters on the network each run their own Global Configuration Manager (GCM) process. The active master manages the system-wide configuration: the slaves present, the Exchanges they contain and the queues that they handle. Backup masters patiently wait for the active one to die before electing a new leader to take over its duties.
  • ZooKeeper (via Netflix’s Curator library) is used to store shared state and to support the leader election mechanism


Docker-based testing

Part of the work we’ve carried out was to use Ansible and Docker to deploy RoQ in different environments: local machine (for running a demonstration or for developing) or Amazon EC2 (for production). Once we were able to run the processes of RoQ in individual containers, we could use Spotify’s docker-client library to start/stop/restart/pause/remove them at will. This was essential for testing our new features.

Indeed, the hardest part of building distributed systems is ensuring that they work as expected, which implies testing with several nodes running at the same time. While Chaos Monkey is very useful for forcing random failures, it is important to catch issues as early as possible. We use JUnit for our testing, and we wanted to extend our existing tests to cover failover scenarios.

As an example of how simple this all becomes when using the right tools, here is the test code that checks that GCM failover works and that queue information is still present in the system after a failover:

// Note: test setUp provides an environment with:
// - one ZooKeeper node
// - one GCM (master) node
// - one HCM (slave) node
public void testDataConsistency() {
 try {
    // Create a queue called "testQ0"

    // Get the current (active) GCM ID
    // The launcher is an instance of RoQDockerLauncher
    // which wraps docker-client to manage our Docker
    // instances.
    String id = launcher.getGCMList().get(0);

    // Create a backup GCM

    // Stop the active one

    // queueExists() asks the active GCM whether the
    // queue that we've created still exists. Since
    // we've just killed the active one, this call
    // blocks until the backup GCM has become active.
    assertEquals(true, queueExists("testQ0"));

    } catch (Exception e) {



In the end, the tests that we’ve written show that:

  • master failover works: when we kill the active GCM, a backup GCM takes over
  • complete slave failover works: when we kill a slave while it is actively receiving and forwarding messages, another one takes over its load, its publishers and its subscribers.

Lastly, although it was interesting to make RoQ highly available from a technical standpoint, the most important aspect of this work is the knowledge we’ve gained. It will help us ensure that our clients get reliable, easily-tested solutions, even when involving complex distributed systems.

Releated Posts

The Building Blocks of a Responsible AI Practice: An Outlook on the Current Landscape

Responsible AI comes with the challenge of implementation. This survey aims to bridge the gap between principles and practice through a study of different approaches taken in the literature and the proposition of a foundational framework.
Read More

TS-Relax : Interprétation des représentations apprises pour les séries temporelles

Les modèles d’apprentissage de représentations sont de plus en plus utilisés, mais des modèles d’IA explicables et de confiance sont nécessaires. Ce travail présente l’adaptation aux séries temporelles d’une méthode d’interprétation de représentation initialement conçue pour les images.
Read More