Over the last year, we have worked with Benjamin Van Melle on implementing High Availability in RoQ, our proof-of-concept distributed pub-sub messaging system. As a consequence, we needed to expand our JUnit tests to cover individual component failure scenarios and prove that they were handled as expected. This post shows how we used Docker to achieve that.
About RoQ
RoQ is a distributed publish-subscribe messaging system that we started developing in 2011. It is built around the concept of Exchanges, which receive messages from publishers and forward them to the clients that have subscribed to the corresponding topic for a given queue. Taking a simplified view of the system:
- the slaves on which Exchanges are hosted each have their own Host Configuration Manager (HCM) process to manage the Exchanges
- one or more masters on the network each run their own Global Configuration Manager (GCM) process. The active master manages the system-wide configuration: the slaves present, the Exchanges they contain and the queues that they handle. Backup masters patiently wait for the active one to die before electing a new leader to take over its duties.
- ZooKeeper (via Netflix’s Curator library) is used to store shared state and to support the leader election mechanism, as sketched below
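To make this more concrete, here is a minimal sketch of what a leader election between GCMs can look like with Curator's LeaderLatch recipe. It is purely illustrative: the class name, the ZooKeeper connection string and the latch path are assumptions rather than RoQ's actual code, and the imports below use the current Apache Curator packages rather than the Netflix-era ones.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical sketch: each GCM creates a LeaderLatch on a shared ZooKeeper
// path; only one instance holds leadership at any given time.
public class GCMLeaderElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All GCM processes use the same latch path (assumed name).
        LeaderLatch latch = new LeaderLatch(client, "/roq/gcm/leader");
        latch.start();

        // Backup GCMs block here until the active one dies (its latch node
        // disappears from ZooKeeper) and this instance is elected leader.
        latch.await();
        System.out.println("This GCM is now the active master");
        // ... start serving the system-wide configuration ...
    }
}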
Docker-based testing
Part of the work we carried out was using Ansible and Docker to deploy RoQ in different environments: a local machine (for running a demonstration or for development) or Amazon EC2 (for production). Once we were able to run RoQ's processes in individual containers, we could use Spotify’s docker-client library to start/stop/restart/pause/remove them at will. This was essential for testing our new features.
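To give an idea of what this looks like, here is a minimal sketch of driving a container through its lifecycle with docker-client. The image name is a placeholder and this is not the exact code we use in RoQ.

import com.spotify.docker.client.DefaultDockerClient;
import com.spotify.docker.client.DockerClient;
import com.spotify.docker.client.messages.ContainerConfig;
import com.spotify.docker.client.messages.ContainerCreation;

public class ContainerLifecycleSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the local Docker daemon using the DOCKER_HOST settings.
        DockerClient docker = DefaultDockerClient.fromEnv().build();

        // "roq/gcm" is a placeholder image name, not necessarily the real one.
        ContainerConfig config = ContainerConfig.builder()
                .image("roq/gcm")
                .build();
        ContainerCreation creation = docker.createContainer(config);
        String id = creation.id();

        docker.startContainer(id);      // launch the process
        docker.pauseContainer(id);      // simulate a frozen node
        docker.unpauseContainer(id);
        docker.stopContainer(id, 10);   // graceful stop, 10s before SIGKILL
        docker.removeContainer(id);     // clean up

        docker.close();
    }
}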
Indeed, the hardest part of building distributed systems is ensuring that they work as expected, which requires testing with several nodes running at the same time. While Chaos Monkey is very useful for forcing random failures in a running system, it is important to catch issues as early as possible, during development. We use JUnit for our testing, and we wanted to extend our existing tests to cover failover scenarios.
As an example of how simple this all becomes when using the right tools, here is the test code that checks that GCM failover works and that queue information is still present in the system after a failover:
@Test
// Note: test setUp provides an environment with:
// - one ZooKeeper node
// - one GCM (master) node
// - one HCM (slave) node
public void testDataConsistency() {
    try {
        // Create a queue called "testQ0"
        initQueue("testQ0");

        // Get the current (active) GCM ID.
        // The launcher is an instance of RoQDockerLauncher,
        // which wraps docker-client to manage our Docker
        // instances.
        String id = launcher.getGCMList().get(0);

        // Create a backup GCM
        launcher.launchGCM();

        // Stop the active one
        launcher.stopGCMByID(id);

        // queueExists() asks the active GCM whether the
        // queue that we've created still exists. Since
        // we've just killed the active one, this call
        // blocks until the backup GCM has become active.
        assertEquals(true, queueExists("testQ0"));
    } catch (Exception e) {
        // Fail the test explicitly if anything unexpected goes wrong.
        e.printStackTrace();
        fail(e.getMessage());
    }
}
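For completeness, the environment described in the comment at the top of the test could be built in a setUp method along these lines. This is only a sketch: launchGCM() and stopGCMByID() appear in the real test, while the class name, the no-argument constructor, launchZK(), launchHCM() and stopAll() are assumed names used for illustration.

import org.junit.After;
import org.junit.Before;

public class GCMFailoverTest {

    private RoQDockerLauncher launcher;

    @Before
    public void setUp() throws Exception {
        launcher = new RoQDockerLauncher();
        launcher.launchZK();   // one ZooKeeper container (assumed method name)
        launcher.launchGCM();  // one GCM (master) container
        launcher.launchHCM();  // one HCM (slave) container (assumed method name)
    }

    @After
    public void tearDown() throws Exception {
        launcher.stopAll();    // stop and remove all containers (assumed method name)
    }

    // ... testDataConsistency() from above ...
}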
Conclusion
In the end, the tests that we’ve written show that:
- master failover works: when we kill the active GCM, a backup GCM takes over
- complete slave failover works: when we kill a slave while it is actively receiving and forwarding messages, another one takes over its load, its publishers and its subscribers.
Lastly, although it was interesting to make RoQ highly available from a technical standpoint, the most important aspect of this work is the knowledge we’ve gained. It will help us ensure that our clients get reliable, easily tested solutions, even when they involve complex distributed systems.