Case Study:

Crystal Castle

Masterless distributed job scheduling system for load balancing across a PaaS.

The core of this work was to find a way to provide a PaaS that could be run across multiple servers without requiring master nodes to co-ordinate it. Existing solutions like Docker Swarm and Kubernetes all require master nodes to coordinate work between worker nodes. This is also true of my prior art in the field, mainly Field Marshal.

Master nodes are a reasonable solution, but add significant complexity as they need to be redundant for high availability of the swarm. Crystal Castle’s aim is to create a system where all communication and coordination between worker nodes is peer-to-peer. This allows swarms to be set up with much less effort, making the benefits of distributed architectures available to smaller teams with less investment.

In designing Crystal Castle’s scheduling algorithm I took my primary inspiration from the Kademlia papers and the design of distributed databases like Riak. A variant of the gossip protocol was also considered, but a consistent hash ring ended up being the simpler design.

Presentation outlining the algorithm

Implementation of the algorithm in Go

Contact


[email protected]

Let's work together