We are looking for a way for applications to federate and communicate data. Traditionally we have used SQS, but developers have been generally unhappy doing so, and there has been a call to use Celery. When backing Celery with SQS, the permission set has been left entirely open instead of pared down to a subset of queues based on a prefix. I have been reluctant to accept this as truth, as I feel like something is being overlooked. As an alternative, however, there has been a request to back Celery with RabbitMQ.

The infrastructure group has been tasked with building a system which can survive the loss of multiple availability zones and carry on. For the most part this goal has been achieved, although the implementation has yet to be tested in production. If RabbitMQ were to be brought up as a service bus for the application, then I would want to offer the same guarantee.

A quick search on running RabbitMQ in a high availability configuration recommends clustering the nodes at the Erlang level and mirroring queues across nodes. Talking with our internal Erlang expert, he came across as fairly confident that, at least at the VM level, Erlang should be able to reasonably handle cross-node failure. He mentioned that in the past operational staff were concerned about cross-subnet communication, possibly because of latency.

Breaking out some Docker containers

No sense in spending forever reading about things. Time to get clustering and find some challenges. Version 3.7.5 is the latest currently available, so I will be playing with it. An initial boot via docker run --rm --hostname rabbit-a --name rabbit-a rabbitmq:3.7.5 produces a lot of usable output, including the cookie hash bfZtp3CPs22pTJUSTUEMLw==. Hopefully, without this passed in, it is uniquely generated per container. Just a little farther down the page it describes how to pass this cookie in via the RABBITMQ_ERLANG_COOKIE environment variable.

Time to bring up a second container reusing the same magic. Running docker run --rm --hostname rabbit-b --name rabbit-b -e 'RABBITMQ_ERLANG_COOKIE=bfZtp3CPs22pTJUSTUEMLw==' rabbitmq:3.7.5 did not show the expected cookie in its output; NaWmwlWMdOs0IEZO7lUtvg== appeared instead. Expectedly, the application did not cluster, since I derped on passing those options. If I had actually read the documentation closely I would have known the magic shown in the output is a hash of the cookie, while the environment variable takes the plain text. Time to reboot both containers.
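To spell out the fix: both containers need the same plain-text cookie passed in, and the hash in the log output will then match between them. A sketch, using an arbitrary example cookie value of my own rather than the hash from the logs:

```shell
# Sketch: two containers sharing one plain-text Erlang cookie.
# "secret-cookie" is an arbitrary example value; the log output
# shows a hash derived from it, not the value itself.
docker run --rm --hostname rabbit-a --name rabbit-a \
    -e 'RABBITMQ_ERLANG_COOKIE=secret-cookie' rabbitmq:3.7.5
docker run --rm --hostname rabbit-b --name rabbit-b \
    -e 'RABBITMQ_ERLANG_COOKIE=secret-cookie' rabbitmq:3.7.5
```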

Hmm, after some searching I did not catch an option which describes how nodes discover the cluster. I am just going to hope for now that the cluster is joined via multicast and revisit the issue shortly. Well, Discovered no peer nodes to cluster with in the logs definitely means there is more work to be done. It seems rabbit_peer_discovery_classic_config is the default discovery layer here. According to the RabbitMQ Cluster Formation and Peer Discovery guide, the easiest approach is to modify a configuration file. The big question is whether this could be done with the existing image or whether we need to add more layers.
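For reference, the classic config backend takes a static node list in rabbitmq.conf. A sketch, assuming hostnames rabbit-a and rabbit-b and the default rabbit node name; with the official image this file could presumably be mounted at /etc/rabbitmq/rabbitmq.conf:

```
# rabbitmq.conf -- sketch of static peer discovery via the classic backend
cluster_formation.peer_discovery_backend = classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbit-a
cluster_formation.classic_config.nodes.2 = rabbit@rabbit-b
```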

Under the Additional Configuration section of the Docker documentation, it looks like a flag such as -e 'RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit peer_discovery_backend [{rabbit_peer_discovery_dns}]"' should do the trick.
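Spelled out, the attempt looks roughly like the following; this is a sketch reconstructing the invocation from the flags above, combined with the hostname and cookie handling from earlier:

```shell
# Sketch: pass extra Erlang args to the server via the official image.
# The cookie value is an arbitrary example, not the hash from the logs.
docker run --rm --hostname rabbit-a --name rabbit-a \
    -e 'RABBITMQ_ERLANG_COOKIE=secret-cookie' \
    -e 'RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit peer_discovery_backend [{rabbit_peer_discovery_dns}]"' \
    rabbitmq:3.7.5
```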

Prior Art

Hmm, well that didn’t work as expected. Time to see what others have done. Hopefully I can find a compass for what to do.

Jay Johnson from Levvel’s Experience

The first reasonable hit from the web search was Testing RabbitMQ Clustering using Docker – Part 1, which could be a gold mine of what I am looking for. They took the approach of rolling their own container. The article was written in 2015, so there is a real possibility official RabbitMQ images weren't available yet. Curiously, though, they do use docker-compose, back when the tool was written in Python. I had no idea it was available back then.

Not bad for a guide so early in the product cycle. I wonder what lessons I can glean from it. Interestingly, they use pre-baked magic cookies and a custom entry point. The config file which is baked into the image does not appear to utilize any of the clustering features. The start script is hopefully the meat of how they configure the image. Interestingly, the images do not set up the initial plugins until the container begins execution instead of during the build phase. Perhaps it depends on special identifiers from the instance, or the start time for them is insignificant.

A container from their example runs in one of two modes: either as the initial master node or as a clustered node. The master node is brought up without any special tuning. The clustered nodes run the additional command rabbitmqctl join_cluster rabbit@$CLUSTER_WITH or rabbitmqctl join_cluster --ram rabbit@$CLUSTER_WITH, depending on whether they are a RAM node.
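From what I know of rabbitmqctl, a join has to happen while the RabbitMQ application is stopped on the joining node (the Erlang VM keeps running). A sketch of the sequence a clustered node would run, reusing their $CLUSTER_WITH variable for the master's hostname:

```shell
# Sketch: join an existing cluster from a fresh node.
# The app must be stopped before join_cluster and restarted after.
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@$CLUSTER_WITH          # disc node (default)
# rabbitmqctl join_cluster --ram rabbit@$CLUSTER_WITH  # RAM node variant
rabbitmqctl start_app
```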

Logs are kept within the container and tailed.

The major lesson from this article: another configuration mechanism is to ask RabbitMQ to join a cluster via the command line, in the form rabbit@host. This may or may not require the Erlang VMs to already be clustered.

I will need to return to this exploration at a later date.