Blackfish: a CoreOS VM to build swarm clusters for Dev & Production

Note: you may prefer reading the README on the gitlab website.

1 Description

Blackfish is a pre-built CoreOS VM that eases the bootstrap of swarm clusters. It contains all the basic services required to ease the deployment of your apps and to manage basic production operations. All the services are pre-configured for production, with HA and TLS communication enabled.

  • a consul cluster with a registrator
  • an internal HA docker registry
  • an haproxy load balancer with auto discovery
  • centralized logs with an HA graylog2 setup
  • centralized monitoring with telegraf+influx+grafana

The project also comes with all the materials to boot a cluster on your local machine with a Vagrant setup, or on AWS EC2 with terraform scripts.

We hope it will help you fill the gap between "Docker in Dev" and "Docker in Production".


Figure 1: Docker in Dev VS Prod

You'll find other, perhaps simpler, tutorials or GitHub projects to deploy swarm on AWS, but if you don't want your cluster to be exposed on public-facing IPs, you'll have to get your hands dirty on a lot of other things.

Blackfish is built on top of the following components:

  • Packer for building the boxes for various providers (VirtualBox, AWS, KVM, …)
  • Terraform for provisioning the infrastructure on AWS
  • Vagrant for running the swarm cluster in VirtualBox

2 Pre-Requisites

To use this project, you will need at least the following tools properly installed on your box:

  • docker 1.10
  • vagrant 1.8
  • virtualbox 5.0

3 Quickstart

To quickly bootstrap a swarm cluster with vagrant on your local machine, configure the nodes.yml file and type the following commands.

$ git clone
$ cd blackfish
$ vi nodes.yml
provider            : "virtualbox"
admin-network       :
box                 : yanndegat/blackfish-coreos
id                  : myid
stack               : vagrant
dc                  : dc1
consul-joinip       :
consul-key          : hMJ0CFCrpFRQfVeSQBUUdQ==
registry-secret     : A2jMOuYSS+3swzjAR6g1x3iu2vY/cROWYZrJLqYffxA=
ssh-key : ./
private-ssh-key : ~/.vagrant.d/insecure_private_key
  - ip      :
    docker-registry : true
    memory  : "3072"
    cpus    : "2"
  - memory  : "3072"
    cpus    : "2"
    ip      :
  - memory  : "3072"
    cpus    : "2"
    ip      :
$ vagrant up
==> box: Loading metadata for box 'yanndegat/blackfish'
    box: URL:
==> box: Adding box 'yanndegat/blackfish' (v0.1.0) for provider: virtualbox
    box: Downloading:
    box: Progress: 26% (Rate: 1981k/s, Estimated time remaining: 0:02:24)
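The consul-key and registry-secret values in nodes.yml above are just sample credentials; you should generate your own. A minimal sketch, assuming a Linux box with coreutils (consul's gossip key is 16 random bytes, base64-encoded, the same format `consul keygen` produces):

```shell
# Gossip encryption key for consul: 16 random bytes, base64-encoded
consul_key=$(head -c 16 /dev/urandom | base64)
echo "consul-key      : ${consul_key}"

# Registry secret: 32 random bytes, base64-encoded
registry_secret=$(head -c 32 /dev/urandom | base64)
echo "registry-secret : ${registry_secret}"
```

Paste the generated values into your nodes.yml before running vagrant up.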

TLS certificates have been generated in your $HOME/.blackfish/vagrant directory. You have to declare the CA Cert on your host, according to your system.

Once you have registered your TLS certificates and set up your DNS configuration, you can go to https://consul-agent.service.vagrant:4443/ui

Now refer to the Play with swarm section to go further.

4 Getting Started on AWS

This section is obsolete. Several terraform scripts are available to use Blackfish on AWS, but they're too complex. Instead, we want to be able to use the same "nodes*.yml" configuration files to bootstrap swarm on AWS. To do so, we will soon provide some kind of Go binary that reads "nodes.yml" plus terraform templates and outputs a complete terraform directory.

Meanwhile, you can use the terraform/aws/vpc/ templates or refer to the old scripts to bootstrap the VPC, and use vagrant --provider=aws to bootstrap the AWS nodes.

Refer to the AWS README file.

Here's an example of what a nodes.yml for AWS could look like:

provider            : aws
admin-network       :
stack               : demo
dc                  : one
id                  : prod1
consul-joinip       :
consul-key          : dlnA+EWVSiQyfd0vkApmTUu4lDvMlmJcjMy+8dMEVkw=
registry-secret     : rJ/9vXube9iujCjFiniJODQX60Q/XJytUJyOKQfPaLo=

journald-sink       : journald.service.demo:42201
influxdb-url        : http://influxdb.service.demo:48086

labels              :
  - type=control

  - ip              :
    docker-registry : true
  - ip              :
  - ip              :

  region            : eu-west-1
  access-key-id     : XXXXXXXXXXXXXXX
  availability-zone : eu-west-1a
  instance-type     : m3.xlarge
  ebs-optimized     : false
  keypair-name      : my-keypair
  private-ssh-key   : ~/.ssh/my-keypair.key
  subnet-id         : subnet-d779afb3
  s3bucket          : bucket-742587092752
  security-groups   :
    - sg-2676b741

5 Play with your swarm cluster

Now we can play with swarm.

5.1 Using the swarm cluster

You can now use your swarm cluster to run docker containers just as you would run a container on your local docker engine. All you have to do is target the IP of one of your swarm nodes.

$ export PATH=$PATH:$(pwd)/bin
$ blackfish vagrant run --rm -it alpine /bin/sh
/ # ...

5.2 Using the Blackfish internal registry

The Blackfish internal registry, which is automatically started on the swarm cluster, is registered under the registry.service.vagrant name. You have to tag and push docker images with this name if you want the nodes to be able to download your images.

As the registry has a self-signed TLS certificate, you have to declare its CA cert on your docker engine and on your system (again, according to your OS).

$ export PATH=$PATH:$(pwd)/bin
$ sudo mkdir -p /etc/docker/certs.d/registry.service.vagrant
$ sudo cp ~/.blackfish/vagrant/ca.pem /etc/docker/certs.d/registry.service.vagrant/
$ sudo systemctl restart docker
$ docker tag alpine registry.service.vagrant/alpine
$ docker push registry.service.vagrant/alpine
$ blackfish vagrant pull registry.service.vagrant/alpine
$ blackfish vagrant run --rm -it registry.service.vagrant/alpine /bin/sh
/ # ...

5.3 Run the examples

Examples are available in the examples directory. You can play with them to discover how to work with docker swarm.

6 Blackfish Components

6.1 Architecture guidelines

The Blackfish VM tries to follow the Immutable Infrastructure precepts:

  • Every component of the system must be able to boot/reboot without having to be provisioned with configuration elements other than via cloud-init.
  • Every component of the system must be able to discover its peers and join them.
  • If a component can't boot properly, it must be considered dead. Don't try to fix it.
  • To update the system, we must build new components and replace the existing ones.

6.2 Blackfish is built from the following components:

  • a consul cluster with full encryption, which consists of a set of consul agents running in "server" mode and additional nodes running in "agent" mode. The consul cluster can be used:
    • as a distributed key/value store
    • as a service discovery backend
    • as a DNS server
    • as a backend for swarm master election
  • a swarm cluster with full encryption, which consists of a set of swarm agents running in "server" mode and additional nodes running in "agent" mode. Every swarm node also runs a consul agent and a registrator service to declare every running container in consul.
  • an HA private docker registry with TLS encryption. It's registered under the DNS address registry.service.vagrant. HA is made possible by the use of a shared filesystem storage. On AWS, the registry's backend can be configured to target an S3 bucket.
  • a load balancer service built on top of haproxy and consul-template with auto discovery
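For reference, the S3 backend mentioned above corresponds to the standard storage section of the upstream docker/distribution registry configuration. An illustrative sketch (bucket and region are placeholder values; credentials would normally come from IAM roles or access keys):

```yaml
# Illustrative docker/distribution "config.yml" fragment (placeholder values)
storage:
  s3:
    region: eu-west-1
    bucket: my-registry-bucket
```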

Some nodes can play both the "consul server" and "swarm server" roles to avoid booting too many servers for small cluster setups.

6.3 The Nodes.yml

The philosophy behind the "nodes.yml" files is to configure several "nodes.yml" files, each defining a small cluster with specific infrastructure features, and to make them join together to form a complete and robust swarm cluster: multi-AZ, multi-DC, with several types of storage, …

You can select the nodes.yml file you want to target by simply setting the BLACKFISH_NODES_YML environment variable:

$ BLACKFISH_NODES_YML=nodes_mycluster_control.yml vagrant up --provider=aws
$ BLACKFISH_NODES_YML=nodes_mycluster_ebs-storage.yml vagrant up --provider=aws
$ BLACKFISH_NODES_YML=nodes_mycluster_ssd.yml vagrant up --provider=aws

7 Considerations & Roadmap

7.1 Volume driver Plugins

Flocker is available in Blackfish but has been disabled due to a critical bug when running with a docker engine < 1.11. Rex-Ray and Convoy suffer from the same bug. Moreover, Flocker's control service is not HA-ready.

7.2 CoreOS Alpha channel

Too many critical bugs are fixed in every docker release, and the alpha channel is the one that sticks closest to the docker engine. Once a good configuration of the different components has stabilized, we will move to more stable channels.

7.3 Consul + Etcd on the same hosts ?

It sounds crazy, yet it is recommended to use 2 separate consul clusters to run a swarm cluster: one for master election and service discovery, one for the docker overlay network. As we run on CoreOS, there's a feature we'd like to enable: the automatic upgrade of CoreOS nodes based on their channel. To prevent a whole cluster from rebooting all its nodes at the same time, CoreOS can use etcd to provide coordination.

7.4 Use docker-machine

We didn't use docker-machine as it would force us into the "post-provisioning" way of doing things.

7.5 Run consul and swarm services as rkt containers

There are some caveats to running system services as docker containers, even on CoreOS. The main problem is process supervision with systemd, as fully described in this article. Moreover, we don't want system services to be visible and "killable" by a simple docker remote command.

That's why every system component is run with the CoreOS rkt (Rocket) container engine.

7.6 Monitoring

Monitoring and log centralization are provided in a simple yet powerful manner:

Each node can be configured to report metrics and ship its logs to a remote machine. That way, you can bootstrap a cluster, then run monitoring tools such as graylog2 and influxdb on it, and the nodes will automatically start sending data to them.
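In nodes.yml, this is driven by two keys, as in the AWS example earlier (the hostnames and ports below are illustrative):

```yaml
# Remote sink for each node's journald logs
journald-sink : journald.service.demo:42201
# InfluxDB endpoint where each node reports its telegraf metrics
influxdb-url  : http://influxdb.service.demo:48086
```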

7.7 Running on GCE


7.8 Running on Azure


7.9 Running on premise

That's why we've chosen CoreOS. CoreOS comes with powerful tools such as Ignition and coreos-baremetal that allow us to boot our solution on on-premise infrastructures.

7.10 How to do rolling upgrades of the infrastructure with terraform…?

Well, that still has to be defined.

8 Configure DNS resolution

Before using swarm, you have to declare the Blackfish VMs' internal DNS on your system. To do so, you have multiple options:

  • add one of the hosts to your /etc/resolv.conf (quick but dirty)
  • configure your network manager to add the hosts permanently

    echo 'server=/vagrant/' | sudo tee -a /etc/NetworkManager/dnsmasq.d/blackfish-vagrant 
    sudo systemctl restart NetworkManager.service
  • configure a local dnsmasq service which forwards DNS lookups to the Blackfish DNS service according to domain names.

For the latter solution, you can refer to the ./bin/dns file which runs a dnsmasq forwarder service.

$ # check your /etc/resolv.conf file
$ cat /etc/resolv.conf
$ # optionally run the following command for your next boot (according to your OS)
$ sudo su
root $ echo "nameserver" > /etc/resolv.conf.head
root $ exit
$ ./bin/dns vagrant
$ dig registry.service.vagrant

$ ...
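If you prefer to configure a dnsmasq forwarder by hand instead of using ./bin/dns, a minimal sketch could look like the following. The node IP is hypothetical, and the port depends on how your nodes expose the consul DNS service (consul's default DNS port is 8600):

```
# /etc/dnsmasq.d/blackfish-vagrant.conf (assumed path)
# Forward lookups for the .vagrant domain to a Blackfish node;
# everything else stays on your regular resolver.
server=/vagrant/172.17.8.101#8600
```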

9 Registering TLS Certificates

9.1 Registering the CA Root on your OS

Depending on your OS and your distro, there are several ways to register the ca.pem on your system:

$ # Debian/Ubuntu:
$ sudo cp ~/.blackfish/vagrant/ca.pem /usr/local/share/ca-certificates/blackfish-vagrant-ca.crt
$ sudo update-ca-certificates
$ # Arch-based systems:
$ sudo cp ~/.blackfish/vagrant/ca.pem /etc/ca-certificates/trust-source/blackfish-vagrant-ca.crt
$ sudo update-ca-trust

9.2 Registering the certificates on your browser

Currently, there's a problem with the generated certificates under Chrome/Chromium browsers. We have to spend time on it to know why Chrome rejects them. You can still use Firefox by following these steps:

  • go to the Preferences/Advanced/Certificates/View Certificates/Authorities menu
  • import the $HOME/.blackfish/vagrant/ca.pem file
  • go to the Preferences/Advanced/Certificates/View Certificates/Your Certificates menu
  • import the $HOME/.blackfish/vagrant/client.pfx

Now you should be able to access the https://consul-agent.service.vagrant:4443/ui url.

Created: 2016-06-12 Sun. 19:10