DevOps Blog - Nicolas Paris

Docker Swarm, Traefik, HAProxy on Google Cloud Platform in real life

Docker Swarm, Traefik, GCP

In this post, I'll explain the setup and choices behind an infrastructure that runs in production and was designed to switch smoothly from an old on-site infrastructure to Google Cloud Platform (GCP).

This post also aims to show how the setup evolves, from an experiment with a single zone and a single manager to a more robust cluster with multiple zones and multiple manager nodes.

Once this post is published, I will keep updating it with feedback as the setup evolves, to keep the information up to date.

This setup was built as part of my day job, for a small to mid-size company with a few products and no huge needs. The experiment covers 3 websites. One of them is a kind of logging service that receives about half a million requests per day.

Update 06/12/2021

2 years later, this approach has helped us better understand the Cloud Native spirit, without the difficulties of Kubernetes. We are now at the last step, migrating to Kubernetes.
You might want to check this quick thought on Docker Swarm in 2021.

Should You Still Use Docker Swarm in 2021

As I said, we are now migrating to Kubernetes; you might want to check my feedback on the subject.
Docker Swarm is not dead and can still be useful, but you might want to think twice before putting it in production now.

I still think it was the right choice at the right time for us, in our case.

Overview

Here's a quick overview of the technologies used in this example.

This is what it looks like; we'll talk about every aspect of it. This is one solution. Others could be better, but this is the one we'll look at.

The red arrow shows where DNS points during the migration, to keep a proxy that can switch quickly; once things are stable, DNS will point at Cloud Load Balancing. There is an issue with keeping this HAProxy up front for everyday use: the proxy is a single point of failure, as it is zonal and not balanced. If the HAProxy zone goes down, everything goes down, no matter how many regions/zones you put your swarm on.
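To put rough numbers on that single point of failure, here is a small Python sketch. The 99% zone availability is a made-up figure for illustration, not a GCP number. The model also assumes that zones fail independently and that the swarm keeps serving as long as one of its zones is up, which is a simplification.

```python
from math import prod


def series(*availabilities):
    """Every component must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities):
    """One component up is enough: 1 minus the product of failure rates."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical availability of a single zone (assumption, not a GCP figure).
ZONE = 0.99

# Swarm spread over three zones, modelled as "one zone up is enough".
swarm_three_zones = parallel(ZONE, ZONE, ZONE)

# Same swarm, but every request first crosses a proxy living in one zone.
behind_single_zone_proxy = series(ZONE, swarm_three_zones)

if __name__ == "__main__":
    print(f"swarm alone:          {swarm_three_zones:.6f}")
    print(f"behind zonal HAProxy: {behind_single_zone_proxy:.6f}")
```

With these assumptions, the whole chain can never be more available than the single zone hosting the proxy, however many zones the swarm uses.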

Cloud Storage

The first step is to move pictures, files and any other stateful items somewhere else. Google Cloud Storage (GCS) was chosen.

Why not a Kubernetes persistent volume? Because of Kubernetes! It's a great technology, but with a steeper learning curve than Docker Swarm. We are a small team at work, without enough time to learn and maintain Kubernetes.

Why not a Docker volume on a shared filesystem like GlusterFS? I know that some drivers allow that. Once again, we're a small team and this requires more knowledge. I have no idea how this solution would behave at scale. It seems like a risky choice.

This leaves GCS, with a PHP SDK that is really easy to set up and use. It seems like a good NoOps solution, scalable and reliable. However, it can take time to migrate the code to this new solution.

The storage was deployed to production on the old DMZ, before migrating to the new Docker Swarm infrastructure. This makes sure that no errors come from it, as it could be more difficult to spot the problem once migrated to the Swarm.

Cloud SQL

If you don't need the SUPERUSER privilege, this is a great all-in-one solution. Excellent backup plan, replicas and failover out of the box.

It's not cheap: it could be around $100 USD a month for a 2 CPU setup (with no replica and no failover, which cost more). But it works, and well!

This was the second step. Once this is done, it does not matter where the code runs, as long as it's stateless and session affinity is set up in case you need PHP sessions or something like them, since it's more difficult to migrate the session state somewhere else.

HAProxy

The idea is to have a quick way to switch from one infrastructure to the other in case something goes wrong. As explained elsewhere in this post, HAProxy was set up on a single machine, in a single zone, so it becomes a single point of failure. There is no point in having a multi-zone swarm if your entry point is managed by a single zone.

Configuration is straightforward.

You might want to think about the Let's Encrypt IP. Make sure the Let's Encrypt request for the HTTP-01 challenge is properly routed to your Docker Swarm if you want to give yourself some time before switching everyone.

Make sure port 80 is open for Let's Encrypt if Traefik is used for SSL certificates.

No changes were made to the global section; in the defaults section, the mode is set to tcp.

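defaults
    log global
    mode tcp
    option tcplog

# Plain HTTP on port 80, needed for the Let's Encrypt HTTP-01 challenge
frontend unsecure
    mode http
    bind *:80
    default_backend backend_unsecure_gcp

    acl is_mysite_dot_com hdr_end(host) -i mysite.example.com
    # Let's Encrypt source IP; no rule in this excerpt references this acl
    acl certbot src 172.65.32.248/32
    use_backend backend_unsecure_gcp if is_mysite_dot_com

# HTTPS on port 443, passed through in TCP mode and routed on the SNI hostname
frontend front_nci
    bind *:443
    option tcplog
    tcp-request inspect-delay 3s
    tcp-request content accept if { req.ssl_hello_type 1 }
    mode tcp

    use_backend backend_gcp if { req_ssl_sni -i mysite.example.com }
    # backend_test is not shown in this excerpt
    use_backend backend_test if { req_ssl_sni -i mysite.example.net }

backend backend_unsecure_gcp
    mode http
    server gcp 35.xxx.xx.xx:80

# Weighted split between the new (gcp) and old infrastructure;
# "balance source" keeps a given client IP on the same server
backend backend_gcp
    mode tcp
    balance source
    option ssl-hello-chk
    server gcp 35.xxx.xxx.xxx:443 weight 1
    server old xx.xxx.xxx.xxx:443 weight 9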

A few comments on the configuration.
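To get a feel for what `balance source` with weights 1 and 9 does, here is a rough Python model. It is not HAProxy's real hash function (SHA-256 is only a stand-in), and the client addresses are made up. It only illustrates the principle: the source IP is hashed, the hash is spread over the total weight, and the same IP always lands on the same server.

```python
import hashlib
import ipaddress

# Names and weights copied from the backend_gcp section.
SERVERS = [("gcp", 1), ("old", 9)]


def pick_server(client_ip, servers=SERVERS):
    """Map a client IP to a server, proportionally to the weights."""
    total = sum(weight for _, weight in servers)
    if total <= 0:
        raise ValueError("at least one server needs a positive weight")
    # Stand-in hash for this model only; HAProxy uses its own function.
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    slot = int.from_bytes(digest[:8], "big") % total
    for name, weight in servers:
        if slot < weight:
            return name
        slot -= weight


def share_of(name, clients, servers=SERVERS):
    """Fraction of the given clients that land on the named server."""
    hits = sum(1 for ip in clients if pick_server(ip, servers) == name)
    return hits / len(clients)


# 10,000 made-up client addresses: 10.0.0.0, 10.0.0.1, ...
CLIENTS = [str(ipaddress.IPv4Address(0x0A000000 + i)) for i in range(10_000)]

if __name__ == "__main__":
    print(f"share sent to gcp: {share_of('gcp', CLIENTS):.1%}")
```

In this model roughly one client in ten reaches the new infrastructure, and a given client keeps hitting the same side, which is what you want when PHP sessions are involved. Changing the weights moves more clients over.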

Cloud Load Balancing

The cloud load balancer was introduced into the infrastructure before HAProxy, so there is some redundancy. Once the infrastructure is stable, DNS could point at this load balancer only. HAProxy cannot stay as your main entry point, as it's a single point of failure, in one zone only. There is no point in putting your swarm managers in multiple zones and regions if your HAProxy depends on a single zone.

Even Traefik has a built-in load balancer.

If HAProxy is not in place, you may not want to direct all your requests at a single swarm node, the manager for example. If DNS points at the swarm manager, it handles all the inbound traffic. If it goes down, everything goes down. Because the Google Cloud Load Balancer is a strong, resilient point, fully managed by Google, with health checks, it makes a good entry point.

Make sure both ports 80 and 443 are open for Let's Encrypt if you use the HTTP-01 challenge with Traefik certificate management.

GCP has 7 different kinds of load balancing. 2 of them were used.

Here are the two configurations.

Zonal Load Balancing

The backends point at some instances.

In this case, the backend health check was an HTTPS GET on the Traefik dashboard. We'll see that for the other one, it is a whoami container deployed on every balanced instance.

On the frontend, make sure to allow both 80 and 443 if you need Let's Encrypt to go through.

Once this is done, the load balancer's IP is the one you use as a backend in HAProxy.

Global HTTP(S) Load Balancer

Requests are handled on port 443 and proxied to Traefik on port 80. SSL certificates are handled by the load balancer.

This configuration is a bit trickier: you need to set up backend endpoints first, set up certificates and make sure DNS points at the load balancer frontend IP. This is a requirement, as shown in the documentation.

Once set up, it looks like the following screenshot.

First, set up some endpoints.

One endpoint group per zone

One endpoint group contains the VMs of the configured zone. The backend is set to the standard HTTP port.

You will have multiple endpoints on a backend

The following screenshot shows the configuration of one of the two endpoints set up for this backend.

On the front, this is where you'll set up certificates and the health check.

In this case, the health check is an HTTP request routed to a local IP, served by a whoami Docker image deployed on every instance.
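The logic of such a health check can be sketched in a few lines of Python. This is not what the Google load balancer runs internally, only an illustration of "an instance is healthy if its whoami answers". The URLs, the port and the function names are hypothetical.

```python
import urllib.request


def probe_http(url, timeout=2.0):
    """Return True if a GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx) and timeouts are OSError subclasses.
        return False


def healthy_backends(urls, probe=probe_http):
    """Keep only the backends whose probe succeeds, in the given order."""
    return [url for url in urls if probe(url)]


# Hypothetical whoami endpoints, one per instance (made-up local IPs and port).
WHOAMI_URLS = [
    "http://10.132.0.2:8080/",
    "http://10.132.0.3:8080/",
]
```

Calling `healthy_backends(WHOAMI_URLS)` from a machine inside the network returns the instances that answered. The `probe` parameter is there so the selection logic can be tried without any network, for example with `probe=lambda url: True`.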

Docker Swarm

Docker Swarm is the container orchestrator used in this example.

First, why not Kubernetes? Kubernetes seems to have a learning curve that takes more time to get into. Docker Swarm is easy to install. I understand that every platform (GCP, Azure, AWS) has strong Kubernetes integration. Here are some points for Docker Swarm.

I won't explain Docker Swarm or its configuration here.

Traefik

This takes care of routing and dynamic Docker discovery. This means it handles the routes inside the Swarm, from the front to every scaled Docker container.

As a reminder, I played with Docker Swarm and Traefik and posted a couple of blog posts here. They were for testing purposes, but they explain some more basic configuration.

Traefik has to run on a manager node. To run multiple instances, I'm not using traefik.toml but the service YAML configuration file instead. Certificates were either on the Google Load Balancer or in a key-value store like Consul. Traefik can be bound to Consul with little configuration, and it is well documented.

For testing purposes, I was running a single Traefik instance on a single swarm manager, with acme.json and traefik.toml.

About Certificates

If the cluster contains only one manager, and Traefik runs only on managers, it becomes a single point of failure. If Traefik or its node goes down, everything goes down, because it handles your vhost routing. A basic configuration might contain two files: acme.json and traefik.toml. The TOML configuration can be included in the service file. The SSL certificate file is trickier. Traefik can be configured with Consul as a configuration backend, so the solution is to use Consul to share SSL certificates.

The other solution was to handle the certificates directly in the Google Cloud HTTP(S) Load Balancer. This load balancer is global. See the Load Balancer section above for more information.

Consul

As explained in the Traefik section, I needed some configuration shared through a backend for the Let's Encrypt SSL certificates. I tried it with a small Swarm cluster, and it works as expected.

The idea is for the migration: as traffic goes through HAProxy, I use the Traefik Let's Encrypt mechanism. I could use HAProxy for that. But once stable, I'll go for the HTTPS load balancer from Google, as it's a global resource.

Portainer

This is a nice dashboard for your Docker Swarm cluster: services, stacks, logs and so on.

The cluster visualisation can be useful for explaining things, and for a better understanding for those on the team who know nothing about Docker and Docker Swarm.

I'm not really using it for the configuration of stacks, as I use yaml files directly.

Stackdriver

Migration of a service

These are the steps taken for the migration from the old infrastructure to the new one.

  1. Migrate state such as file uploads and downloads; this is done on the old infrastructure.
  2. Migrate the database. This could introduce some latency if it was local before.
  3. Write a Dockerfile for the service if not already done.
  4. Configure HAProxy on the new infrastructure.
  5. Deploy the image on Docker Swarm.
  6. Set up Traefik.
  7. Set the /etc/hosts of your computer to point at HAProxy. This makes testing easy, except for the Let's Encrypt part. If you want to make sure Let's Encrypt will work, set up a test URL beforehand.
  8. At this point, the new infrastructure should be validated, or almost if Let's Encrypt is not ready yet.
  9. Set up HAProxy to point at your old infrastructure. Add a rule sending the Let's Encrypt IP to the new infrastructure for the challenge if needed.
  10. Change DNS to point at HAProxy. At this point traffic goes through GCP but is redirected to your old infrastructure, except for the Let's Encrypt IP (don't forget port 80!). Check the Traefik logs and acme.json to make sure your SSL certificate is ready.
  11. Add a rule for your own IP address to check that everything is fine. At this point there shouldn't be any trouble, as it was already tested.
  12. Play with the HAProxy weights between old and new infrastructure if needed.
  13. Set up everything on the new infrastructure.
  14. Once this is done, you might want to go all in on the global HTTPS load balancer, by switching DNS and generating a certificate on the balancer.
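The routing decision behind steps 9 and 10 can be modelled in Python as follows. This is only a sketch of the rule, not HAProxy configuration: the function name and the "new"/"old" labels are made up, and the only address listed is the one from the `certbot` acl in the configuration above. Let's Encrypt validation requests may also come from other addresses, so treat the list as an assumption to verify.

```python
import ipaddress

# Source ranges treated as Let's Encrypt validators (assumption: only the
# address used in the "certbot" acl of the HAProxy configuration).
CHALLENGE_SOURCES = [ipaddress.ip_network("172.65.32.248/32")]


def choose_backend(src_ip, port):
    """Model of steps 9 and 10: everything goes to the old infrastructure,
    except HTTP-01 challenge requests (port 80) from a listed source."""
    addr = ipaddress.ip_address(src_ip)
    if port == 80 and any(addr in net for net in CHALLENGE_SOURCES):
        return "new"
    return "old"


if __name__ == "__main__":
    # 203.0.113.7 is a documentation address standing in for a visitor.
    for src, port in [("172.65.32.248", 80), ("203.0.113.7", 80), ("203.0.113.7", 443)]:
        print(src, port, "->", choose_backend(src, port))
```

Step 11 is the same idea with your own IP address added as a source that gets routed to the new infrastructure.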

This is a long blog post that might evolve to reflect the latest modifications to the architecture. Questions are welcome.
