This post was originally hosted at the Wealthsimple Engineering Blog, which is no more. It was saved by our friends at the Internet Archive.

At Wealthsimple, our goal is to have scalable, secure, and stable infrastructure, while also allowing developers to quickly and easily deploy new code. We want to provide a great customer experience with an always-up system, while still giving access to great features as rapidly as we can add them. This will be the first in a series of posts about the technologies we use to manage our platform, deploy our software, and some approaches to security.

This post will talk about how we approach configuration management (how we define our stack, AWS resources, and software - a.k.a. the oft-repeated phrase “Infrastructure as code”). In our case, it is a bunch of YAML! This is due to our heavy use of Ansible.

[chuck@chuck infra-code]$ find . -name '*.yml' | wc -l
     545
[chuck@chuck infra-code]$ find . -name '*.yml' | xargs wc -l | tail -1
   19659

Why Ansible?

When choosing technologies to power our infrastructure automation, we have some underlying principles that guide our decisions. First and foremost, there must (where feasible) be something resembling revision control. Second, there should be good support (community adoption, availability on different platforms, etc.) for the technology and for how it interacts with the other technologies we use. Third, familiarity with the technology among the people who will use it most often doesn't hurt. After that, domain specific questions come into play.

Since Ansible configuration is just a bunch of plain text files, it is easily thrown into a git or mercurial repo. Ansible has excellent support for the other technologies we employ (AWS, Docker, etc.). Additionally, there was already familiarity with Ansible on both the infrastructure and development teams. This satisfied our base requirements.

Equally important is where and how we will be running our configuration management. The underlying machines that run our software are ephemeral by design. Ansible lends itself well to this model, due to its ability to generate inventory dynamically. Since it is agentless, Ansible only requires a properly secured ssh daemon on the remote host, and it can do all its work from a centralized executor. Different tasks can be compartmentalized into different playbooks. This allows us to run everything from ad-hoc tasks to building our entire AWS infrastructure. A future post will dig further into the details of how some of this happens.
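
As an illustration (a sketch, not necessarily our actual inventory configuration), the aws_ec2 inventory plugin can discover and group those ephemeral hosts with a few lines of - what else - YAML; the tag scheme below is hypothetical:

# inventory/aws_ec2.yml - dynamic inventory sketch
plugin: aws_ec2
regions:
  - us-east-1
filters:
  # only consider instances that are actually running
  instance-state-name: running
keyed_groups:
  # build groups like tag_role_web from a hypothetical "role" tag
  - key: tags.role
    prefix: tag_role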

Tasks, roles, playbooks, and variables

The basic unit of work in Ansible is a task, using pre-packaged modules. You can create and move files, manipulate system resources, install packages and start their services, use a template engine to dynamically create files: typical systems administration stuff. A collection of tasks is combined into a role. An example role could be to install, configure, and start NTP.

[chuck@chuck infra-code/roles/ntpd]$ tree
. # comments not generated by 'tree'
├── handlers            # run when triggered
│   └── main.yml
├── tasks               # in order that they are run
│   ├── main.yml
│   ├── install.yml     # install service
│   ├── config.yml      # use template to generate system specific config
│   └── service.yml     # start service
└── templates
    └── etc
        └── ntp.conf.j2 # template for config
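
To give a feel for the task files themselves, main.yml typically just pulls the others in, and each of those holds a task or two. A rough sketch (not our actual role) might look like:

# roles/ntpd/tasks/main.yml - pull in the task files, in order
- include_tasks: install.yml
- include_tasks: config.yml
- include_tasks: service.yml

# roles/ntpd/tasks/install.yml - a single task using the package module
- name: install the ntp package
  package:
    name: ntp
    state: present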

Finally, a collection of roles makes up a playbook. Playbooks also define which subset of the supplied inventory to target, whether to act in batches (serially and/or in parallel), and how to respond to failures. This is a (truncated) playbook that gathers some variables, then installs, configures, and starts an NTP and ssh server:

---
- hosts: all
  max_fail_percentage: 0
  serial:
    - 1                   # canary
    - '100%'              # the rest

  roles:
    - role: loadvariables # load variables
    - role: ntpd
      params:             # parameters passed to role
        consts:
          servers:
            - '0.amazon.pool.ntp.org'
            - '1.amazon.pool.ntp.org'
            - '2.amazon.pool.ntp.org'
            - '3.amazon.pool.ntp.org'
    - role: ssh
      params:             # parameters passed to role
        env: '{{ env }}'
        secrets: '{{ secrets }}'

One role I'd like to unpack a little bit is loadvariables. A significant portion of our YAML is just configuration. Settings for this app, parameters for that service, definitions for our infrastructure … even SSL keys, passwords, and other secret or sensitive parameters. These last two are stored in Ansible Vault files, an encrypted-at-rest format that is ultimately - you guessed it - YAML.

The loadvariables role will ingest all the YAML files it is expecting to find. This is aggregated data from several different repos. During deployment this will include some common data, system metadata, variables pertinent for that environment, and (deploy time) decrypted secrets. This decryption process will be explained in a subsequent post which discusses some of our approaches to security.
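
As a rough sketch of the idea (the paths and variable names here are hypothetical, and the real role does more than this), include_vars plus a fileglob gets you most of the way:

# roles/loadvariables/tasks/main.yml - hypothetical sketch, not the real role
- name: load common and environment-specific variable files
  include_vars:
    file: "{{ item }}"
  with_fileglob:
    - "vars/common/*.yml"
    - "vars/{{ env }}/*.yml"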

Ansible purists would tend to use something like group_vars. Ours is a bit of technical debt that works really well for us, which makes it hard to justify correcting, since we regularly pull configs from several different sources. We aim for the best solution at the time, and are always improving our approach to clean up issues such as this. Everything is a work in progress.

AWS CloudFormation, Jinja2, and more YAML

Up until this point, the focus has very much been about setting up a system with our specifications, including all required software to be installed, configured, and running (up to and including our own code). This really becomes powerful when we can define and create the actual systems upon which all the software will run.

AWS provides a veritable panoply of services: virtual machines, databases, message queues, storage, DNS - the list goes on. While you can create and modify these services with the web-based console, you'll have little opportunity to review changes, and won't have version control. AWS also provides a fairly powerful command line tool (everything has an API!), as well as an Infrastructure as Code system called CloudFormation: you write templates, which are then applied to define your services.

We have taken many of these templates and wrapped them in Jinja2 (itself a templating language). We chose to do this because it a) lets us bake in some best practices, b) abstracts away parts that would be repetitive, and c) gives us a more powerful templating engine than CloudFormation parameters alone.

We then run Ansible to interpolate YAML variables, then push CloudFormation to AWS to craft what we require. This includes the databases and machines onto which we deploy our software. Abstracting some AWS-isms, this is a section of a Jinjafied CloudFormation template that will create a Redis cluster:

  RedisCluster:
    Type: 'AWS::ElastiCache::ReplicationGroup'
    Properties:
      ReplicationGroupDescription: 'Redis replication cluster'
      Engine: 'redis'
      AutomaticFailoverEnabled: True
      CacheNodeType: '{{ redis.instanceType }}'
      CacheSubnetGroupName: !Ref WSRedisSubnetGroup
      NumCacheClusters: '{{ redis.nodeCount }}'
      Port: '6379'
      SecurityGroupIds:
        {% for group in redis.securityGroups %}
        - '{{ group }}'
        {% endfor %}

Now that we have a template, we can generate a Redis cluster with just a few lines of config. Just craft the following YAML variable namespace, and then apply it to AWS. The results are predictable, alterable with some simple config changes, and change controlled.

---
redis:  
  instanceType: 'cache.m3.medium'
  nodeCount: '2'
  securityGroups:
    - 'Base'
    - 'RedisCluster01'
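
The apply step itself is just more Ansible. A minimal sketch (the stack and file names here are illustrative, not our actual ones) renders the Jinjafied template, then hands the result to the cloudformation module:

# these tasks would typically run against localhost
- name: render the Jinjafied CloudFormation template
  template:
    src: templates/redis.yml.j2
    dest: /tmp/redis-stack.yml

- name: create or update the redis stack
  cloudformation:
    stack_name: redis-cluster-01
    state: present
    template: /tmp/redis-stack.yml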

Conclusion

Our approach has some amazing benefits. We are able to define and spin up local development or staging environments as easily as production. This is a simple matter of slightly different variable definitions. We can scale just by altering some parameters - this works both for our process and the infra itself. We can bake best security practices into our CloudFormation and configuration management. Whatever we boot from our templates gains those things automatically without us having to remember to do it every time.

It's stable because the abstraction allows us to test our configuration-managed instances and infrastructure locally, then against staging. We also have fairly high confidence that an app that runs locally will run the same way in staging, and the same way in production. Well, mostly the same - there are always corner cases where different environments require slightly different workflows. But we work hard to smooth out those edges.

There is one minor drawback: YAML itself. It's human readable and easy to produce, but it's also whitespace sensitive. If you're dealing with Ansible and YAML, do yourself a favour: install yamllint, and integrate it into your development, testing, and deployment workflow!
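
A minimal .yamllint to start from might look something like this (the rule choices are a matter of taste); running yamllint in CI then catches whitespace slips before they ever reach a deploy:

# .yamllint - a gentle starting point
extends: default
rules:
  line-length:
    max: 120
  indentation:
    spaces: 2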

Check in regularly for other posts from the Wealthsimple development teams, and stay tuned for future posts from our Infrastructure team!