This post covers how to achieve zero-downtime updates of an AMI in an AWS Auto Scaling Group using Ansible. At Opendoor, we use Convox, ECS, and Docker for most of our backend services, but that stack isn't a perfect fit for all our use cases.

Once we decided that we wanted to deploy via AMI, we couldn't find a well-documented method for doing a rolling update of an AMI with Ansible that fit our requirements:

  • Zero downtime update
  • Detecting an existing Launch Configuration
  • Detecting if the AMI needs to be updated
  • Deleting the old Launch Configuration
  • Creating or updating the Auto Scaling Group with the new Launch Configuration
  • Preserving the same number of instances running during the update

Why an AMI

Opendoor is built on top of AWS. Without going into too much detail, we have a microservice which is powered by ~5GB of geodata. Data-intensive services are among the trickiest things to deploy in modern systems -- shipping the data inside a Docker container was not a maintainable solution, and neither was downloading it at runtime. The workload on this service is highly uneven, and it needs to scale quickly when large batch jobs hit it. Our problem was a perfect fit for an AWS Auto Scaling Group and a pre-built AMI.

Why Ansible

We were already using Ansible to provision our AWS infrastructure and configure hosts. Ansible has pretty good support for provisioning AWS primitives through the AWS APIs. We considered Terraform, but it can't discover existing resources unless you already describe all of your resources in Terraform.

Building an AMI

Luckily, tools like Packer make it easy to build an AMI using different kinds of provisioners, such as shell scripts or Ansible. This was perfect for us because we were already using Ansible for other provisioning tasks. To build an AMI with Packer and Ansible, all you need to do is create a JSON configuration file for Packer like the one below:

{
  "variables": {
      "aws_access_key": "",
      "aws_secret_key": ""
  },
  "builders": [{
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "us-east-1",
      "source_ami": "ami-4fc1c025",
      "instance_type": "t2.small",
      "ssh_username": "ubuntu",
      "ami_name": "coolapp {{timestamp}}",
      "iam_instance_profile": "your_instance_iam_profile",
      "run_tags": {"Name": "coolapp"},
      "run_volume_tags": {"Name": "coolapp"},
      "tags": {"Name": "coolapp"}

  }],
  "provisioners": [
      {
          "type": "shell",
          "inline": ["sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 6125E2A8C77F2818FB7BD15B93C4A3FD7BB9C367",
                     "sudo bash -c \"echo 'deb http://ppa.launchpad.net/ansible/ansible/ubuntu vivid main' > /etc/apt/sources.list.d/ansible.list\"",
                     "sudo apt-get update",
                     "sudo apt-get -y install ansible"]
      },
      {
          "type": "ansible-local",
          "playbook_file": "{{ template_dir }}/playbook.yml",
      }
  ]
}
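With the template saved (here assumed to be named `template.json`, with credentials taken from the standard AWS environment variables), the image is built with `packer build`. Passing `-machine-readable` makes the output easy to parse, so the new AMI ID can be captured for use in the playbook later; the `awk`/`cut` pipeline below is one way to do that:

```shell
# Build the AMI (requires Packer and AWS credentials in the environment).
# -machine-readable emits comma-separated records that are easy to parse.
packer build -machine-readable \
  -var "aws_access_key=$AWS_ACCESS_KEY_ID" \
  -var "aws_secret_key=$AWS_SECRET_ACCESS_KEY" \
  template.json | tee build.log

# The artifact record looks like:
#   1453827884,amazon-ebs,artifact,0,id,us-east-1:ami-0abc1234
# so the region-qualified AMI ID is field 6; cut strips the region prefix.
ami_id=$(awk -F, '$3 == "artifact" && $5 == "id" {print $6}' build.log | cut -d: -f2)
echo "new AMI: $ami_id"
```

The extracted `ami_id` is what you'd pass to the playbook's `AMI` variable below.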

Zero Downtime Update

When you want zero downtime, you need to ensure that traffic isn't disrupted and that server capacity is preserved. There are different strategies for zero-downtime deployment, and in more complex scenarios you'll need to set up a duplicate stack before switching the traffic over in your load balancer. This strategy is usually referred to as Blue-Green Deployment, popularized by Martin Fowler's article.

In our case, our application didn't depend on a database or any data migration, so we could adopt an incremental rolling update. AWS doesn't terminate existing instances when you update the Launch Configuration of an Auto Scaling Group, so you'll need to do it yourself. The animated gif below shows how we expect to update our currently running instances.

[animated gif: rolling update of instances behind the ELB]
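To make the expectation concrete, here's a minimal, hypothetical simulation of the launch-before-terminate loop (this is not how the Ansible module is implemented internally, just the invariant we care about: serving capacity never drops below the starting fleet size):

```python
def rolling_replace(old_instances, new_ami):
    """Replace instances one at a time, launching each replacement before
    terminating an old instance, so capacity never dips below the start."""
    fleet = list(old_instances)
    min_capacity = len(fleet)
    for old in old_instances:
        fleet.append(new_ami)   # launch the replacement first
        # (against real AWS: wait here until the ELB reports it InService)
        fleet.remove(old)       # only then terminate the old instance
        min_capacity = min(min_capacity, len(fleet))
    return fleet, min_capacity

fleet, floor = rolling_replace(["ami-old"] * 3, "ami-new")
print(fleet, floor)  # ['ami-new', 'ami-new', 'ami-new'] 3
```

The capacity floor stays at 3 throughout: each step briefly runs four instances, never two.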

Ansible provides options for rolling updates of AMIs, but it lacked some features to inspect the currently running Launch Configuration or Auto Scaling Group. We needed to delete the old Launch Configuration and keep the current number of instances running, so we opted to call the AWS CLI from within Ansible instead. We wrote the following playbook to do a rolling update of instances. It should be generic enough to help you get started.

- name: "Create Launch Configuration, ELB, Auto-Scaling group and rolling update of AMI"
  hosts: localhost
  tags:
    - aws
  vars:
    - AMI: ami-xxxxxx
    - region: us-east-1
    - app: "coolapp"
    - security_groups:
        - sg-xxxxx
    - subnets:
        - subnet-xxxxx
    - new_lc: "{{app}}-{{ AMI }}"
    - min_instances: 2
    - desired_instances: 2
    - max_instances: 4
    - instances_type: t2.small
    - iam_profile: "your_instance_iam_profile"
  tasks:

    - name: Get all the Auto Scaling Groups from awscli
      shell: aws autoscaling describe-auto-scaling-groups
      register: asg_output

    - name: Get all the instances handled by Auto Scaling Groups from awscli
      shell: aws autoscaling describe-auto-scaling-instances
      register: asg_instances_output

    - name: Initialize the current Launch Configuration and Auto Scaling Group to null
      set_fact:
        current_lc: null # this is necessary for the first update, when the ASG doesn't exist yet
        current_asg: null
        running_instances: 0

    - name: Find all the matching Auto Scaling Groups
      set_fact:
        asg_list: "{{ (asg_output.stdout
                        | from_json)['AutoScalingGroups']
                        | selectattr('AutoScalingGroupName', 'equalto', app)
                        | list }}"

    - name: Find current Auto Scaling Group and Launch Configuration
      set_fact:
        current_asg: "{{ (asg_list | first)['AutoScalingGroupName'] }}"
        current_lc: "{{ (asg_list | first)['LaunchConfigurationName'] }}"
      when: asg_list | length > 0   

    - name: Check if Launch Configuration needs to be updated
      set_fact:
         update_lc: "{{ new_lc != current_lc }}"

    - name: Get Running instances in the Current Auto Scaling Group
      set_fact:
         running_instances: "{{ (asg_instances_output.stdout
                                 | from_json)['AutoScalingInstances']
                                 | selectattr('AutoScalingGroupName', 'equalto', current_asg)
                                 | list
                                 | length }}"
      when: current_asg and update_lc

    - name: Update desired instances if more instances are currently running
      set_fact:
         desired_instances: "{{ running_instances }}"
      when: (running_instances | int > desired_instances | int) and update_lc

    - name: Create or Update Elastic Load Balancer
      ec2_elb_lb:
        region: us-east-1
        name: "{{ app }}"
        state: present
        connection_draining_timeout: 120
        scheme: internal
        security_group_ids: "{{ security_groups }}"
        subnets: "{{ subnets }}"
        listeners:
          - protocol: http
            load_balancer_port: 80
            instance_port: 80
        health_check:
          ping_protocol: tcp
          ping_port: 80
          response_timeout: 29
          interval: 30
          unhealthy_threshold: 4
          healthy_threshold: 2
    - name: Create Launch Configuration
      ec2_lc:
        name: "{{ new_lc }}"
        region: "{{ region }}"
        image_id: "{{ AMI }}"
        security_groups: "{{ security_groups }}"
        instance_type: "{{ instances_type }}"
        instance_profile_name: "{{ iam_profile }}"
      when: update_lc
    - name: Create or Update Auto Scaling Group
      ec2_asg:
        name: "{{ app }}"
        region: "{{ region }}"
        launch_config_name: "{{ new_lc }}"
        health_check_period: 300
        health_check_type: ELB
        desired_capacity: "{{ desired_instances }}"
        min_size: "{{ min_instances }}"
        max_size: "{{ max_instances }}"
        vpc_zone_identifier: "{{ subnets }}"
        load_balancers: "{{ app }}"
        replace_all_instances: true
        wait_for_instances: true
      when: update_lc
    - name: Delete Old Launch Configuration
      ec2_lc:
        name: "{{current_lc}}"
        state: absent
        region: "{{ region }}"
      when: update_lc and current_lc # current_lc is now the old launch config
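The describe/compare logic in the first half of the playbook can be sketched in plain Python. The function below is a hypothetical helper (not part of the playbook) that mirrors the `set_fact` tasks: it takes the raw awscli JSON, decides whether the Launch Configuration changed, and computes a desired capacity that preserves the running fleet:

```python
import json

def plan_update(asg_json, instances_json, app, new_lc, desired_instances):
    """Mirror the playbook's set_fact logic: find the ASG named `app`,
    decide whether the Launch Configuration must be replaced, and keep
    the desired capacity at least as high as the running instance count."""
    groups = json.loads(asg_json)["AutoScalingGroups"]
    matching = [g for g in groups if g["AutoScalingGroupName"] == app]
    current_lc = matching[0]["LaunchConfigurationName"] if matching else None
    update_lc = new_lc != current_lc
    running = 0
    if matching and update_lc:
        instances = json.loads(instances_json)["AutoScalingInstances"]
        running = sum(1 for i in instances
                      if i["AutoScalingGroupName"] == app)
    return update_lc, max(running, desired_instances)

# Sample awscli-shaped output: one ASG on the old config with 3 instances.
asg = json.dumps({"AutoScalingGroups": [
    {"AutoScalingGroupName": "coolapp",
     "LaunchConfigurationName": "coolapp-ami-old"}]})
inst = json.dumps({"AutoScalingInstances": [
    {"AutoScalingGroupName": "coolapp"}] * 3})

print(plan_update(asg, inst, "coolapp", "coolapp-ami-new", 2))  # (True, 3)
```

Note that when no matching ASG exists, `current_lc` is `None`, so the update proceeds with the configured desired capacity, which is exactly what the null-initialization task in the playbook guarantees.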

In conclusion, if you have a use case that doesn't depend on any data migration, this method works pretty well. The downside is that, depending on the number of instances, the update can take some time, since every instance is replaced one by one and must appear healthy in the ELB before the next replacement.

If you're interested in working with our team on any of this, we're currently hiring!

If you have any feedback or you find any issues in this post, let us know.