Error Updating Route53 CNAME

8 April 2015, Rhodri Pugh

We use EC2 for our server infrastructure, and in the current iteration of our environment we couple that with Route53 to provide simple, well-known names for our boxes.

The Implementation

Each server is provisioned with Ansible and has an init script (sketched roughly below the list) that queries the instance metadata and…

  • On startup adds/updates the Route53 CNAME
  • On shutdown deletes the Route53 CNAME
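
The init script itself is nothing fancy. A minimal sketch of its shape, where the helper path and script name are illustrative rather than our real layout:

#!/bin/sh
# Rough sketch of the init script's shape (names and paths are illustrative)
case "$1" in
  start)
    # query instance metadata and add/update the Route53 CNAME
    /usr/local/bin/route53-cname update
    ;;
  stop)
    # remove the CNAME again so the zone stays tidy
    /usr/local/bin/route53-cname delete
    ;;
esac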

Deleting the record isn't essential, but it keeps the zone tidy where possible. Here's a snippet showing the creation of a CNAME…

# Work out the region from the availability zone (drop the trailing zone letter)
availability_zone=$(ec2-metadata -z | cut -d ' ' -f2)
region=${availability_zone%?}

# Look up this instance's Name tag
name=$( ec2-describe-tags --region $region \
                --filter 'resource-type=instance' \
                --filter "resource-id=$(ec2-metadata -i | cut -d ' ' -f2)" \
                --filter "key=Name" | cut -f5 )

# Point <name>.<domain> at the instance's public hostname
route53 change_record $HOSTED_ZONE_ID ${name}.${DOMAIN_NAME} \
            CNAME $(ec2-metadata -p | cut -d' ' -f2) >/dev/null 2>&1

This works great, and in our modest environment it means we can identify and access servers easily.
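
The shutdown half isn't shown above, but it's essentially the mirror image. Assuming the route53 helper here is boto's route53 script, whose del_record subcommand takes the same arguments as change_record, it looks roughly like this:

# On shutdown: same Name tag lookup as above, then remove the CNAME
route53 del_record $HOSTED_ZONE_ID ${name}.${DOMAIN_NAME} \
            CNAME $(ec2-metadata -p | cut -d' ' -f2) >/dev/null 2>&1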

The Problem

The problem we were seeing was that if an instance was stopped and then started (as we occasionally do when we need to change the instance type), the CNAME in Route53 would not be updated; it would remain pointing at the old address.

Now this has never turned out to be a massive problem, as we rarely do this stop/start; usually brand new instances are provisioned and the old ones discarded. So what we've done when it does come up is manually remove the Route53 entry, then SSH to the server in question and run the script to create the CNAME.
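
Concretely, that fix-up is roughly the following (the init script name here is hypothetical; the stale record can be deleted in the Route53 console or with the same route53 helper):

# After deleting the stale CNAME, re-run the start half of the init script
# on the box so it re-registers itself
ssh admin@<instance-public-hostname> 'sudo /etc/init.d/route53-cname start'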

Debugging

I said this has never been much of a problem, but it had gone on for long enough that I finally decided it was time to get it fixed (as well as being an annoying manual process, it was something I didn't understand, which meant that if it ever happened while things were on fire it could add to the problems).

Digging into the init script, I removed the output redirection and started seeing this message coming back from Route53.

<ErrorResponse xmlns="https://route53.amazonaws.com/doc/2013-04-01/">
    <Error>
        <Type>Sender</Type>
        <Code>InvalidChangeBatch</Code>
        <Message>
            Tried to create resource record set
            [name='server.domain.net.', type='CNAME']
            but it already exists
        </Message>
    </Error>
    <RequestId>324hk3-2h34j-234jkl-2j2j22</RequestId>
</ErrorResponse>

This is confusing… it’s complaining that it can’t create the record because it already exists, but the change_record operation is supposed to upsert.
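
For comparison, this is what an explicit upsert looks like against the Route53 API itself, here via the standard AWS CLI rather than the helper the init script uses (the zone ID and hostnames mirror the echoed command further down; the TTL is arbitrary):

# Illustration only: an UPSERT replaces the record's value whether or not it already exists
aws route53 change-resource-record-sets \
    --hosted-zone-id ABC123DEF \
    --change-batch '{
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "server.domain.net.",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": "ec2-52-14-169-98.us-west-2.compute.amazonaws.com"}]
            }
        }]
    }'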

I then echo’d out the actual command that was being run.

route53 change_record ABC123DEF server.domain.net CNAME \
            ec2-52-14-169-98.us-west-2.compute.amazonaws.com

Don’t look ahead, and see if you can spot the problem.

The Solution

Did you get it? Well, it was a guess when I tried it but here’s the correct version…

route53 change_record ABC123DEF server.domain.net. CNAME \
            ec2-52-14-169-98.us-west-2.compute.amazonaws.com

Yup, that’s right - it was a missing period after the server name.

:E

I don’t know if this is expected behaviour (as it works fine on initial creation) or a “bug” in the API, but it was tricky to diagnose, partly because the error message prints the name with the dot suffix, making it look identical to the name we thought we were passing.
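
For reference, back in the init script the fix is just that one extra character on the record name:

# Trailing dot: pass the record name fully qualified, matching what Route53 stores
route53 change_record $HOSTED_ZONE_ID ${name}.${DOMAIN_NAME}. \
            CNAME $(ec2-metadata -p | cut -d' ' -f2) >/dev/null 2>&1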

It’s working nicely now though, and we no longer have to feel that pang of guilt when we change an instance type and manually fix up the DNS :)