Turns out I was a little rusty on how Ansible’s become works, and on how to apply it to the Ubuntu image I was using on EC2 instances.

In this post, I was using this command:

sudo awk '/root\s+ALL=\(ALL:ALL\) ALL/ {print; print "ubuntu ALL=(ALL) NOPASSWD:ALL"; next}1' /etc/sudoers | sudo tee /etc/sudoers

to give myself passwordless sudo for Ansible. Two notable things:

  • I was running this using the User Data setting, which runs scripts on an instance’s first boot.
  • The script was failing intermittently.
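For what it’s worth, the command above contains a race that may explain the intermittent failures. The shell starts awk and tee at the same time, and tee truncates /etc/sudoers as soon as it opens it. If tee gets there first, awk reads an empty file and an empty sudoers gets written back. A safer pattern writes to a temporary file, validates it, and only then replaces the original. This is a minimal sketch assuming a POSIX shell; insert_after_match is a hypothetical helper name, not something from my original setup.

```shell
#!/bin/sh
# Sketch: insert a line after each line matching a pattern, going through a
# temp file instead of piping the file back onto itself with tee.
# insert_after_match is a hypothetical helper, not part of any tool.
set -eu

insert_after_match() {
  file=$1              # file to edit, e.g. /etc/sudoers
  pattern=$2           # extended regex matching the anchor line(s)
  new_line=$3          # line to insert after each match
  validate=${4:-true}  # check run on the candidate file, e.g. "visudo -cf"

  tmp=$(mktemp)
  # awk finishes reading $file before anything is written back to it.
  # Values go through the environment because awk -v would interpret
  # backslash escapes in them.
  PAT="$pattern" ADD="$new_line" awk '
    $0 ~ ENVIRON["PAT"] { print; print ENVIRON["ADD"]; next }
    { print }
  ' "$file" > "$tmp"

  # $validate is unquoted on purpose so "visudo -cf" splits into command + flag.
  if ! $validate "$tmp"; then
    echo "validation failed; $file left unchanged" >&2
    rm -f "$tmp"
    return 1
  fi

  # Copy the contents over the original so its owner and mode stay the same.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (as root, e.g. from User Data, which already runs as root):
# insert_after_match /etc/sudoers '^root[[:space:]]+ALL=\(ALL:ALL\) ALL' \
#   'ubuntu ALL=(ALL) NOPASSWD:ALL' 'visudo -cf'
```

Because the original is only overwritten after awk has finished and the candidate has passed the check, a failure leaves the existing sudoers in place instead of an empty one.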

I’ll get to the fix first:

The Ubuntu distribution provided on AWS already has a user, “ubuntu”, with passwordless sudo! I wasn’t setting up the Ansible configuration correctly to use it, but I learned some things along the way.

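For sudo commands, I merely needed to connect as the ubuntu user and turn on become in my tasks. become_user defaults to root, which is what sudo commands need, so it doesn’t have to be set (setting it to ubuntu would just run the task as ubuntu again):

become: true
remote_user: ubuntu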

I had two reasons for my mistake.

First, I wanted to play around with User Data for EC2 instances, even though I suspected that running this particular script that way could be problematic. Normally, you should check changes to /etc/sudoers with visudo -c before they take effect. Even if you do that, when the command runs from User Data there isn’t much you can do with a failure beyond logging it. In Ansible, you could do something like this:

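tasks:
  - name: Add passwordless sudo for the sudo group
    become: true
    ansible.builtin.lineinfile:
      path: /etc/sudoers
      regexp: '^%sudo'
      line: '%sudo ALL=(ALL) NOPASSWD: ALL'
      state: present
      # visudo checks the candidate file; /etc/sudoers is only replaced if it passes
      validate: 'visudo -cf %s'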

If validation fails, the task fails without replacing /etc/sudoers, and you find out as soon as you run the playbook, so you can fix the problem.
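If you’d rather stay with User Data, a gentler option than rewriting /etc/sudoers is a drop-in file under /etc/sudoers.d, which Ubuntu’s default sudoers already includes. (On the AWS Ubuntu images, cloud-init typically grants the ubuntu user its passwordless sudo this way, in /etc/sudoers.d/90-cloud-init-users.) This is a minimal sketch that validates the rule with visudo -cf and logs a failure to stderr, which is about all a boot-time script can do. The function name and destination file name are hypothetical; note that sudo skips files in /etc/sudoers.d whose names contain a “.” or end in “~”.

```shell
#!/bin/sh
# Sketch: add a sudo rule as a drop-in file instead of editing /etc/sudoers,
# validating it first and logging any failure, since nobody is watching a
# User Data script when it fails. install_sudo_dropin is hypothetical.
set -eu

install_sudo_dropin() {
  rule=$1                    # e.g. 'ubuntu ALL=(ALL) NOPASSWD:ALL'
  dest=$2                    # e.g. /etc/sudoers.d/91-passwordless (no dots!)
  validate=${3:-visudo -cf}  # syntax check; overridable for testing

  tmp=$(mktemp)
  printf '%s\n' "$rule" > "$tmp"
  chmod 0440 "$tmp"          # the conventional mode for sudoers files

  # $validate is unquoted on purpose so "visudo -cf" splits into command + flag.
  if $validate "$tmp"; then
    mv "$tmp" "$dest"
  else
    # From User Data, stderr should end up in /var/log/cloud-init-output.log.
    echo "sudoers drop-in rejected; $dest not installed" >&2
    rm -f "$tmp"
    return 1
  fi
}

# Example (as root, e.g. from User Data):
# install_sudo_dropin 'ubuntu ALL=(ALL) NOPASSWD:ALL' /etc/sudoers.d/91-passwordless
```

Keeping the rule in its own file means a bad rule never touches the main sudoers, and removing it later is a single rm.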

Second, I’m pretty used to doing privilege escalation with a password and an admin user! I’ve always been a little uncomfortable with not using passwords and essentially just relying on an SSH key, but it’s a lot easier to work with on throwaway instances.

Logging User Data Failures

At first, I planned to debug why my User Data was failing intermittently. On the instance itself, cloud-init (which is what runs User Data on Ubuntu) logs startup output here:

/var/log/cloud-init.log
/var/log/cloud-init-output.log

From the instance dashboard in the console, you can also go to “Actions” -> “Monitor and troubleshoot” -> “Get system log” to view the instance’s console output without logging in.
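Once you’re logged in to the instance, a quick way to skim both log files for trouble is to grep them. This is a rough sketch: userdata_problems is a hypothetical name, and the keyword list is a guess at what’s worth flagging, not a format cloud-init guarantees.

```shell
#!/bin/sh
# Sketch: print lines from the cloud-init logs that look like problems.
# The keywords are a heuristic guess, not an official log format.
set -eu

userdata_problems() {
  # With no arguments, default to the two log files cloud-init writes.
  if [ "$#" -eq 0 ]; then
    set -- /var/log/cloud-init.log /var/log/cloud-init-output.log
  fi
  # -H prefixes the file name, -n the line number, -i ignores case.
  # "|| true" keeps "no matches" (or a missing file) from aborting the script.
  grep -Hni -E 'error|fail|traceback|warn' "$@" || true
}

# Example (on the instance; reading /var/log may need sudo):
# userdata_problems
```

cloud-init also has its own summary: cloud-init status --long reports its overall status and any errors, and cloud-init status --wait blocks until it has finished. From your own machine, aws ec2 get-console-output --instance-id <instance-id> --output text is the CLI counterpart of “Get system log”.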

In the end, I didn’t debug my User Data failures, since I found the better solution above.

I did notice more failures when I was on a pretty bad wifi connection, but the problems persisted on a good connection as well. Since this didn’t happen on every playbook run, I thought that User Data might be applied concurrently with other provisioning. If so, perhaps one of them would “win”, which would mean that sometimes User Data would succeed and sometimes it would fail. I will save that investigation for another day!