Running ansible at scale

16 Dec 2016

Last week I deployed my first “at scale” playbook. The overall objective was simple: add a new DHCP helper address to about 400 switches. Like most things, though, the devil is in the details. Right off the bat, I ran into issues unrelated to Ansible (TACACS/SSH), which brought the number of devices down to about 270. Not a big number, right? At the most basic level, yes, if I were simply pushing configs using the ios_config module. The playbook did more than that, though.

## Breakdown of the playbook

1. Execute a show run on the device and compile a local backup for each device
2. Run a pre-flight report, specific to the interfaces we are going to impact (multiple SSH sessions per host)
3. Build the configs locally
4. Deploy the configs
5. Validate the configs/unit testing

A bit more about the unit testing:

To test that the changes were deployed, I had two criteria:

1. Assert that the new helpers are present within the interface configurations (of the specific interfaces)
2. Assert that the startup and running configs are in sync (in other words, the new config has been saved)

For assertion 1, I took the approach of running a show run interface for each interface, which meant multiple SSH sessions per host.
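The two assertions can be sketched as plain text checks in Python. This is only an illustration, not the validation role used in the playbook: the function names and the list of "volatile" header lines that get ignored when comparing configs are assumptions, and the helper check only handles the simple `ip helper-address <ip>` form.

```python
import re
from typing import List

# Lines that normally differ between "show run" and "show start" output even
# when the config is saved. This list is an assumption; extend it as needed.
VOLATILE_PREFIXES = (
    "Building configuration",
    "Current configuration",
    "Using ",
    "! Last configuration change",
    "! NVRAM config last updated",
    "ntp clock-period",
)


def helper_present(interface_output: str, helper_ip: str) -> bool:
    """Assertion 1: the helper appears as a complete line in the interface config."""
    pattern = re.compile(
        r"^\s*ip helper-address\s+" + re.escape(helper_ip) + r"\s*$",
        re.MULTILINE,
    )
    return bool(pattern.search(interface_output))


def normalize(config: str) -> List[str]:
    """Drop blank lines and volatile header lines before comparing."""
    lines = []
    for raw in config.splitlines():
        line = raw.rstrip()
        if not line or line.startswith(VOLATILE_PREFIXES):
            continue
        lines.append(line)
    return lines


def configs_in_sync(running: str, startup: str) -> bool:
    """Assertion 2: running and startup configs match once normalized."""
    return normalize(running) == normalize(startup)
```

Matching the whole line matters here: a substring check for `10.1.1.5` would also pass on a line that configures `10.1.1.50`.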

## Observations:

1. Running the playbook for backups results in a lot of SSH connection failures on the first run. Subsequent runs are significantly more successful, though I still see some failures.
2. Running the playbook for the pre-flight report/validation results in SSH timeouts. These are not consistent, even for a single host: the show run int for Vlan101 may work while the one for Vlan201 fails on the first run, and there is no guarantee that a repeat run will reproduce this exact failure.
3. My validation role uses dynamic includes like this example. Running the playbook with a tag other than “validate” still attempts to load all the YAML files and results in failure.

## Tweaks and next steps:

- I had mixed success with the “serial” option. It needs more trial and error.
- Rewrite the pre-flight and validation scripts to do only a single show run, then collect the specific interface details from a local copy.
- Tried pipelining and some other recommendations that seemed relevant based on this. However, they appear to be focused on SSH connections to the remote systems. For the ios_* modules, Ansible runs the module on the local host, and the module then uses paramiko to connect to the device. In short, pipelining did not seem to do much.
- Setting the timeout parameter for the ios_* modules seemed to have no effect on the SSH timeouts.
- To further understand observation 1 (which is still the most vexing one), I tried connecting to the device using paramiko from the Python interpreter and executing the same commands. I could not recreate the issue. I had good connectivity each time.
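The second tweak above (one show run per host, with interface details read from the local copy) could look roughly like the sketch below. It assumes the backup is plain IOS-style text in which each `interface ...` line is followed by indented sub-commands. The function names are hypothetical and not part of the playbook.

```python
from typing import Dict, List


def interface_blocks(config_text: str) -> Dict[str, List[str]]:
    """Split a saved running config into {interface name: [sub-commands]}."""
    blocks: Dict[str, List[str]] = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if line.startswith("interface "):
            # Start of a new interface section.
            current = line[len("interface "):].strip()
            blocks[current] = []
        elif current is not None and line.startswith(" "):
            # Indented line: belongs to the interface we are inside.
            blocks[current].append(line.strip())
        else:
            # "!" or any other top-level line ends the section.
            current = None
    return blocks


def helpers_for(blocks: Dict[str, List[str]], interface: str) -> List[str]:
    """Return the helper addresses configured on one interface."""
    return [
        cmd.split()[-1]
        for cmd in blocks.get(interface, [])
        if cmd.startswith("ip helper-address ")
    ]
```

With this in place, the pre-flight report and the validation step can both work from the backup file written in step 1, so each host needs one SSH session instead of one per interface.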

## Other issues:

For observation 3, I opened an issue with Ansible. Based on the comments it was closed with, I guess this is expected behavior. That means that for any playbook with dynamic includes, we need to remember to pass any variables the role will need, even when the tags in use do not call that role.
