Documenting Jello: How we automated our infrastructure documentation.

When I first started with Trōv, we had a very small infrastructure footprint, so things were easy to find and explain. That has changed over time, and continues to change on an almost daily basis. For one, we settled on AWS as our cloud infrastructure provider. Our footprint there soon grew from a few EC2 instances in a single region to dozens of instances for multiple environments (Development, Staging, etc.) across multiple AWS regions.

Aligning with our CTO's mantra “It doesn’t exist unless it is written down”, we needed to document our infrastructure in a way that was easy to maintain, and one that would be flexible enough to change as our requirements, or our infrastructure itself, changed.

We started out using Visio to make some pretty network diagrams, which was great at first. Except that our infrastructure is fairly fluid, so they quickly became outdated. They also lacked any sense of being connected with our actual infrastructure. What I mean by that is, I wanted our network documentation to look as if someone had walked into our "datacenter" and taken a picture of all of our servers. Every few minutes. That tells people that what they are looking at actually exists right now, not 8 weeks ago when the static (aka stale) document was last edited.

One of my more recent role acquisitions here at Trōv is that of automation. I love looking at a problem, or an existing (manual) solution, and finding a way to automate it. I firmly believe that we should always be striving to automate manual processes whenever and wherever we can.

So what is an automation guy doing manually editing network documents? Good question. So I went looking for a solution, one that would document our network, with little if any intervention. Any of you who have looked for this kind of solution know that there are very few out there. One by one they failed to meet the requirements I had set, namely:

  • Don't take more time to set up and administer than the current process.
  • Be flexible and configurable enough to change quickly and easily.
  • Don't add unnecessary complexity.
  • Be completely automatable.

To be fair to some of the software I tested, our infrastructure is all virtual now, and we have no real need to document the standard L1/L2/L3 layers. That may be why many of the tools simply didn't know how to visualize what we have. Be that as it may, I still needed to document our environments and had found nothing that did it the way I wanted.

So, convinced I wasn't about to reinvent the wheel, I set about writing something that could go out to any of our environments in AWS and document them all. And thus was born Document.AWS. So what does it do differently from the traditional Visio documents?

For one thing, it is near real-time (I currently have ours set to refresh every 5 minutes). This ensures that any employee looking at the document sees a current representation of our infrastructure. If they heard of a new Mongo replica set in Staging in their stand-up today, they can see it right away on the network document, instead of waiting on a manual update.

An obvious one is that it is automated, taking the human factor out of the equation. No more chances of me putting the wrong machine in the wrong subnet, or making a typo on the IP addresses of a few of the instances. It is completely automated and correct, and changes as the AWS environment does.
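To illustrate why reading placement straight from AWS removes that class of mistake, here is a minimal sketch in Python. It is not Document.AWS's actual code. It assumes the input is shaped like the `Reservations` list that the EC2 DescribeInstances API returns, and the function name `group_by_subnet` and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def group_by_subnet(reservations):
    """Group (instance id, private IP) pairs under the subnet they live in.

    `reservations` is assumed to look like the "Reservations" list in an
    EC2 DescribeInstances response. The function name is hypothetical.
    """
    grouped = defaultdict(list)
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            # Instances outside a VPC carry no SubnetId, so use a placeholder.
            subnet_id = instance.get("SubnetId", "no-subnet")
            grouped[subnet_id].append(
                (instance["InstanceId"], instance.get("PrivateIpAddress", ""))
            )
    # Sort subnets and their members so every refresh renders in the same order.
    return {subnet: sorted(members) for subnet, members in sorted(grouped.items())}


# Made-up sample data in the DescribeInstances shape.
sample_reservations = [
    {"Instances": [
        {"InstanceId": "i-0b", "SubnetId": "subnet-a", "PrivateIpAddress": "10.0.1.11"},
        {"InstanceId": "i-0a", "SubnetId": "subnet-a", "PrivateIpAddress": "10.0.1.10"},
    ]},
    {"Instances": [
        {"InstanceId": "i-0c", "SubnetId": "subnet-b", "PrivateIpAddress": "10.0.2.10"},
    ]},
]

if __name__ == "__main__":
    for subnet, members in group_by_subnet(sample_reservations).items():
        print(subnet, members)
```

Because the subnet and IP address come from the API response rather than from someone's keyboard, an instance can only show up where AWS says it is.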

Another advantage is, it frees me up to do other things. Unless I have any plans to extend or reconfigure it as our requirements change, I spend no time at all documenting our networks anymore. And the time spent configuring or extending it is much less than the manual process of documenting.

OK, that's all great and everything, but what does it look like? First I created a simple index page that contains a map of the world with an image map of AWS regions (all AWS regions are listed, feel free to remove any you don't use):

Here is a look at some sample data, including the tooltip text that contains all the Security Group rules for the instance in question:
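As a rough idea of how tooltip text like this could be assembled, the sketch below turns Security Group rules into a few readable lines. It is an assumption-laden example in Python, not the code Document.AWS uses: the input is assumed to follow the `SecurityGroups` shape of the EC2 DescribeSecurityGroups API, only inbound rules are covered, and the names `format_rule` and `tooltip_text` as well as the sample group IDs are hypothetical.

```python
def format_rule(permission):
    """Render one inbound rule as '<what> from <sources>'."""
    protocol = permission.get("IpProtocol", "-1")
    from_port = permission.get("FromPort")
    to_port = permission.get("ToPort")
    if protocol == "-1":
        # "-1" is how EC2 expresses "all protocols".
        what = "all traffic"
    elif from_port is None:
        what = f"{protocol} all ports"
    elif from_port == to_port:
        what = f"{protocol} port {from_port}"
    else:
        what = f"{protocol} ports {from_port}-{to_port}"
    # A rule can allow CIDR ranges, other security groups, or both.
    sources = [r["CidrIp"] for r in permission.get("IpRanges", [])]
    sources += [p["GroupId"] for p in permission.get("UserIdGroupPairs", [])]
    return f"{what} from {', '.join(sources) if sources else 'no sources'}"


def tooltip_text(security_groups):
    """Build multi-line tooltip text: one header per group, one line per rule."""
    lines = []
    for group in security_groups:
        lines.append(f"{group['GroupName']} ({group['GroupId']})")
        for permission in group.get("IpPermissions", []):
            lines.append("  " + format_rule(permission))
    return "\n".join(lines)


# Made-up sample data in the DescribeSecurityGroups shape.
sample_groups = [
    {
        "GroupId": "sg-0123",
        "GroupName": "web",
        "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "UserIdGroupPairs": []},
            {"IpProtocol": "-1", "IpRanges": [],
             "UserIdGroupPairs": [{"GroupId": "sg-0456"}]},
        ],
    }
]

if __name__ == "__main__":
    print(tooltip_text(sample_groups))
```

Keeping the formatting in small pure functions like these means the same text could feed an HTML `title` attribute or any other tooltip mechanism.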

There are some things it doesn't do yet, like linking subnets/instances in the manner of a traditional network diagram. I think the grouping by subnets is enough, for us anyway. So I'm not sure I'll be working on that anytime soon, but if it is something you'd like to see, feel free to put in a PR for it! I'm sure there are lots of other things it doesn't do yet (or perhaps doesn't do well), but we needed to start somewhere and Document.AWS is that starting point. So jump over to GitHub and join me in making it even better!