Code the Infra

Saturday, April 13, 2013

Code the infra. There is no other way to make operations predictable and repeatable. The opposite of coding the infra is what I call “box hugging”. If you log into boxes to configure them, install packages, start or stop services, or do any maintenance, you are a box hugger. Coding the infra requires that you treat automation artifacts (shell scripts, Puppet manifests, Fabric scripts, etc.) and configuration as code. If you have no repeatable code to bring bare infra up to a desired operational state, you are a box hugger. Box hugging is a bad habit, and it is bad for the business. It makes recovery from failures time-consuming. It does not scale with needs. Most fat-finger and admin-cockup outages start with box hugging. Sure, it may have worked 100 times, but just one fat-finger mistake is enough to make your team’s life miserable.

There are two steps to cure box hugging. First, internalize the idea that the box you’ve just finished setting up meticulously is going to burst into flames the very next minute. Second, treat operations the same way you would treat software development.

Coding the infra is not hard.

1. Treat infra as ephemeral

Infra is not permanent. It will fail. You can estimate MTBF (mean time between failures) with some assumptions, but failures won’t follow estimates. MTTR (mean time to recovery) is more important than MTBF. When you treat infra as ephemeral, the act of bringing up new infra to a desired operational state becomes a normal and known practice. You ignore the dead nodes, and focus on bringing up new nodes as quickly as possible.
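To see why recovery time carries so much weight, here is a rough sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The numbers are hypothetical and chosen only for illustration; they are not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=1000, mttr_hours=10)             # hand-built recovery
doubled_mtbf = availability(mtbf_hours=2000, mttr_hours=10)         # better hardware
automated_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)  # scripted rebuild

print(f"baseline:           {baseline:.5f}")
print(f"doubled MTBF:       {doubled_mtbf:.5f}")
print(f"automated recovery: {automated_recovery:.5f}")
```

Under these assumed numbers, cutting recovery from ten hours to thirty minutes improves availability more than doubling the time between failures does, and recovery time is the part you control with automation.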

2. Think of system setup as a series of state changes

Start with the basic infra, and apply a sequence of steps to change the state of the infra to bring it to a desired state. The steps could be installing packages, configuring them, starting servers, setting cron jobs and so on. This is no different from most coding exercises — start from a known state, apply some computations, and arrive at a new state.
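As a minimal sketch of this idea, the setup can be modeled as a list of small functions, each taking the current state and returning the next one. The state layout, the `add` and `converge` helpers, and the package and cron entries are all hypothetical names invented for this illustration; a real tool would act on actual machines rather than a dictionary.

```python
from functools import reduce
from typing import Callable, Dict, FrozenSet, List

# A deliberately tiny model of a node: what is installed, running, scheduled.
State = Dict[str, FrozenSet[str]]
Step = Callable[[State], State]


def add(kind: str, item: str) -> Step:
    """Build a step that adds `item` under `kind`, returning a new state."""
    def step(state: State) -> State:
        return {**state, kind: state[kind] | {item}}
    return step


def converge(state: State, steps: List[Step]) -> State:
    """Apply each step in order, feeding the result of one into the next."""
    return reduce(lambda current, step: step(current), steps, state)


# The known starting point: a bare node with nothing on it.
BARE: State = {"packages": frozenset(), "services": frozenset(), "cron": frozenset()}

# The sequence of state changes (hypothetical examples).
STEPS: List[Step] = [
    add("packages", "nginx"),
    add("services", "nginx"),
    add("cron", "0 3 * * * /usr/sbin/logrotate /etc/logrotate.conf"),
]

desired = converge(BARE, STEPS)
```

The starting state is never modified, so the same `STEPS` list can be applied to any number of fresh nodes and will always arrive at the same `desired` state.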

3. Make the steps repeatable

This is like coding any math problem. You first solve it on paper to arrive at an algorithm. Then you code the algorithm so that you can repeat it every time you need to solve the same problem again. It is the same with operational changes. It might seem time-consuming to treat operations this way, but unless you make the steps repeatable through automation, you can’t recover from failures easily. Repeatability is a way of rehearsing recovery. Node died? Cool. Just run the automation to bring up a new node. You’re back in business.

4. Implement idempotency

Repeatability alone is not sufficient when the state changes are numerous. You need to make each state change idempotent. Apply the same change again, and the system should not burn up: the result should be the same as applying it once. Practicing idempotency makes the outcome certain. If something breaks in the middle, you can replay the whole sequence of changes, because you know that each step is idempotent.
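One common way to get idempotency is to check the current state before changing it. The sketch below is an illustration under assumptions, not a prescribed implementation: `ensure_line` is a hypothetical helper, and the file name and setting are made up. A naive version that always appends would add a duplicate line on every run; this one changes the file only when the line is missing.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed, False if it was already
    in the desired state.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


# Demonstration against a throwaway file (hypothetical setting).
demo_file = Path(tempfile.mkdtemp()) / "example.conf"
first = ensure_line(demo_file, "max_connections = 100")   # creates the line
second = ensure_line(demo_file, "max_connections = 100")  # no-op on replay
```

Returning whether anything changed is a design choice that makes replays easy to audit: a second run over a healthy node should report no changes at all.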

5. Review, test and version control

Finally, apply the same engineering rigor to automation artifacts as you would apply to software development — that is, ensure that the automation scripts are peer-reviewed, tested and maintained in source control. There should be no difference.

DevOps is not just about integrating dev and ops; it is about treating operations as development, and development as operations.
