Treat infrastructure as code. This is one of the key maxims in DevOps circles. It advocates automating everything and applying the same engineering processes and principles to infrastructure that you would apply to software development. While practicing this principle is extremely important, drift is bound to catch you by surprise when operating at scale. That’s what I learned while building and operating the cloud at work.
Here is the ideal outcome of automating everything: every node in the system is automated, has the right configuration, gets updated in synchrony with the others, and everything is exactly as it is supposed to be, right from day one. But this is a naive view.
Treating infrastructure as code and automating everything does not mean that, in practice, every node agrees with that desire and behaves as told. Though drift is less difficult to control at small scale, when you are operating thousands of nodes of different types in multiple locations, the chances of configuration drift are high, and it takes conscious effort to detect, manage, and mitigate it.
To give an example, the system I deal with is a private cloud deployed across several geographically distributed availability zones. Overall, there are over 30 types of nodes managed through automation. While some of these nodes are stateless, running in clusters behind VIPs, a few are stateful nodes in active/active or active/passive configurations. A large subset of the nodes are hypervisors, with some variation in their configuration to support different types of workloads. As the hypervisors are long-lived, they go through more changes during their lifetime than other types of nodes. A few nodes are software appliances.
All these nodes go through different rates of change made by different teams. Moreover, not all nodes manage their configuration state the same way. Though most load configuration from configuration files, some types of nodes are initialized through databases, and some through API calls.
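When configuration state lives in files, databases, and API responses, comparing a node against its desired state usually starts by normalizing each source into one canonical form. A minimal sketch in Python, where the config keys and values are made up for illustration:

```python
# Hypothetical desired state for a node, as tracked in version control.
DESIRED = {"ntp_server": "10.0.0.1", "log_level": "info"}

def normalize(raw: dict) -> dict:
    """Lower-case keys and stringify values so configs read from files,
    databases, and API responses become directly comparable."""
    return {str(k).lower(): str(v) for k, v in raw.items()}

def diff(desired: dict, actual: dict) -> dict:
    """Return keys whose actual value is missing or differs from desired,
    mapped to (desired, actual) pairs."""
    return {k: (v, actual.get(k)) for k, v in desired.items()
            if actual.get(k) != v}

# Config as read back from a live node (e.g. parsed from a file or API).
actual = normalize({"NTP_Server": "10.0.0.2", "Log_Level": "info"})
print(diff(normalize(DESIRED), actual))
# → {'ntp_server': ('10.0.0.1', '10.0.0.2')}
```

The normalization step is what makes a single diff routine work across node types that store configuration differently.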
Your experience may vary, but in the systems I deal with, drift is a bigger deal than I originally imagined. Here are some of the consequences of drift.
- Bad user experience: A few badly configured nodes out of hundreds or thousands may impact only a few customer interactions. Such silent bugs are hard to detect because they may not show up in overall system metrics and KPIs.
- Incidents waiting to happen: Incompletely or inconsistently applied changes can mask problems and eventually lead to incidents. The specifics vary from system to system, but I have my own stories to tell.
- Impact on time to recovery: Even worse, drift can increase the time to detect failures and the time to recover from them. Most unplanned drift discovery also happens during incidents.
Where does drift come from
There are four key contributors to drift.
Automation gaps and bugs
This is the most natural source of drift. Every automation gap is a potential source of drift. Like regular software development, automation goes through an iterative development process, and gaps are a natural consequence of that iteration.
Moreover, the act of mutating a node’s configuration in place may leave cruft behind, eventually leading to drift. Immutability can help mitigate such drift in some cases, but it is not an answer for everything, and it is expensive to implement in certain cases.
Human error and (bad) habits
This is largely an issue of culture and past habits. Causes in this category include debugging on live systems and ad hoc changes that bypass configuration management and change control. Such drift is likely to remain uncaught until it leads to a noticeable issue or an incident.
Emergency changes during incidents
During incident management, the focus is on time to recovery, not on automation and change control. The act of recovery is likely to introduce drift.
Staggered rollouts
When operating at scale, not every node and not every service is updated at the same time. You may also need to stagger certain changes over weeks or months to leave room to observe, tweak, or even roll back.
How to manage drift
First, don’t deny drift. Acknowledge that drift is a possibility and that automation may be incomplete or buggy.
Second, build tools to regularly audit for drift. Automation is expected to reduce drift, but like most things, automation too may be a work in progress and will have bugs. Use audits to discover the state of drift; awareness is a prerequisite for mitigation.
Third, extend the “measure everything” maxim to include drift. At any given point in time, be able to know what nodes/systems are in drift, and assign severity based on their potential impact. Wire up these metrics to your alerting systems so that the team gets alerted when drift is discovered.
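To turn audit results into metrics, each drifted node can be assigned a severity based on which configuration keys deviated, and the per-severity counts exported to monitoring. A sketch with made-up severity rules and key names:

```python
# Hypothetical rule: drift in these keys is treated as critical.
CRITICAL_KEYS = {"firewall_rules", "auth_backend"}

def severity(drifted_keys):
    """Classify a node's drift: 'critical' if any sensitive key drifted."""
    return "critical" if CRITICAL_KEYS & set(drifted_keys) else "minor"

def drift_metrics(drift_report):
    """Count drifted nodes per severity, ready to export as gauges
    to a monitoring/alerting system."""
    counts = {"critical": 0, "minor": 0}
    for node, keys in drift_report.items():
        counts[severity(keys)] += 1
    return counts

report = {"node-a": ["ntp_server"], "node-b": ["auth_backend", "mtu"]}
print(drift_metrics(report))
# → {'critical': 1, 'minor': 1}
```

An alert rule can then fire whenever the critical gauge is nonzero, so the team hears about high-impact drift without anyone having to go looking for it.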
Fourth, make drift mitigation a planned activity, sprint after sprint. Use drift metrics to track the mitigation progress.
Finally, reward the right habits.