DevOps, Postmortems and Cloud Spend
Thursday, July 27, 2017
As I wrote previously here, here, and most recently here, I’m a strong advocate and practitioner of “cost awareness as part of DevOps culture”. Cost-related activities like choosing an architecture with cost in mind, forecasting for scale, and optimizing to reduce waste are among the activities DevOps teams need to conduct for autonomous teams to succeed.
In this culture, spikes or unexpected patterns in cloud spend are like production incidents. What do you do when you have a production incident? Once you restore the system, you conduct a postmortem: ask the whys, make observations, note lessons learned, and take corrective actions for the future.
That’s exactly what we did yesterday as we were analyzing numbers for a prior month. We observed that spend was not in line with what we expected, so we conducted a postmortem, asking the whys. Here is a simplified version.
Issue: We spent more than we expected by a certain amount.
- Ratio of reserved (EC2 and non-EC2 combined) to on-demand instances: fell from 71% to 64%
- Utilization of reserved instances: remained at 96%
- Volume of compute: increased as expected with the forecast
- Price per unit of compute (an aggregate metric to spot trends): increased from $x to $y
- Compute vs. network costs: marginally increased, in line with the forecast
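Aggregate metrics like these can be derived from billing line items. A minimal sketch of the arithmetic, assuming hypothetical field names (this is not an actual cloud-billing export schema):

```python
# Sketch: deriving the reserved-to-on-demand ratio and price per unit of
# compute from billing line items. The record fields (pricing_model,
# instance_hours, cost) are illustrative assumptions, not a real schema.

line_items = [
    {"pricing_model": "reserved", "instance_hours": 7100, "cost": 2130.0},
    {"pricing_model": "on_demand", "instance_hours": 2900, "cost": 1450.0},
]

reserved_hours = sum(i["instance_hours"] for i in line_items
                     if i["pricing_model"] == "reserved")
total_hours = sum(i["instance_hours"] for i in line_items)
total_cost = sum(i["cost"] for i in line_items)

# Share of compute hours covered by reservations (the 71% -> 64% metric).
reserved_ratio = reserved_hours / total_hours

# Aggregate $/instance-hour, useful only for spotting trends month over month.
price_per_unit = total_cost / total_hours

print(f"reserved ratio: {reserved_ratio:.0%}")
print(f"price per instance-hour: ${price_per_unit:.4f}")
```

Tracking the aggregate price per unit is deliberately crude: it hides the mix of instance types, but a jump from $x to $y is a cheap early signal that the mix has shifted.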
1. Why did the ratio of reserved to on-demand instances fall?
Because we ran out of reserved instances for certain instance types, and paid the on-demand price.
2. Why did the reserved instance utilization remain the same?
Probably because some teams switched instance types.
3. What are the most expensive instance types in the month?
4. What is the reservation coverage and utilization of the most expensive instance type?
39% coverage, and 100% utilization.
5. Which teams use that instance type, when did they switch, and what were they using before?
Team X switched mid-month from another instance type that had 74% coverage and 100% utilization.
6. Why did we not observe this sooner to take corrective action?
Due to billing delays, our weekly review cycle did not spot the increase. Since the purchasing cycle is also time-consuming, we could not take corrective action in time to influence the current month.
7. Why did those teams change from instance types that have more coverage to instance types that have less coverage?
They didn’t know. One of the teams thought they were saving money by switching instance types while getting the better CPU-to-memory ratio they needed. They didn’t have access to the reservation pool, and even those with access to the data could not tell if or how their usage had an impact on the overall reservation pool.
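The two numbers the postmortem leans on, coverage and utilization, are easy to conflate. A minimal sketch of how they differ, using made-up hour counts (the function and its inputs are illustrative, not a billing API):

```python
# Sketch: per-instance-type reservation coverage vs. utilization.
# Coverage: what share of actual usage was covered by reservations.
# Utilization: what share of owned reservation hours was actually used.
# The numbers below are hypothetical, chosen to mirror the postmortem.

def coverage_and_utilization(reserved_hours_owned: float,
                             reserved_hours_used: float,
                             total_hours_run: float) -> tuple[float, float]:
    coverage = reserved_hours_used / total_hours_run
    utilization = reserved_hours_used / reserved_hours_owned
    return coverage, utilization

# Team X's new instance type: reservations fully used, but they cover
# only a fraction of total usage -- the rest is billed on demand.
cov, util = coverage_and_utilization(
    reserved_hours_owned=390, reserved_hours_used=390, total_hours_run=1000)
print(f"coverage {cov:.0%}, utilization {util:.0%}")  # coverage 39%, utilization 100%
```

This is why "100% utilization" alone is reassuring but misleading: a type can have its reservations fully used while most of its usage still runs at the on-demand price.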
We still had more questions and some hypotheses to validate, but we also found some smoking guns that could explain the change. What did we learn from this?
- On any cloud, as a team using cloud services, you are still responsible for designing for cost, forecasting it, and optimizing it.
- The shorter the feedback loop between spend and the teams creating and running software, the better. Our current feedback loops are not short enough.
- Saving money through reserved instance purchases is a complex game, and you can only play it for so long. There are not enough tools in the world to play this game efficiently forever.
- Regardless of reserved instances, the portfolio of services on public clouds is still evolving. The pricing models have room to evolve.
- As with phone and cable bills, price complexity helps the provider, not the consumer. Whenever you look at these bills, you ask: am I paying more than I should? Am I subscribing to services I don’t need? Definitely not a good feeling.
- Stringent measures like governance committees and capacity approval boards are not an option. They spoil the culture.
- Simplicity, where are you?