Development and operations (DevOps) tools such as Puppet and Chef automate changes to system configurations. Some teams use these tools, and other frameworks, to automate the creation of the entire production Web server environment, sometimes in public services such as Amazon Web Services and sometimes in a local environment.
The problem with automating rollouts is that code has bugs. A configuration change meant for QA (say, to direct users to a test environment) could be propagated to production, leaving users logged into an environment that looks real but will never actually ship products. (Don't laugh too hard: Last month this happened to one of my clients, a multibillion-dollar retail operation moving to customer self-service.)
The good news is that, as new risks emerge, so, too, do new techniques to manage those risks. Here are a few things to think about in any cloud transition.
99 problems, and code defects are one
Amazon's Elastic Compute Cloud has occasional, unpredictable outages. Even without Amazon, if a company uses Chef or Puppet to automate system administration, those tools use code, and that code could have defects.
Here are a few possible problems with a cloud implementation:
- A feature is created in production but disabled by a configuration flag. A programmer turns on the GUI, but the behavior remains "off."
- A private cloud manager designed to roll out new servers over time has a defect in the "reaper" process that turns off old instances.
- Mistakes in the merge process can put test configurations such as databases, server names and URLs into production.
- An API, especially a third-party API, changes after the code "passes" the test environment.
All these problems could appear first in production. In fact, they're likely to first appear in production, with no visible signs in the test environment. A week of phone calls, interviews and a trip to San Diego to discuss this in person at the Software Test Professionals Conference have led me to conclude that there are no easy answers.
A traditional test approach won't find these problems. Instead, the people I interviewed recommended two things: Either change the architecture to reduce risk or monitor, test and (quickly) fix issues in production.
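As one illustration of testing in production, the merge mistake described in the list above could be caught by a check that scans the live configuration for values that look like they belong to a test environment. The Python sketch below is a minimal example under stated assumptions: the marker strings, the configuration keys, the example hostnames and the function name find_test_values are all hypothetical, and none of it comes from Chef, Puppet or any team interviewed here.

```python
# Hypothetical markers suggesting a value belongs to a test environment.
# A real team would choose its own list.
TEST_MARKERS = ("test", "qa", "staging", "localhost")


def find_test_values(config, markers=TEST_MARKERS):
    """Return sorted (key, value) pairs whose value contains a test marker."""
    suspicious = []
    for key, value in config.items():
        text = str(value).lower()
        # Plain substring match: crude, but enough to flag obvious leaks.
        if any(marker in text for marker in markers):
            suspicious.append((key, value))
    return sorted(suspicious)


if __name__ == "__main__":
    # Made-up production config with one test URL merged in by mistake.
    production = {
        "db_host": "db01.prod.example.com",
        "checkout_url": "https://qa.example.com/checkout",
        "cache_host": "cache01.prod.example.com",
    }
    for key, value in find_test_values(production):
        print(f"WARNING: {key} looks like a test value: {value}")
```

Substring matching will produce false positives (a hostname that merely contains "qa", for instance), so a check like this is better suited to raising an alert for a human to review than to blocking a rollout automatically.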
For better software stability, change the architecture
The problem, at least according to Adam Goucher, isn't that cloud computing introduces new risks. Rather, it's that it requires a different kind of thinking. Goucher should know; as a consultant for Selenium, he has presented full-day tutorials on the topic of cloud services, and recently authored Testing for the Cloud for the Pragmatic Bookshelf.
Goucher suggests a culture change from modifying configurations by hand to using automation tools all-out in every direction. "Some people just aren't comfortable with continuous delivery, so they take a piecemeal approach to architecture, borrowing one tool or another," he says. "Build a private cloud with Chef, but do DB migrations by hand, [and] that's going to introduce instability you don't have to introduce."