Thursday, May 26, 2011

"Chaos Monkey" helping keep Cloud Services running

Recent hot news: Netflix has become the #1 user of internet bandwidth.
Other recent news: one of the data centers behind Amazon Web Services EC2,
otherwise a great service, was down for a few DAYS.
Netflix apparently runs on AWS, yet the outage didn't slow its hunger for bandwidth.
How?

"Chaos Monkey" is helping them...
Coding Horror: Working with the Chaos Monkey


We’ve sometimes referred to the Netflix software architecture in AWS as our Rambo Architecture. Each system has to be able to succeed, no matter what, even all on its own. We’re designing each distributed system to expect and tolerate failure from other systems on which it depends.
...
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage...
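
To make the idea concrete, here is a toy sketch in Python. It is not Netflix's actual tool: the names Cluster, chaos_monkey and survives are made up for illustration, and the "cluster" is just an in-memory dictionary rather than real AWS instances. It only shows the principle of killing random instances and then checking whether every service still has something running.

```python
import random


class Cluster:
    """Toy in-memory cluster (hypothetical): service name -> running instance ids."""

    def __init__(self, replicas_per_service):
        self.instances = {
            service: {f"{service}-{i}" for i in range(count)}
            for service, count in replicas_per_service.items()
        }

    def running(self):
        # Sorted so that a seeded random choice is reproducible.
        return sorted(i for ids in self.instances.values() for i in ids)

    def kill(self, instance_id):
        for ids in self.instances.values():
            ids.discard(instance_id)

    def is_healthy(self):
        # Assumption: a service counts as "up" while at least one replica runs.
        return all(len(ids) > 0 for ids in self.instances.values())


def chaos_monkey(cluster, rng):
    """Kill one randomly chosen running instance and return its id (or None)."""
    candidates = cluster.running()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    cluster.kill(victim)
    return victim


def survives(replicas_per_service, kills, seed=0):
    """Let the monkey loose `kills` times and report whether all services are still up."""
    rng = random.Random(seed)
    cluster = Cluster(replicas_per_service)
    for _ in range(kills):
        chaos_monkey(cluster, rng)
    return cluster.is_healthy()
```

With a single instance, survives({"web": 1}, kills=1) returns False: one random kill takes the service down. With three replicas each of "web" and "db", two kills can never empty a service, so survives({"web": 3, "db": 3}, kills=2) returns True whatever the seed.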



A similar story, and another reason why Stack Overflow is a great web site.
They had a long-running network problem, and they used it as an opportunity
for improvement...

Every week that went by, we made our system a tiny bit more redundant, because we had to. Despite the ongoing pain, it became clear that Chaos Monkey was actually doing us a big favor by forcing us to become extremely resilient. Not tomorrow, not someday, not at some indeterminate "we'll get to it eventually" point in the future, but right now where it hurts.


A wise quote:
That which does not kill us makes us stronger.
- Friedrich Nietzsche