Wednesday, April 04, 2012

Erlang: 99.9999999% reliable (nine nines)

How reliable is nine "9"s of Erlang?
That is about 3 seconds of downtime in 100 years!
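
The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: the function name `downtime_seconds` and the use of an average 365.25-day year are my own assumptions, not part of any published calculation.

```python
# Downtime implied by an availability figure over a given period.
# Assumption: an average year of 365.25 days (leap days included).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds(nines: int, years: float) -> float:
    """Expected downtime in seconds for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 9 nines -> 1e-9
    return unavailability * years * SECONDS_PER_YEAR


if __name__ == "__main__":
    # Nine nines over a century: about 3.16 seconds.
    print(f"{downtime_seconds(9, 100):.2f} s per 100 years")
    # Five nines over one year, for comparison: about 5.26 minutes.
    print(f"{downtime_seconds(5, 1) / 60:.2f} min per year")
```

So "3 seconds in 100 years" is a fair rounding of nine nines.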

It is not surprising that
  • Facebook is using Erlang for its messaging service (845 million active users)
  • Parts of the GitHub back end are running on Erlang
  • Amazon is using Erlang for various tools, including SimpleDB
  • CouchDB is written in Erlang
  • T-Mobile is using it
  • Riak NoSQL is on Erlang

    What's all this fuss about Erlang?
    - The Pragmatic Bookshelf
    (by Erlang's creator, Joe Armstrong)

    podcast about Erlang @ DotNetRocks

    Introduction to programming in Erlang (1), (2)

    Erlang is a "pure" functional language, where the only way to share data between processes
    is by sending messages, data is immutable, and processes are very cheap / small: about 1 KB each!
    By contrast, a Windows thread (part of a process) takes about 1 MB of memory by default,
    and data sharing requires extensive locking...
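
To make the "share nothing, send messages" idea concrete, here is a rough analogy in Python (not Erlang). The names `counter` and `run_demo` are made up for illustration, and Python threads are ordinary OS threads, so this shows only the messaging style, not the cheapness of Erlang processes.

```python
import queue
import threading


def counter(mailbox: queue.Queue, replies: queue.Queue) -> None:
    """Owns its own total; the only way to affect it is to send a message."""
    total = 0
    while True:
        msg = mailbox.get()        # block until a message arrives
        if msg == "stop":
            replies.put(total)     # the result goes back as a message too
            return
        total += msg


def run_demo() -> int:
    mailbox, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=counter, args=(mailbox, replies))
    worker.start()
    for n in (1, 2, 3):
        mailbox.put(n)             # send; never touch `total` directly
    mailbox.put("stop")
    worker.join()
    return replies.get()


if __name__ == "__main__":
    print(run_demo())  # prints 6
```

No lock appears in the user code because no state is shared: the worker's `total` is reachable only through its mailbox.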

    The syntax of Erlang is based on the Prolog language, so it is very different from C-based languages.
    It is also very compact. Since 'variables' cannot be changed once bound, data is copied rather than
    modified in place (messages between processes in particular are copied),
    and this comes with some performance cost. On the other hand, the system is asynchronous,
    so the overall performance of the system is apparently good.

    To optimize performance, parts of CouchDB are now being rewritten from Erlang into C/C++.
    Apparently even Ericsson attempted something similar when a major Erlang project was released,
    and they soon gave up, realizing that with C/C++ they could not get the required reliability...

    Erlang can be used on Linux/Unix, Mac and Windows.
    The Windows install is simple, one click.
    The interactive environment is "unix-style", so on Windows it requires some adjustment.
    Here is some useful Erlang/Windows info @ StackOverflow.
    (Yes, you have to use forward slashes. Backslashes are special in Erlang strings.)
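
The same pitfall exists in most languages with C-style string escapes. As a small illustration in Python rather than Erlang (the path is made up), a backslash followed by `t` or `n` silently becomes a tab or a newline:

```python
# Hypothetical Windows path, written three ways.
escaped = "C:\temp\new"      # "\t" becomes a tab, "\n" becomes a newline
forward = "C:/temp/new"      # forward slashes need no escaping
doubled = "C:\\temp\\new"    # a doubled backslash is one literal backslash


def describe(path: str) -> str:
    """Show the length and the escaped form of a path string."""
    return f"{len(path)} chars: {path!r}"


if __name__ == "__main__":
    for p in (escaped, forward, doubled):
        print(describe(p))
```

The first form is two characters shorter than the others and is not a usable path, which is why forward slashes are the safer habit.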

    Apparently Amazon AWS SimpleDB is built on Erlang.

    Maybe Microsoft can consider Erlang for some critical Azure services?

    The recent long Azure downtime (16 hours), caused by a "leap year" security bug,
    maybe could not have been avoided in any language, unless there was a test case for it...
    On the other hand, such an extra security feature and its cascading effects should not be part of the core platform...
    To make things more troubling, Amazon had a similar issue about a year ago:
    a cascading downtime that lasted for a few days.
    In both cases the damage was done by auto-recovery features that cascaded.
    This starts to resemble HAL 9000 from "2001: A Space Odyssey":
    programs blindly following instructions... Time for more "systems design".