Understanding Elixir OTP Applications - Part 2, Fault Tolerance

In this mini-series I'm going through the core principles behind the Erlang and Elixir languages piece by piece. In this article I will discuss Elixir's approach to fault tolerance through a philosophy they refer to as "let it crash".

In the last article I discussed how Elixir handles distribution through isolated processes and message mailboxes. Now we'll take a look at how Elixir makes those processes fault-tolerant.

Previously we created a KeyValueStore that implemented a GenServer interface. We didn't take care of errors in that module. In some situations it's a good idea to be defensive, other times it's counter-productive. We cannot know in advance all of the issues that might occur in our processes so defending against all exceptions can be an exhausting task. In Elixir it's encouraged to "let it crash" and then restart the process in a fresh state. This makes sense in a lot of contexts, e.g. if the network drops then we just restart the process until the network reconnects. Of course there are also situations where it's best to handle some known edge-cases.

We can use Supervisors to watch our processes and reboot them if they crash. A Supervisor is a type of Process that monitors another Process and then takes some action if the child Process crashes. The Supervisor will manage all life cycle events for the child Process, including startup and shutdown.

Let's add a Supervisor to monitor our KeyValueStore:

defmodule KeyValueStore.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok) do
    children = [KeyValueStore]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

You might notice a couple of things here.

What are these children? A Supervisor can manage more than one Process. Each of the Processes the Supervisor manages are called children.

What is a one-for-one strategy? The strategy defines the action the Supervisor takes when a child terminates:
  • :one_for_one restarts only the process that terminated.
  • :one_for_all terminates and restarts all child processes when any child terminates.
  • :rest_for_one restarts the terminated process, then also terminates and restarts every sibling started after it, e.g. if we have the following

children = [Module1, Module2, Module3]

    If Module1 crashes, all three processes will be restarted. However, if the Module2 process crashes, only Module2 and Module3 will be restarted.
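To make this concrete, here's a minimal, self-contained sketch using three supervised Agents as stand-ins for Module1, Module2, and Module3 (the :child1/:child2/:child3 ids are hypothetical labels, not part of the article's code):

```elixir
# Three Agents supervised with :rest_for_one.
children = [
  Supervisor.child_spec({Agent, fn -> 1 end}, id: :child1),
  Supervisor.child_spec({Agent, fn -> 2 end}, id: :child2),
  Supervisor.child_spec({Agent, fn -> 3 end}, id: :child3)
]

{:ok, sup} = Supervisor.start_link(children, strategy: :rest_for_one)

# Record each child's pid, then kill the second child.
pids_before = for {id, pid, _type, _mods} <- Supervisor.which_children(sup), do: {id, pid}
{_, victim} = List.keyfind(pids_before, :child2, 0)
Process.exit(victim, :kill)
Process.sleep(100)

# :child1 keeps its pid; :child2 and :child3 come back with new pids.
pids_after = for {id, pid, _type, _mods} <- Supervisor.which_children(sup), do: {id, pid}
```

Comparing pids_before and pids_after shows only the terminated child and the siblings started after it were replaced.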

Starting a Supervisor

When we start the Supervisor, it will call child_spec/1 on each of its children. The child_spec/1 function returns a specification for how the process should be started. This function is automatically defined for us because we call use GenServer in our KeyValueStore.

def child_spec(arg) do
  %{
    id: __MODULE__,
    start: {__MODULE__, :start_link, [arg]}
  }
end

If we want to start the process in a different way, we can override this function to customize the arguments passed on start. The Supervisor expects child_spec/1 to return a map with at minimum :id, which identifies the process, and :start, which defines how the module will be started. You can also pass :restart, :shutdown, and :type; I won't go over those here.
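We can inspect such a generated specification for ourselves. Agent, which gets child_spec/1 defined the same way, makes for a self-contained example:

```elixir
# Agent.child_spec/1 is auto-generated, just like our GenServer's.
spec = Agent.child_spec(fn -> 0 end)

# The spec is a plain map with :id and :start keys, roughly:
# %{id: Agent, start: {Agent, :start_link, [#Function<...>]}}
```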

We can see from the child_spec/1 function above that the :start key defines how the module will be started. The {module, function, args} tuple under that key is passed to apply/3. Later on we'll update the child specification in our Supervisor and pass a name to our KeyValueStore GenServer.
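In other words, starting a child boils down to calling apply/3 on the :start tuple. A quick sketch using Agent (the :kv_demo name is an arbitrary example, not from the article):

```elixir
# Applying a {module, function, args} start tuple by hand.
{m, f, a} = {Agent, :start_link, [fn -> %{} end, [name: :kv_demo]]}
{:ok, pid} = apply(m, f, a)

# The process is now registered under the given name.
^pid = Process.whereis(:kv_demo)
```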

Once the Supervisor has collected all of the child specifications, it starts the children one by one, beginning with the first child. It uses the value from the :start key in each child specification map to start the linked process, e.g.

%{
  id: KeyValueStore,
  start: {KeyValueStore, :start_link, [[name: KeyValueStore]]}
}

Let's Take Our KeyValueStore For A Spin

We need to make a small change to our KeyValueStore GenServer. Previously we used GenServer.start/3 to spawn the process, creating an unlinked process that lives outside of a supervision tree. We want a process linked to our Supervisor, so let's change GenServer.start/3 to GenServer.start_link/3. The Supervisor expects our module to expose this through a start_link function whose single argument is a keyword list of options.

defmodule KeyValueStore do
  def start_link(opts) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  # ...the rest of the module stays the same
end

This change also allows us to name our processes from our Supervisor. Let's do that too.

children = [
  {KeyValueStore, name: :primary_key_value_store}
]

Ok, now let's spin it up and make it crash.

iex> KeyValueStore.set(:foo, :bar)
=> :ok
iex> KeyValueStore.get(:foo)
=> :bar
iex> GenServer.call(:primary_key_value_store, :invalid_call)
=> [error] GenServer :primary_key_value_store terminating
=> Last message (from #PID<0.501.0>): :invalid_call
=> State: %{foo: :bar}
=> Client #PID<0.501.0> is alive
iex> KeyValueStore.get(:foo)
=> nil

Great! Now our process automatically boots itself back up when it crashes.
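Here's a self-contained sketch of that restart behaviour. KV and the :kv name are hypothetical stand-ins for the article's KeyValueStore and :primary_key_value_store:

```elixir
defmodule KV do
  # Minimal stand-in for KeyValueStore.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:set, key, value}, _from, state),
    do: {:reply, :ok, Map.put(state, key, value)}

  def handle_call({:get, key}, _from, state),
    do: {:reply, Map.get(state, key), state}
end

{:ok, _sup} = Supervisor.start_link([{KV, name: :kv}], strategy: :one_for_one)

:ok = GenServer.call(:kv, {:set, :foo, :bar})
old_pid = Process.whereis(:kv)

# Kill the process; the Supervisor replaces it with a fresh, empty one.
Process.exit(old_pid, :kill)
Process.sleep(100)

new_pid = Process.whereis(:kv)
```

After the kill, the registered name points at a brand-new pid and the previous state is gone, which is exactly the "let it crash" reset we wanted.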

What's next?

We've covered how Elixir manages distributed systems and fault tolerance, next we'll take a look at why Elixir is considered soft real-time.

Need help?

Do you have an Elixir project that you need help with? Get in touch with us to see how we can help.


Joe Woodward
I'm Joe Woodward, a Ruby on Rails fanatic working with OOZOU in Bangkok, Thailand. I love Web Development, Software Design, Hardware Hacking.
