Failures in computations have hardware, software, and human causes. How much each cause contributes is difficult to quantify and depends strongly on circumstances. We can therefore conclude that
- failures are in principle unavoidable and that
- we must take care of all three sources of errors.
An actor system consists of computationally separate and concurrent entities. If one actor fails, the system does not crash immediately, as a sequentially organized application would. Other actors can continue their tasks as long as they do not try to communicate with the failed one. The system is nevertheless in a problematic state, and we must prevent the failure from cascading further.
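The isolation described above can be illustrated with a minimal sketch. It uses plain Python threads as stand-ins for actors, which is an assumption made for illustration and not the API of any actor library; `run_isolated` and the task names are hypothetical.

```python
import threading

def run_isolated(tasks):
    """Run each task in its own thread and record how it ended.

    `tasks` maps a name to a zero-argument function. A failing task
    only affects its own entry; the other tasks still complete.
    """
    results = {}

    def worker(name, fn):
        try:
            results[name] = ("ok", fn())
        except Exception as exc:
            # The failure stays local to this worker.
            results[name] = ("failed", type(exc).__name__)

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in tasks.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

outcome = run_isolated({
    "a": lambda: 1 + 1,
    "b": lambda: 1 // 0,      # this task fails
    "c": lambda: "done",
})
```

Here `outcome["b"]` records the failure while `"a"` and `"c"` finish normally. Nothing has yet reacted to the failed task, which is the problematic state mentioned above.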
The solution is not to defend against errors but to organize the system such that actors
- monitor each other for failures and
- perform corrective actions if a failure is detected.
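As a rough sketch of this idea, the following thread-based toy actor notifies its monitors when it fails, and a supervising loop restarts it as the corrective action. The `Actor` class, the `STOP` sentinel, and `run_supervised` are hypothetical names invented for this example; they do not reproduce any library's interface.

```python
import queue
import threading

STOP = object()  # sentinel message that ends an actor's loop

class Actor:
    """A minimal thread-based actor with a mailbox (illustrative only)."""

    def __init__(self, behavior):
        self.behavior = behavior      # function called for each message
        self.mailbox = queue.Queue()
        self.monitors = []            # queues to notify if the actor fails
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()
        return self

    def send(self, msg):
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is STOP:
                return
            try:
                self.behavior(msg)
            except Exception as exc:
                # The actor terminates and informs its monitors.
                for m in self.monitors:
                    m.put(("down", self, exc))
                return

def run_supervised(messages):
    """Send messages to a worker; restart the worker after each failure."""
    events = queue.Queue()

    def behavior(x):
        events.put(("ok", 10 // x))   # raises ZeroDivisionError for x == 0

    def spawn():
        actor = Actor(behavior)
        actor.monitors.append(events)  # monitor the new worker
        return actor.start()

    worker = spawn()
    log = []
    for x in messages:
        worker.send(x)
        event = events.get(timeout=5)
        if event[0] == "down":
            log.append("restarted")
            worker = spawn()           # corrective action: restart
        else:
            log.append(event[1])
    worker.send(STOP)
    return log

log = run_supervised([5, 0, 2])
```

The message `0` makes the worker fail. The monitoring side receives a `"down"` event, starts a fresh worker, and the remaining message is processed normally.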
The following mechanisms support this:
- Actors connect and propagate an Exit to each other.
- Actors can monitor other actors and tasks.
- Actors can be supervised and restarted.
- Actors can save checkpoints to checkpointing actors and restore them.
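Checkpointing and restoring can be sketched as follows. The message tuples (`"save"`, `"restore"`, `"stop"`) and the functions `checkpoint_store`, `counting_worker`, and `demo` are assumptions made for this example, not an existing API. For brevity the worker runs as an ordinary function call and only the checkpoint store runs as a separate thread.

```python
import queue
import threading

def checkpoint_store(inbox):
    """Checkpointing actor: keeps the latest saved state per key."""
    saved = {}
    while True:
        msg = inbox.get()
        if msg[0] == "save":
            _, key, state = msg
            saved[key] = state
        elif msg[0] == "restore":
            _, key, reply = msg
            reply.put(saved.get(key))   # None if nothing was saved
        elif msg[0] == "stop":
            return

def counting_worker(store, key, items, fail_at=None):
    """Sum `items`, saving a checkpoint after each one.

    On start the worker restores the last checkpoint, so a restarted
    worker resumes where the failed one stopped.
    """
    reply = queue.Queue()
    store.put(("restore", key, reply))
    state = reply.get(timeout=5) or {"index": 0, "total": 0}
    for i in range(state["index"], len(items)):
        if i == fail_at:
            raise RuntimeError("simulated failure")
        state = {"index": i + 1, "total": state["total"] + items[i]}
        store.put(("save", key, state))
    return state["total"]

def demo():
    store = queue.Queue()
    threading.Thread(target=checkpoint_store, args=(store,),
                     daemon=True).start()
    items = [1, 2, 3, 4]
    try:
        counting_worker(store, "job", items, fail_at=2)
    except RuntimeError:
        pass                            # failure detected; restart below
    total = counting_worker(store, "job", items)   # resumes at index 2
    store.put(("stop",))
    return total

total = demo()
```

Because the store handles its mailbox in order, the checkpoints saved before the failure are already recorded when the restarted worker asks for them, so the first two items are not processed again.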
monitor are called from the REPL or a user script and not from an actor, the given link will be connected to or monitored by the
- [1] Egwutuoha, I.P., Levy, D., Selic, B. et al. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J Supercomput 65, 1302–1326 (2013). https://doi.org/10.1007/s11227-013-0884-0
- [2] Joe Armstrong's dissertation gives an outline of actor-based error handling: Making reliable distributed systems in the presence of software errors.
- [3] For implementation, see also Joe Armstrong 2013. Programming Erlang, 2nd ed: Software for a Concurrent World; Pragmatic Bookshelf, chs. 13 and 23, as well as the Erlang/OTP and Elixir online documentation.