If we want actors to be restarted automatically if they fail, we can use `supervise` to put them under the supervision of a supervisor actor. Let's consider a system of six actors A1–A6 under supervision of the supervisor A10. Such a system may look as follows:

The six actors A1–A6 are connected to their supervisor A10 and will send it an `Exit` message before they exit. They do not notify each other. It is the duty of the supervisor to decide what to do.
What the supervisor does, should one of its child actors – say A4 – fail, depends on three parameters:

- the supervision strategy (`:one_for_one`, `:one_for_all` or `:rest_for_one`) of the supervisor,
- the restart option (e.g. `:temporary` or `:transient`) of the supervised child and
- the exit reason.

Let's discuss the depicted case of actor A4 failing: we assume `:transient` child actors, meaning they are restarted if they terminate abnormally, that is, if they fail. Now what happens depends solely on the supervisor's strategy.
| strategy | the supervisor will restart ... |
|:---------|:--------------------------------|
| `:one_for_one` | the failed actor |
| `:one_for_all` | all child actors |
| `:rest_for_one` | the failed actor A4 and the actors registered to the supervisor after it |

In the second and third case the other actors are shut down by the supervisor before being restarted.
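The effect of the three strategies can be sketched in plain Julia. This is a conceptual sketch, not the library's API; the function `to_restart` and the child list are made up for illustration:

```julia
# Which children a supervisor restarts for each strategy,
# given the index of the failed child (conceptual sketch).
function to_restart(strategy::Symbol, children::Vector, failed::Int)
    if strategy == :one_for_one
        return children[failed:failed]   # only the failed child
    elseif strategy == :one_for_all
        return children                  # all child actors
    elseif strategy == :rest_for_one
        return children[failed:end]      # the failed child and those after it
    else
        error("unknown strategy $strategy")
    end
end

children = [:A1, :A2, :A3, :A4, :A5, :A6]
to_restart(:one_for_one, children, 4)   # [:A4]
to_restart(:one_for_all, children, 4)   # [:A1, :A2, :A3, :A4, :A5, :A6]
to_restart(:rest_for_one, children, 4)  # [:A4, :A5, :A6]
```

Note that `:rest_for_one` depends on the registration order of the children: only actors registered after the failed one are affected.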
You can see this in action in the Supervise Actors tutorial.
A failing actor transfers its behavior (behavior function and acquaintance variables) to the supervisor before it exits. Thus the supervisor can restart that actor with the state it had before processing the last message. That is also demonstrated in the Supervise Actors tutorial.
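The hand-over of behavior and state can be illustrated with a minimal plain-Julia sketch. All names here are made up and this is not the library's mechanism, only the idea: the failing "actor" saves its behavior and acquaintances before it exits, so a restart can resume from the state before the failing message:

```julia
# Conceptual sketch of state transfer on failure (not the library's API).
mutable struct ActorState
    bhv::Function               # behavior function
    acquaintances::Vector{Any}  # captured state of the actor
end

saved = Ref{Union{ActorState,Nothing}}(nothing)  # the "supervisor's" copy

function run_actor(st::ActorState, msg)
    try
        st.bhv(st.acquaintances, msg)
    catch
        saved[] = st            # transfer the state to the supervisor ...
        rethrow()               # ... before exiting with the failure
    end
end

# a behavior that fails on a :boom message
counter(acq, msg) = msg == :boom ? error("failure") : push!(acq, msg)

st = ActorState(counter, Any[])
run_actor(st, 1)
run_actor(st, 2)
try run_actor(st, :boom) catch end

# the supervisor can restart with the state before the failing message:
saved[].acquaintances   # Any[1, 2]
```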
That strategy may be sufficient for many cases, but you can change it by setting termination and restart callbacks.
There are cases where you want a different user-defined fallback strategy for actor restart, for example to

- restart it with a different algorithm/behavior or data set,
- do some cleanup before restarting it,
- restart after a node failure, or
- save and restore a checkpoint.
For that you can define callback functions invoked at actor termination, restart or initialization:
| callback | description |
|:---------|:------------|
| `term` | term callback; if defined, it is called at actor exit with argument `reason` (exit reason) |
| `restart` | restart callback; given as `cb` argument to `start_task`, it is executed by a supervisor to restart an actor/task |
| `init` | init callback; if defined (and no `restart` callback is given), the supervisor restarts an actor with the given `init` behavior |
User-defined callbacks must follow some conventions:

- A `restart` callback does some initialization, spawns an actor or a task and returns a `Link` or `Task`, which again will be supervised.
- An `init` callback is a startup behavior of an actor. It does some initialization or recovery and then switches (with `become`) to the target behavior. A supervisor spawns a new supervised actor with the given `init` behavior and triggers it.
- A supervisor wants an actor running on a worker process (over a `RemoteChannel`) to restart on the same or on a spare `pid` (process id). In that case it calls the `restart` callback with a `pid` keyword argument (and the callback must take it).
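The init-callback convention can be sketched in plain Julia. The names and the way the behavior switch is modeled (returning the next behavior instead of calling `become`) are illustrative, not the library's API:

```julia
# Sketch of the init-callback convention (not the library's API).
recover_checkpoint() = 40            # assumed recovery, e.g. read a checkpoint

target(state, msg) = state + msg     # the actor's target behavior

function init()                      # startup behavior run on restart
    state = recover_checkpoint()     # initialization or recovery ...
    return msg -> target(state, msg) # ... then switch to the target behavior
end

bhv = init()   # the supervisor triggers the init behavior
bhv(2)         # the actor now runs its target behavior: 42
```

The important point is that after initialization the actor ends up in its normal target behavior with recovered state, so callers cannot tell a restarted actor from the original one.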
After restarting an actor, a supervisor updates its link to point to the newly created actor. But copies of a link won't get updated and may then be out of sync.
If remote actors on other workers communicate with an actor over
RemoteChannels, they have copies of its link on their workers. After actor restart those are out of sync, and a remote actor may then try to communicate with an old failed actor. To avoid this situation, you should register supervised remote actors and use their registered names to supervise them and communicate with them. The supervisor then will update the registered link.
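Why copied links go stale can be sketched with plain channels standing in for links. This is a conceptual sketch, not the library's registry; the `registry` dictionary and the names are made up:

```julia
# Sketch: a registered name survives a restart, a copied link does not.
registry = Dict{Symbol,Channel{Any}}()

registry[:worker] = Channel{Any}(8)   # register the actor's link under a name

stale_copy = registry[:worker]        # a copy of the link, e.g. on a worker

registry[:worker] = Channel{Any}(8)   # restart: supervisor updates the entry

put!(registry[:worker], :restarted)   # lookup by name reaches the new actor
stale_copy === registry[:worker]      # false: the copy is out of sync
```

Looking the link up by its registered name on every send always yields the current link, which is why registration is recommended for supervised remote actors.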
A supervisor takes two additional arguments, `max_restarts` and `max_seconds`, with which you can limit the number of restarts it does in the `max_seconds` time frame. If failures exceed that limit, a supervisor will shut down its child actors and itself with an error message.
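The restart limit can be sketched in plain Julia. This is a conceptual sketch of the bookkeeping, not the library's implementation; the type and function names are made up:

```julia
# Sketch: allow a restart only if fewer than max_restarts happened
# within the last max_seconds (conceptual, not the library's API).
mutable struct RestartLimiter
    max_restarts::Int
    max_seconds::Float64
    times::Vector{Float64}       # timestamps of recent restarts
end
RestartLimiter(mr, ms) = RestartLimiter(mr, ms, Float64[])

function allow_restart!(rl::RestartLimiter, now::Float64)
    filter!(t -> now - t < rl.max_seconds, rl.times)  # drop aged-out restarts
    push!(rl.times, now)
    return length(rl.times) <= rl.max_restarts        # false ⇒ shut down
end

rl = RestartLimiter(3, 5.0)
allow_restart!(rl, 0.0)   # true
allow_restart!(rl, 1.0)   # true
allow_restart!(rl, 2.0)   # true
allow_restart!(rl, 3.0)   # false: 4th restart within 5 seconds
allow_restart!(rl, 9.0)   # true again: the old restarts aged out
```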
For larger applications you may be interested in building a hierarchical structure containing all actors and tasks. This is called a supervisory tree, and there is the `Supervisors` package to facilitate building one.