Cancellation and error propagation#
tractor supports trio’s cancellation system verbatim,
then extends it across process boundaries. If you know how to
cancel a task in trio you already know how to cancel an actor —
and its whole subtree — in tractor; the runtime’s job is making
that statement hold over IPC with every structured concurrency (SC)
guarantee intact.
The ground rules,
- a remote actor is never cancelled unless explicitly requested (by a parent or peer), unless supervision demands it (an error triggered one-cancels-all teardown), or unless there’s a bug in tractor itself (please report it!),
- (remote) errors always propagate back to the parent supervisor; nothing is silently dropped on the floor,
- every spawned process gets reaped no matter how it dies; if you can create a zombie child process (without using a system signal) it is a bug.
trio cancellation, across the wire#
Locally everything is bog-standard trio: nurseries, cancel
scopes, timeouts. tractor adds exactly one twist: a cancel
scope can’t physically reach into another process, so the runtime
relays cancellation as messages. Concretely,
- cancelling an actor means sending it a runtime-cancel request msg; the target then runs its own graceful teardown (cancelling RPC tasks, closing channels, exiting its trio.run()) and acks the request back to the canceller,
- cancelling a single cross-actor task works through the tractor.Context layer: each ctx task-pair is cancel-scope-linked over IPC such that either side erroring or cancelling relays an equivalent error to the other side (see The Context: a cross-actor task pair for the gory details),
- a cancel is therefore always a request with an ack: the canceller does a bounded wait for confirmation and escalates if the peer is unresponsive (see the teardown ladder below).
One-cancels-all supervision#
A tractor.ActorNursery supervises subactors exactly like
trio nurseries supervise tasks: when one child errors, the error
propagates to the supervising block and all sibling subactors
get cancelled before the error continues bubbling up the (process)
tree.
One-cancels-all: no zombies, no lost errors.#
import trio
import tractor


async def assert_err():
    assert 0


async def main():
    async with tractor.open_nursery() as n:
        real_actors = []
        for i in range(3):
            real_actors.append(await n.start_actor(
                f'actor_{i}',
                enable_modules=[__name__],
            ))

        # start one actor that will fail immediately
        await n.run_in_actor(assert_err)

    # should error here with a ``RemoteActorError`` containing
    # an ``AssertionError`` and all the other actors have been cancelled


if __name__ == '__main__':
    try:
        # also raises
        trio.run(main)
    except tractor.RemoteActorError:
        print("Look Maa that actor failed hard, hehhh!")
What’s going on here?
- three healthy actors are spawned as daemons via tractor.ActorNursery.start_actor(); left alone they’d happily idle forever,
- a fourth actor runs assert_err() via .run_in_actor() and promptly trips its assert 0,
- the resulting AssertionError ships back over IPC as a serialized error msg and re-raises boxed inside the nursery block as a tractor.RemoteActorError,
- the nursery reacts like any trio nursery would: it cancels the three healthy siblings (graceful runtime-cancel requests, acks awaited), reaps all four processes, then re-raises,
- trio.run(main) sees that same RemoteActorError in the parent-most process: propagation is end-to-end or bust.
This one-cancels-all style is currently the only supervision
strategy offered (it’s the one trio gives you); more
erlang-style strategies are on the roadmap, see the bottom of this page.
The boxed-error bestiary#
All remote failures arrive locally as one of a small set of exception types, each carrying enough metadata to work out who failed, where, and why.
RemoteActorError#
The workhorse: a “boxed” exception relayed over IPC from another actor. The original error’s type, traceback string and msgdata are preserved so you can pattern-match on what actually went wrong remotely,
- .boxed_type: the reconstructed type of the original remote exception (ValueError, NameError, what have you),
- .src_uid: the (name, uuid) pair of the actor where the error originated,
- .relay_uid / .relay_path: when an error crosses more than one actor boundary (grandchild -> child -> root) every relaying actor is recorded; multi-hop boxings are lovingly referred to as “inceptions” in the runtime internals,
- .pformat(): a rich “tb box” rendering of the remote traceback for your logs or REPL.
try:
    async with portal.open_context(ep) as (ctx, first):
        ...
except tractor.RemoteActorError as rae:
    if rae.boxed_type is ValueError:
        ...  # the remote task raised `ValueError`
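To make the "boxing" idea concrete, here is a minimal toy model using only the standard library. It is not tractor's implementation: `ToyRemoteError`, `pack_error()`, `unpack_error()` and the msg field names are all hypothetical, and JSON merely stands in for the real wire format. It only illustrates how the original type, the origin uid and each relaying hop can survive a grandchild -> child -> root trip.

```python
import builtins
import json
import traceback


class ToyRemoteError(Exception):
    """Hypothetical stand-in for a boxed remote error (not tractor's class)."""

    def __init__(self, msg: dict):
        super().__init__(f"{msg['boxed_type_str']} from {msg['src_uid']}")
        # Rebuild the original type from its name; fall back to `Exception`
        # when the name is not a builtin exception type.
        boxed = getattr(builtins, msg['boxed_type_str'], Exception)
        if not (isinstance(boxed, type) and issubclass(boxed, BaseException)):
            boxed = Exception
        self.boxed_type = boxed
        self.src_uid = tuple(msg['src_uid'])
        self.relay_path = [tuple(uid) for uid in msg['relay_path']]
        self.tb_str = msg['tb_str']


def pack_error(exc: BaseException, my_uid: tuple) -> str:
    """Serialize `exc` for the (toy) wire, as seen from actor `my_uid`."""
    if isinstance(exc, ToyRemoteError):
        # Relaying an already-boxed error: keep the origin, record this hop.
        msg = {
            'boxed_type_str': exc.boxed_type.__name__,
            'src_uid': exc.src_uid,
            'relay_path': exc.relay_path + [my_uid],
            'tb_str': exc.tb_str,
        }
    else:
        # First boxing: this actor is where the error originated.
        msg = {
            'boxed_type_str': type(exc).__name__,
            'src_uid': my_uid,
            'relay_path': [],
            'tb_str': ''.join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        }
    return json.dumps(msg)


def unpack_error(wire: str) -> ToyRemoteError:
    """Rebuild a boxed error on the receiving side."""
    return ToyRemoteError(json.loads(wire))


grandchild = ('grandchild', 'uuid-1')
child = ('child', 'uuid-2')

try:
    raise ValueError('bad input')
except ValueError as err:
    wire = pack_error(err, grandchild)

boxed_in_child = unpack_error(wire)
boxed_in_root = unpack_error(pack_error(boxed_in_child, child))
```

In this sketch the root can still match on `boxed_in_root.boxed_type is ValueError` after two hops, which is the same pattern-matching style the `except tractor.RemoteActorError` example relies on.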
ContextCancelled#
The cancel-ack for a cross-actor task pair: raised when a
tractor.Context task is cancelled by request. Its
.canceller attr is the uid of the actor which requested the
cancel, which powers the key rule,
- if you requested it (you called tractor.Context.cancel()) the resulting ctxc is absorbed at open_context() exit: an expected outcome, not an error,
- if anyone else did (the peer task, or some third-party actor) it raises locally so your code always hears about it.
The full self- vs. cross-cancel semantics are a core teaching point of The Context: a cross-actor task pair; go read them there.
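The self- vs. cross-cancel rule can be sketched as a tiny synchronous toy model. Everything here is hypothetical and illustrative: `ToyContextCancelled` and `toy_open_context()` are not tractor APIs, and real contexts are async and cross-process. The only thing modelled is the decision "absorb if I was the canceller, otherwise re-raise".

```python
from contextlib import contextmanager


class ToyContextCancelled(Exception):
    """Hypothetical stand-in carrying the uid of whoever requested the cancel."""

    def __init__(self, canceller: tuple):
        super().__init__(f'cancelled by {canceller}')
        self.canceller = canceller


@contextmanager
def toy_open_context(my_uid: tuple):
    """Absorb a self-requested cancel at block exit, re-raise any other."""
    try:
        yield
    except ToyContextCancelled as ctxc:
        if ctxc.canceller != my_uid:
            raise  # cross-cancel: someone else asked, so the caller must hear it
        # self-cancel: an expected outcome, swallowed here


me = ('root', 'uuid-0')
peer = ('worker', 'uuid-1')

# Self-cancel: the block exits quietly.
with toy_open_context(me):
    raise ToyContextCancelled(canceller=me)
self_cancel_absorbed = True

# Cross-cancel: the same exception type now propagates to the caller.
cross_canceller = None
try:
    with toy_open_context(me):
        raise ToyContextCancelled(canceller=peer)
except ToyContextCancelled as ctxc:
    cross_canceller = ctxc.canceller
```

Comparing uids is the whole trick in this sketch; it is why the `.canceller` attr matters for telling an expected outcome from a surprise.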
MsgTypeError#
An IPC-payload “type error”: a msg violated the dialog’s declared payload spec. See Typed messaging for the typed-messaging system which enforces it.
TransportClosed#
The underlying IPC transport (TCP stream, UDS socket, …) died or closed out from under a channel. You’ll normally only see this surface when a peer hard-exits without any graceful runtime teardown; the supervision machinery treats unexpected transport loss on a busy channel as a failure and tears down accordingly.
Pick your blast radius#
Three cancel surfaces, three scopes of effect; choose the smallest hammer that does the job.
| surface | cancels | typical use |
|---|---|---|
| ActorNursery.cancel() | every subactor in the nursery | whole-tree teardown |
| Portal.cancel_actor() | one actor: full runtime + proc | daemon teardown |
| Context.cancel() | exactly one remote task | surgical task cancel |
ActorNursery.cancel()#
The big red button: gracefully cancel every subactor supervised by
the nursery, in parallel, with the escalation discipline below
applied per-child. It’s invoked for you whenever an error hits the
nursery block (one-cancels-all); call it yourself for an orderly
early shutdown. Passing hard_kill=True skips the graceful phase
and goes straight to OS-level process termination — rarely what you
want outside tests.
Portal.cancel_actor()#
Cancel one whole actor: its entire runtime, every task it’s scheduled, and (for subactors) the OS process, via a graceful runtime-cancel request,
await portal.cancel_actor() # bounded wait, bool result
await portal.cancel_actor(
raise_on_timeout=True, # no ack in time?
) # -> `ActorTooSlowError`
The wait for the peer’s ack is bounded (default
Portal.cancel_timeout = 0.5 seconds, tunable per call via
timeout=). By default a missed ack just returns False; with
raise_on_timeout=True you instead get an ActorTooSlowError
(from tractor._exceptions) so your code can escalate per SC
discipline — exactly what the nursery’s own teardown does
internally before resorting to OS-level signalling.
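The "bounded wait, bool result, optional raise" contract described above can be modelled in a few lines of standard-library Python. This is a sketch under stated assumptions, not the runtime's code: `request_cancel()` and `ToyActorTooSlowError` are hypothetical names, and a `threading.Event` stands in for the peer's ack arriving over IPC.

```python
import threading


class ToyActorTooSlowError(Exception):
    """Hypothetical stand-in for a missed cancel-ack."""


def request_cancel(send_request, ack, timeout=0.5, raise_on_timeout=False):
    """Send a cancel request, then wait at most `timeout` seconds for the ack.

    Returns True when the ack arrives in time. On a missed ack it returns
    False, or raises `ToyActorTooSlowError` if `raise_on_timeout` is set.
    """
    send_request()
    if ack.wait(timeout):
        return True
    if raise_on_timeout:
        raise ToyActorTooSlowError(f'no ack within {timeout}s')
    return False


# A responsive peer: the "request" itself triggers the ack.
responsive_ack = threading.Event()
acked = request_cancel(responsive_ack.set, responsive_ack)

# An unresponsive peer: the request goes nowhere, so the wait times out.
silent_ack = threading.Event()
missed = request_cancel(lambda: None, silent_ack, timeout=0.05)

# Same unresponsive peer, but the caller opts in to an exception.
escalated = False
try:
    request_cancel(lambda: None, silent_ack, timeout=0.05, raise_on_timeout=True)
except ToyActorTooSlowError:
    escalated = True  # a real caller would escalate here
```

Returning a bool by default keeps the common case quiet, while the opt-in exception lets a supervisor turn a slow peer into an explicit escalation point.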
Note the granularity: this cancels an actor, not a task. For
one remote task use the Context layer instead.
Context.cancel()#
Request cancellation of exactly one remote task: the peer task of
an open tractor.Context. Two things to keep straight,
- it cancels the remote side only; a Context is not a trio.CancelScope and your local task keeps running until you exit the open_context() block,
- the resulting tractor.ContextCancelled is absorbed locally (you asked for it, after all) per the self- vs. cross-cancel rule above.
Again, The Context: a cross-actor task pair covers this dance in depth.
Graceful first, hard as a last resort#
Every process teardown in tractor walks the same escalation
ladder, top rung first,
1. graceful cancel request: a runtime-cancel msg over IPC; the target actor cancels its tasks, closes its channels and exits its trio.run() cleanly,
2. soft wait: the parent waits (bounded) for the child process to exit on its own,
3. SIGTERM: no ack within the bounded wait (internally an ActorTooSlowError) escalates to proc.terminate(),
4. SIGKILL ultimatum: still alive after the hard-kill timeout (~1.6s)? The runtime logs that the “T-800” has been deployed to collect the zombie and issues proc.kill(). No survivors.
The result is the no-zombies guarantee: tractor tries to
protect you from zombies, no matter what. Quoting the project
manifesto,
If you can create zombie child processes (without using a system signal) it is a bug.
Run the quickstart’s self-destructing process-tree demo
(examples/parallelism/we_are_processes.py, walked through in
Quickstart) under a pstree watcher and try to catch a
straggler; we’ll wait B)
Roadmap: erlang-style strategies#
One-cancels-all is trio’s strategy and, for now, the only one
tractor ships. Pluggable erlang strategies — one-for-one
restarts, rest-for-one, transient/permanent child specs and friends
(see the supervision strategies canon) — are a long-standing
roadmap item tracked in #22. If supervisors are your jam that
issue is the place to sling opinions.
See also
- The Context: a cross-actor task pair (the cross-actor task layer where per-task cancellation actually lives),
- Typed messaging (the typed msg layer that raises tractor.MsgTypeError),
- “Native” multi-process debugging (what cancellation does, and very carefully does not do, while a REPL is up).