Cancellation and error propagation#

tractor supports trio’s cancellation system verbatim, then extends it across process boundaries. If you know how to cancel a task in trio you already know how to cancel an actor — and its whole subtree — in tractor; the runtime’s job is making that statement hold over IPC with every structured concurrency (SC) guarantee intact.

The ground rules,

  • a remote actor is never cancelled unless one of three things happens: it was explicitly requested (by a parent or peer), supervision demands it (an error triggered one-cancels-all teardown), or there’s a bug in tractor itself (please report it!),

  • (remote) errors always propagate back to the parent supervisor; nothing is silently dropped on the floor,

  • every spawned process gets reaped no matter how it dies; if you can create a zombie child process (without using a system signal) it is a bug.

trio cancellation, across the wire#

Locally everything is bog-standard trio: nurseries, cancel scopes, timeouts. tractor adds exactly one twist: a cancel scope can’t physically reach into another process, so the runtime relays cancellation as messages. Concretely,

  • cancelling an actor means sending it a runtime-cancel request msg; the target then runs its own graceful teardown — cancelling RPC tasks, closing channels, exiting its trio.run() — and acks the request back to the canceller,

  • cancelling a single cross-actor task works through the tractor.Context layer: each ctx task-pair is cancel-scope-linked over IPC such that either side erroring or cancelling relays an equivalent error to the other side (see The Context: a cross-actor task pair for the gory details),

  • a cancel is therefore always a request with an ack: the canceller does a bounded wait for confirmation and escalates if the peer is unresponsive (see the teardown ladder below).
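The request-then-ack shape can be modelled in a single process. The sketch below uses stdlib asyncio queues as a stand-in for an IPC channel so it runs anywhere; it is not tractor's wire protocol, and peer(), request_cancel(), demo() and the "cancel"/"ack" strings are all hypothetical names chosen for illustration.

```python
import asyncio


async def peer(inbox: asyncio.Queue, outbox: asyncio.Queue, log: list) -> None:
    """Stand-in for a remote actor: works until a cancel request arrives."""
    work = asyncio.create_task(asyncio.sleep(3600))  # a long-running "RPC task"
    msg = await inbox.get()
    if msg == "cancel":
        work.cancel()  # graceful teardown: cancel our own tasks first
        await asyncio.gather(work, return_exceptions=True)
        log.append("tasks cancelled")
        await outbox.put("ack")  # then confirm to the canceller
        log.append("acked")


async def request_cancel(
    inbox: asyncio.Queue,
    outbox: asyncio.Queue,
    timeout: float,
) -> bool:
    """Send the request, then wait a bounded time for the ack."""
    await inbox.put("cancel")
    try:
        return await asyncio.wait_for(outbox.get(), timeout) == "ack"
    except asyncio.TimeoutError:
        return False  # no confirmation: the caller has to escalate


async def demo() -> tuple:
    log: list = []
    to_peer, from_peer = asyncio.Queue(), asyncio.Queue()
    peer_task = asyncio.create_task(peer(to_peer, from_peer, log))
    acked = await request_cancel(to_peer, from_peer, timeout=0.5)
    await peer_task
    # an unresponsive peer: nobody reads this inbox, so no ack ever arrives
    silent = await request_cancel(asyncio.Queue(), asyncio.Queue(), timeout=0.05)
    return acked, silent, log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point to notice is that the canceller never reaches into the peer: it only sends a msg and waits, and the False result is the cue to move on to a harder form of teardown.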

One-cancels-all supervision#

A tractor.ActorNursery supervises subactors exactly like trio nurseries supervise tasks: when one child errors, the error propagates to the supervising block and all sibling subactors get cancelled before the error continues bubbling up the (process) tree.

[Figure: error propagation up a subactor tree. Caption: One-cancels-all: no zombies, no lost errors.]

examples/remote_error_propagation.py#
import trio
import tractor


async def assert_err():
    assert 0


async def main():
    async with tractor.open_nursery() as n:
        real_actors = []
        for i in range(3):
            real_actors.append(await n.start_actor(
                f'actor_{i}',
                enable_modules=[__name__],
            ))

        # start one actor that will fail immediately
        await n.run_in_actor(assert_err)

    # should error here with a ``RemoteActorError`` containing
    # an ``AssertionError`` and all the other actors have been cancelled


if __name__ == '__main__':
    try:
        # also raises
        trio.run(main)
    except tractor.RemoteActorError:
        print("Look Maa that actor failed hard, hehhh!")

What’s going on here?

  • three healthy actors are spawned as daemons via tractor.ActorNursery.start_actor(); left alone they’d happily idle forever,

  • a fourth actor runs assert_err() via .run_in_actor() and promptly trips its assert 0,

  • the resulting AssertionError ships back over IPC as a serialized error msg and re-raises boxed inside the nursery block as a tractor.RemoteActorError,

  • the nursery reacts like any trio nursery would: it cancels the three healthy siblings (graceful runtime-cancel requests, acks awaited), reaps all four processes, then re-raises,

  • trio.run(main) sees that same RemoteActorError in the parent-most process — propagation is end-to-end or bust.
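To see the shape of one-cancels-all without spawning any processes, here is a minimal single-process model that uses stdlib asyncio tasks in place of actors. It sketches the supervision pattern only and is not how tractor implements it; idle_forever(), fail_fast() and supervise() are names made up for this illustration.

```python
import asyncio


async def idle_forever(name: str, cancelled: list) -> None:
    """A healthy 'daemon' child: idles until someone cancels it."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        cancelled.append(name)  # record the teardown, then let it propagate
        raise


async def fail_fast() -> None:
    """The child that trips immediately, like assert_err() above."""
    assert 0


async def supervise(cancelled: list) -> None:
    children = [
        asyncio.create_task(idle_forever(f"actor_{i}", cancelled))
        for i in range(3)
    ]
    children.append(asyncio.create_task(fail_fast()))

    # wake up as soon as any child raises
    done, pending = await asyncio.wait(
        children,
        return_when=asyncio.FIRST_EXCEPTION,
    )
    for task in pending:
        task.cancel()  # one child failed: cancel every sibling
    await asyncio.gather(*pending, return_exceptions=True)  # wait for all of them to finish
    for task in done:
        task.result()  # re-raise the child's error to our own caller


if __name__ == "__main__":
    torn_down: list = []
    try:
        asyncio.run(supervise(torn_down))
    except AssertionError:
        print("siblings cancelled:", sorted(torn_down))
```

The ordering matters: siblings are cancelled and awaited before the error is re-raised, so nothing is left running by the time the caller sees the failure.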

This one-cancels-all style is currently the only supervision strategy offered (it’s the one trio gives you); more Erlang-style strategies are on the roadmap, see the bottom of this page.

The boxed-error bestiary#

All remote failures arrive locally as one of a small set of exception types, each carrying enough metadata to work out who failed, where, and why.

RemoteActorError#

The workhorse: a “boxed” exception relayed over IPC from another actor. The original error’s type, traceback string and msgdata are preserved so you can pattern-match on what actually went wrong remotely,

  • .boxed_type: the reconstructed type of the original remote exception (ValueError, NameError, what have you),

  • .src_uid: the (name, uuid) pair of the actor where the error originated,

  • .relay_uid / .relay_path: when an error crosses more than one actor boundary (grandchild -> child -> root) every relaying actor is recorded; multi-hop boxings are lovingly referred to as “inceptions” in the runtime internals,

  • .pformat(): a rich “tb box” rendering of the remote traceback for your logs or REPL.

try:
    async with portal.open_context(ep) as (ctx, first):
        ...
except tractor.RemoteActorError as rae:
    if rae.boxed_type is ValueError:
        ...  # the remote task raised `ValueError`
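For intuition about what "boxing" across several hops means, here is a toy model in plain Python. BoxedError and box() are hypothetical stand-ins, not tractor's classes; only the attribute names boxed_type, src_uid and relay_path mirror the ones listed above. Whether the real relay path also includes the source actor is not something this sketch claims: here it records relaying hops only.

```python
import traceback
from typing import List, Optional, Tuple

Uid = Tuple[str, str]  # (name, uuid), as described for .src_uid


class BoxedError(Exception):
    """Hypothetical stand-in for a relayed remote error."""

    def __init__(
        self,
        boxed_type: type,
        tb_str: str,
        src_uid: Uid,
        relay_path: Optional[List[Uid]] = None,
    ) -> None:
        super().__init__(f"{boxed_type.__name__} from {src_uid}")
        self.boxed_type = boxed_type  # type of the original exception
        self.tb_str = tb_str          # original traceback, as text
        self.src_uid = src_uid        # where the error originated
        self.relay_path = relay_path or []  # assumption: relaying hops only


def box(exc: BaseException, actor_uid: Uid) -> BoxedError:
    """Box `exc` for shipping to the parent of `actor_uid`."""
    if isinstance(exc, BoxedError):
        # already boxed further down the tree: keep the origin, add this hop
        return BoxedError(
            exc.boxed_type,
            exc.tb_str,
            exc.src_uid,
            exc.relay_path + [actor_uid],
        )
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return BoxedError(type(exc), tb, actor_uid)


if __name__ == "__main__":
    try:
        raise ValueError("bad input")
    except ValueError as err:
        from_grandchild = box(err, ("grandchild", "uuid-3"))
    at_root = box(from_grandchild, ("child", "uuid-2"))
    print(at_root.boxed_type, at_root.src_uid, at_root.relay_path)
```

However many hops the error takes, the original type and source survive, which is what lets the root match on boxed_type as in the snippet above.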

ContextCancelled#

The cancel-ack for a cross-actor task pair: raised when a tractor.Context task is cancelled by request. Its .canceller attr is the uid of the actor which requested the cancel, which powers the key rule,

  • if you requested it (you called tractor.Context.cancel()) the resulting ctxc is absorbed at open_context() exit: an expected outcome, not an error,

  • if anyone else did — the peer task, or some third-party actor — it raises locally so your code always hears about it.
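The self- vs. cross-cancel rule boils down to one comparison on the canceller's uid. The sketch below models it with a plain context manager; CtxCancelled and ctx_block() are hypothetical names and this is not tractor's open_context() implementation, just the decision it is described as making at block exit.

```python
from contextlib import contextmanager
from typing import Iterator, Tuple

Uid = Tuple[str, str]


class CtxCancelled(Exception):
    """Hypothetical stand-in for a cancel-ack carrying the canceller's uid."""

    def __init__(self, canceller: Uid) -> None:
        super().__init__(f"cancelled by {canceller}")
        self.canceller = canceller


@contextmanager
def ctx_block(my_uid: Uid) -> Iterator[None]:
    """Absorb cancels we requested ourselves, surface everyone else's."""
    try:
        yield
    except CtxCancelled as ctxc:
        if ctxc.canceller != my_uid:
            raise  # the peer or a third party cancelled: the caller must hear it
        # we asked for it: an expected outcome, so swallow it here


if __name__ == "__main__":
    me, peer = ("parent", "uuid-1"), ("child", "uuid-2")

    with ctx_block(me):
        raise CtxCancelled(me)  # absorbed silently

    try:
        with ctx_block(me):
            raise CtxCancelled(peer)
    except CtxCancelled as ctxc:
        print("cross-cancel surfaced, requested by", ctxc.canceller)
```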

The full self- vs. cross-cancel semantics are a core teaching point of The Context: a cross-actor task pair; go read them there.

MsgTypeError#

An IPC-payload “type error”: a msg violated the dialog’s declared payload spec. See Typed messaging for the typed-messaging system which enforces it.

TransportClosed#

The underlying IPC transport (TCP stream, UDS socket, …) died or closed out from under a channel. You’ll normally only see this surface when a peer hard-exits without any graceful runtime teardown; the supervision machinery treats unexpected transport loss on a busy channel as a failure and tears down accordingly.

Pick your blast radius#

Three cancel surfaces, three scopes of effect; choose the smallest hammer that does the job.

surface                          cancels                          typical use
-------------------------------  -------------------------------  --------------------
tractor.ActorNursery.cancel()    every subactor in the nursery    whole-tree teardown
tractor.Portal.cancel_actor()    one actor: full runtime + proc   daemon teardown
tractor.Context.cancel()         exactly one remote task          surgical task cancel

ActorNursery.cancel()#

The big red button: gracefully cancel every subactor supervised by the nursery, in parallel, with the escalation discipline below applied per-child. It’s invoked for you whenever an error hits the nursery block (one-cancels-all); call it yourself for an orderly early shutdown. Passing hard_kill=True skips the graceful phase and goes straight to OS-level process termination — rarely what you want outside tests.

Portal.cancel_actor()#

Cancel one whole actor: its entire runtime, every task it’s scheduled, and (for subactors) the OS process, via a graceful runtime-cancel request,

await portal.cancel_actor()    # bounded wait, bool result
await portal.cancel_actor(
    raise_on_timeout=True,     # no ack in time?
)                              # -> `ActorTooSlowError`

The wait for the peer’s ack is bounded (default Portal.cancel_timeout = 0.5 seconds, tunable per call via timeout=). By default a missed ack just returns False; with raise_on_timeout=True you instead get an ActorTooSlowError (from tractor._exceptions) so your code can escalate per SC discipline — exactly what the nursery’s own teardown does internally before resorting to OS-level signalling.

Note the granularity: this cancels an actor, not a task. For one remote task use the Context layer instead.

Context.cancel()#

Request cancellation of exactly one remote task: the peer task of an open tractor.Context. Two things to keep straight,

  • it cancels the remote side only; a Context is not a trio.CancelScope and your local task keeps running until you exit the open_context() block,

  • the resulting tractor.ContextCancelled is absorbed locally (you asked for it, after all) per the self- vs. cross-cancel rule above.

Again, The Context: a cross-actor task pair covers this dance in depth.

Graceful first, hard as a last resort#

Every process teardown in tractor walks the same escalation ladder, top rung first,

  1. graceful cancel request: a runtime-cancel msg over IPC; the target actor cancels its tasks, closes its channels and exits its trio.run() cleanly,

  2. soft wait: the parent waits (bounded) for the child process to exit on its own,

  3. SIGTERM: no ack within the bounded wait (internally an ActorTooSlowError) escalates to proc.terminate(),

  4. SIGKILL ultimatum: still alive after the hard-kill timeout (~1.6s)? The runtime logs that the “T-800” has been deployed to collect the zombie and issues proc.kill(). No survivors.
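The ladder is easy to try on an ordinary child process. The sketch below walks the four rungs with stdlib subprocess calls only; reap() and its parameters are hypothetical, the graceful request is passed in as a callable because the real one is an IPC msg, and reusing 0.5s and 1.6s as the defaults is an assumption borrowed from the figures quoted on this page rather than tractor's exact internal wiring.

```python
import subprocess
import sys
from typing import Callable


def reap(
    proc: subprocess.Popen,
    request_graceful_exit: Callable[[subprocess.Popen], None],
    soft_timeout: float = 0.5,
    term_timeout: float = 1.6,
) -> str:
    """Walk the ladder; return the name of the rung that ended the process."""
    request_graceful_exit(proc)  # rung 1: ask nicely
    try:
        proc.wait(timeout=soft_timeout)  # rung 2: bounded soft wait
        return "graceful"
    except subprocess.TimeoutExpired:
        pass

    proc.terminate()  # rung 3: SIGTERM on POSIX
    try:
        proc.wait(timeout=term_timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        pass

    proc.kill()  # rung 4: SIGKILL, cannot be ignored
    proc.wait()  # always collect the exit status, so no zombie remains
    return "killed"


if __name__ == "__main__":
    # a child that ignores our (no-op) graceful request and just sleeps
    sleeper = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
    )
    print(reap(sleeper, lambda p: None), sleeper.returncode)
```

Every path out of reap() ends in a successful wait(), which is the part that actually prevents zombies: a terminated child whose exit status is never collected stays in the process table.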

The result is the no-zombies guarantee: tractor tries to protect you from zombies, no matter what. Quoting the project manifesto,

If you can create zombie child processes (without using a system signal) it is a bug.

Run the quickstart’s self-destructing process-tree demo (examples/parallelism/we_are_processes.py, walked through in Quickstart) under a pstree watcher and try to catch a straggler; we’ll wait B)

Roadmap: erlang-style strategies#

One-cancels-all is trio’s strategy and, for now, the only one tractor ships. Pluggable Erlang-style strategies (one-for-one restarts, rest-for-one, transient/permanent child specs and friends; see the supervision strategies canon) are a long-standing roadmap item tracked in #22. If supervisors are your jam that issue is the place to sling opinions.

See also