Open
Labels
bug: Something isn't working
Description
On a keyboard interrupt (Ctrl+C), node threads shut down gracefully, but the node process sometimes hangs after receiving a stop request from the signal handler. This is likely a race condition specific to worker and user nodes, which require multi-process communication with their PyTorch workflows (unconfirmed).
To reproduce, run `examples/distributed_workflow.py` and interrupt it at any point after all the nodes have started. Once all node threads have shut down and printed `Node stopped.` to the console, one node process is held up waiting for a response to the `stop` signal.
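
Until the root cause is confirmed, one possible mitigation is to bound the wait for the stop acknowledgement and fall back to terminating the child process. The sketch below is not taken from this project's code: `worker_loop`, `stop_child`, `stop_event`, and `stopped_event` are hypothetical names, and it assumes the child is a `multiprocessing.Process` coordinated through two `multiprocessing.Event` objects.

```python
def worker_loop(stop_event, stopped_event, poll_interval=0.1):
    """Hypothetical child-side loop: poll for a stop request, then acknowledge it."""
    # Event.wait returns True once the event is set, False on timeout.
    while not stop_event.wait(poll_interval):
        pass  # placeholder for one unit of real work
    stopped_event.set()  # acknowledge so the parent can stop waiting


def stop_child(proc, stop_event, stopped_event, timeout=5.0):
    """Hypothetical parent-side stop with a bounded wait.

    `proc` is assumed to behave like multiprocessing.Process, and the two
    events like multiprocessing.Event. Returns True if the child exited on
    its own, False if it had to be terminated.
    """
    stop_event.set()  # send the stop request
    if stopped_event.wait(timeout):  # bounded wait for the acknowledgement
        proc.join(timeout)  # give the child time to exit cleanly
    if proc.is_alive():
        # No acknowledgement, or the child is still running: do not hang.
        proc.terminate()
        proc.join()
        return False
    return True
```

A signal handler could call `stop_child(...)` for each worker or user node and log whenever it returns `False`, which would also help confirm how often the hang occurs.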