Graceful Shutdown Fail on Keyboard Interrupt #24

@mattjhawken

On a keyboard interrupt (Ctrl+C), node threads shut down gracefully, but the node process sometimes hangs after receiving a stop request from the signal handler. This is likely a race condition specific to worker and user nodes, which require multi-process communication with their PyTorch workflows (unconfirmed).

To reproduce, run examples/distributed_workflow.py and interrupt it at any point after all the nodes have started. Once all node threads have shut down and printed "Node stopped." to the console, one node process remains stuck waiting for a response to the stop signal.
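
One possible mitigation, sketched below under stated assumptions, is to bound the wait for the stop acknowledgement and fall back to terminating the child process. This is not the project's actual code: `node_process`, `child_main`, `stop_node`, the "stop"/"stopped" messages, and the two-queue layout are all hypothetical stand-ins for whatever request/response channel the nodes really use. It does not fix the suspected race itself; it only keeps the parent from blocking forever.

```python
import multiprocessing as mp
import queue
import signal


def node_process(requests, responses):
    """Hypothetical node loop: block until a "stop" request arrives, then ack it."""
    while True:
        msg = requests.get()
        if msg == "stop":
            responses.put("stopped")
            return


def child_main(requests, responses):
    """Hypothetical child entry point (use as the mp.Process target).

    Ignores SIGINT so that only the parent reacts to Ctrl+C and the child
    is shut down through the explicit stop request instead.
    """
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    node_process(requests, responses)


def stop_node(proc, requests, responses, timeout=2.0):
    """Send a stop request and wait at most `timeout` seconds for the ack.

    Returns True only if the node acknowledged the request and exited on
    its own. If it is still alive after the timeout, it is terminated and
    False is returned.
    """
    requests.put("stop")
    try:
        responses.get(timeout=timeout)  # bounded wait instead of blocking forever
        acknowledged = True
    except queue.Empty:
        acknowledged = False

    proc.join(timeout=timeout)
    if proc.is_alive():
        proc.terminate()  # last resort: the child never exited
        proc.join()
        return False
    return acknowledged
```

In the parent, the idea would be to start `mp.Process(target=child_main, args=(requests, responses))` with two `mp.Queue()` objects and call `stop_node(proc, requests, responses)` from the `KeyboardInterrupt` path. Logging which nodes return `False` could also help confirm whether the hang is limited to worker and user nodes.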

Labels: bug