Skip to content

Conversation

@Groxx
Copy link
Member

@Groxx Groxx commented Oct 15, 2025

Probable fix for our racy shutdown issue: just wait for the coroutine's goroutine to close, before stopping the next.

Since this is fairly likely to lead to getting stuck in incorrect code, it also reports failures to stop "quickly" (1 second), which is quite permissive as ~all correct code should finish on the order of a millisecond or less. Extreme system load could still cause this to be exceeded, but it retains correct behavior even then so I think we can address those log/metric complaints if/when they occur.


This also includes an earlier proof-of-concept to fix what might be a source of our goroutine leaks: using the wrong context writes to the wrong aboutToBlock channel, causing more writes to occur on it than reads, causing a deadlock.

The good(?) news is that fixing our shutdown logic reveals this deadlock in the ContextMisuse test - tests get stuck, and a ctrl-\ shows its stacks. Merging in the wrong-chan-write fix resolves that, and tests run quickly/normally again.

Thanks to some internal tooling to look for deadlocked goroutines, this test:
> appears to go.uber.org/cadence/internal.(*WorkflowTestSuiteUnitTest).Test_ContextMisuse
appears to trigger a goroutine leak.

Since it's not doing anything *impossible* for users to do (it's just using the wrong context arg),
it seems like this implies we've got incorrect goroutine shutdown code.

I'm not yet fully confident that this is The Fix and that it does not cause other issues, but it seems pretty likely so far.
I don't know if this would explain the in-production leaks we've seen in the past but not reproduced (those seem to be query-triggered somehow), but it *does* seem quite likely that this has leaked somewhere.  Just hopefully also caused errors that pointed to bad code which led to a fix.
this relies on changes merged in from maybe-bad-yield, as otherwise some tests deadlock.

Signed-off-by: Steven L <imgroxx@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant