arjan-bal

Original PRs: #8610, #8615

RELEASE NOTES:

  • balancer/pick_first:
    • Fix bug that can cause balancer to get stuck in IDLE state on backend address change.
    • When configured, shuffle addresses in resolver updates that lack endpoints. Since gRPC automatically adds endpoints to resolver updates, this bug should only affect implementers of custom LB policies that use pick_first for delegation but don't forward the endpoints.

… endpoints (grpc#8610)

The new `pick_first`, which is the default, doesn't shuffle addresses at
all for resolver updates that are missing the `Endpoints` field, even
when shuffling is configured. This change fixes that. Since [gRPC
automatically sets the missing
`Endpoints`](https://github.com/grpc/grpc-go/blob/1059e84f885bf7ed65b3b1a4fbe914360d8ab5b1/resolver_wrapper.go#L136-L138),
this bug should rarely occur in practice.

RELEASE NOTES:
* balancer/pick_first: When configured, shuffle addresses in resolver
updates that lack endpoints. Since gRPC automatically adds endpoints to
resolver updates, this bug should only affect implementers of custom LB
policies that use pick_first for delegation but don't forward the
endpoints.
…pc#8615)

Related issue: b/415354418

## Problem

On connection breakage, the pickfirst leaf balancer enters idle and
returns an `Idle picker` that calls the balancer's `ExitIdle` method
only the first time `Pick` is called. The following sequence of events
will cause the balancer to get stuck in `Idle` state:
1. Existing connection breaks, SubConn [requests re-resolution and
reports
IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393).
In turn PF updates the ClientConn state to IDLE with an `Idle picker`.
1. An RPC is made, triggering `balancer.ExitIdle` through the idle
picker. The balancer attempts to re-connect the failed SubConn.
1. The resolver produces a new endpoint list, removing the endpoint used
by the existing SubConn. PF removes the existing SubConn. Since the
balancer didn't update the ClientConn state to CONNECTING yet, pickfirst
thinks that it's still in IDLE and doesn't start connecting to the new
endpoints.
1. New RPC requests trigger the idle picker, but it's a no-op since it
only [triggers the balancer's ExitIdle method
once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663).
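
The "only once" behavior in the last step can be modeled with a small sketch. The `idlePicker` type and `errNoSubConnAvailable` below are hypothetical stand-ins written for this example (the real picker lives in `pickfirstleaf.go` and has a different signature). Using `sync.Once` here is an assumption about one way to get the once-only semantics.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSubConnAvailable is a stand-in for the error a picker returns while
// no connection is ready; it is not the real balancer error value.
var errNoSubConnAvailable = errors.New("no SubConn available")

// idlePicker is a hypothetical model of a picker that asks the balancer to
// exit idle the first time Pick is called and does nothing on later calls.
type idlePicker struct {
	once     sync.Once
	exitIdle func()
}

// Pick triggers exitIdle at most once and always reports that no SubConn is
// available yet, so the RPC has to wait for a new picker.
func (p *idlePicker) Pick() error {
	p.once.Do(p.exitIdle)
	return errNoSubConnAvailable
}

func main() {
	calls := 0
	p := &idlePicker{exitIdle: func() { calls++ }}
	for i := 0; i < 3; i++ {
		_ = p.Pick()
	}
	fmt.Println("ExitIdle calls:", calls) // prints "ExitIdle calls: 1"
}
```

Because later picks never call `exitIdle` again, the balancer only recovers if it installs a new picker, which is why staying in IDLE after step 3 leaves it stuck.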

## Fix

This change moves the ClientConn into Connecting immediately when the
`ExitIdle` method is called. This ensures that the balancer continues to
re-connect when a new endpoint list is produced by the resolver.
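
The effect of the fix can be sketched with a toy, single-goroutine state machine. Everything here (`toyBalancer`, `eagerConnecting`, the string endpoints) is invented for illustration and is not the pickfirst implementation. The model also assumes the resolver update always drops the endpoint that was being dialed, as in the sequence from the Problem section.

```go
package main

import "fmt"

type connState int

const (
	idle connState = iota
	connecting
)

// toyBalancer is a hypothetical model of the state tracking described above.
type toyBalancer struct {
	reported        connState // last state pushed to the ClientConn
	eagerConnecting bool      // true models the behavior after this change
	target          string    // endpoint currently being dialed, "" if none
}

// ExitIdle starts reconnecting to addr. With eagerConnecting set, it also
// reports CONNECTING right away instead of waiting for the SubConn.
func (b *toyBalancer) ExitIdle(addr string) {
	if b.reported != idle {
		return
	}
	b.target = addr
	if b.eagerConnecting {
		b.reported = connecting
	}
}

// UpdateEndpoints models a resolver update that removes the old endpoint.
// New connection attempts start only if the balancer has already left IDLE.
func (b *toyBalancer) UpdateEndpoints(addrs []string) {
	b.target = "" // the SubConn for the removed endpoint is dropped
	if b.reported == idle {
		return // waits for an ExitIdle call that the idle picker won't repeat
	}
	if len(addrs) > 0 {
		b.target = addrs[0]
	}
}

func main() {
	for _, eager := range []bool{false, true} {
		b := &toyBalancer{eagerConnecting: eager}
		b.ExitIdle("old:443")
		b.UpdateEndpoints([]string{"new:443"})
		fmt.Printf("eager=%v dialing=%q\n", eager, b.target)
	}
}
```

With `eagerConnecting` false the model ends up dialing nothing, matching the stuck IDLE state. With it true, the resolver update leads to a connection attempt against the new endpoint.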

RELEASE NOTES:
* balancer/pickfirst: Fix bug that can cause balancer to get stuck in
`IDLE` state on connection failure.
@arjan-bal added this to the 1.76 Release milestone on Oct 1, 2025

codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.95%. Comparing base (0513350) to head (e6c9711).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| balancer/pickfirst/pickfirst.go | 0.00% | 1 Missing ⚠️ |
| balancer/pickfirst/pickfirstleaf/pickfirstleaf.go | 88.88% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           v1.76.x    #8621      +/-   ##
===========================================
- Coverage    82.12%   81.95%   -0.18%     
===========================================
  Files          415      415              
  Lines        40686    40693       +7     
===========================================
- Hits         33412    33348      -64     
- Misses        5896     5968      +72     
+ Partials      1378     1377       -1     

| Files with missing lines | Coverage Δ |
|---|---|
| balancer/pickfirst/pickfirst.go | 2.46% <0.00%> (-32.10%) ⬇️ |
| balancer/pickfirst/pickfirstleaf/pickfirstleaf.go | 87.25% <88.88%> (+0.76%) ⬆️ |

... and 17 files with indirect coverage changes
