@maswin (Member) commented Jul 31, 2025

Description

This commit addresses several items:

  1. QueryCountBasedRouter was broken. Commit 3430f35 introduced a bug: provideBackendConfiguration was used to get the cluster instead of provideClusterForRoutingGroup, but QueryCountBasedRouter did not override provideBackendConfiguration, so the router was failing. This is now fixed.
  2. While trying to create a new RoutingManager, it was hard to tell which methods needed to be overridden, because the class contained a lot of internal methods. The public methods are now extracted into an interface, and the existing class is renamed BaseRoutingManager. The selectBackend method can be overridden to change the cluster selection logic.
  3. Modified QueryCountBasedRouter to use a ConcurrentHashMap instead of @GuardedBy("this"), which locks the entire object. This reduces lock contention.
  4. Fixed related tests
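
As a rough illustration of item 2, the sketch below shows the general shape of such a split: a small interface, an abstract base that handles fallback, and a single `selectBackend` hook for subclasses. The types are simplified stand-ins (backends are plain strings, and `SimpleRoutingManager`, `SimpleBaseRoutingManager`, and `FirstBackendRouter` are hypothetical names), not the actual gateway classes.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified stand-in for the RoutingManager interface.
interface SimpleRoutingManager
{
    String provideBackend(String routingGroup, String user);
}

// Shared logic lives here; subclasses only decide how to pick among candidates.
abstract class SimpleBaseRoutingManager
        implements SimpleRoutingManager
{
    private final Map<String, List<String>> backendsByGroup;
    private final String defaultGroup;

    SimpleBaseRoutingManager(Map<String, List<String>> backendsByGroup, String defaultGroup)
    {
        this.backendsByGroup = backendsByGroup;
        this.defaultGroup = defaultGroup;
    }

    @Override
    public final String provideBackend(String routingGroup, String user)
    {
        // Try the requested group first, then fall back to the default group.
        return selectBackend(backendsByGroup.getOrDefault(routingGroup, List.of()), user)
                .or(() -> selectBackend(backendsByGroup.getOrDefault(defaultGroup, List.of()), user))
                .orElseThrow(() -> new IllegalStateException("No active backends found"));
    }

    // The single extension point for routing strategies.
    protected abstract Optional<String> selectBackend(List<String> backends, String user);
}

// A trivial strategy: always pick the first candidate.
class FirstBackendRouter
        extends SimpleBaseRoutingManager
{
    FirstBackendRouter(Map<String, List<String>> backendsByGroup, String defaultGroup)
    {
        super(backendsByGroup, defaultGroup);
    }

    @Override
    protected Optional<String> selectBackend(List<String> backends, String user)
    {
        return backends.stream().findFirst();
    }
}

class RoutingSketchDemo
{
    public static void main(String[] args)
    {
        SimpleRoutingManager router = new FirstBackendRouter(
                Map.of("adhoc", List.of("adhoc-1", "adhoc-2"), "etl", List.of()),
                "adhoc");
        // "etl" has no backends, so the request falls back to the default group.
        System.out.println(router.provideBackend("etl", "alice"));
    }
}
```

In this sketch the public method is `final`, so a subclass cannot end up overriding a method that the base class no longer calls. That is a design choice of the sketch, not something the PR states.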

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required, with the following suggested text:

* Fix some things.

Summary by Sourcery

Refactor the routing manager hierarchy to improve extensibility and concurrency, fix a broken query-count-based router, and update related tests and dependencies

Bug Fixes:

  • Fix broken QueryCountBasedRouter provideBackendConfiguration logic by properly overriding the cluster selection method

Enhancements:

  • Extract RoutingManager into an interface and move the default implementation into a new BaseRoutingManager abstract class
  • Refactor StochasticRoutingManager and QueryCountBasedRouter to extend BaseRoutingManager and implement a common selectBackend strategy
  • Replace synchronized List in QueryCountBasedRouter with ConcurrentHashMap for clusterStats to reduce lock contention
  • Update HealthCheckObserver to use the new updateClusterStats API
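
To illustrate the lock-contention point, here is a minimal sketch of per-key atomic updates on a `ConcurrentHashMap`. The class name `ClusterQueryCounts` and its methods are hypothetical; the real router stores `LocalStats` objects rather than plain counters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-cluster counter store; names are illustrative only.
class ClusterQueryCounts
{
    private final ConcurrentHashMap<String, Integer> queuedByCluster = new ConcurrentHashMap<>();

    // Replace the count for one cluster without locking the whole object.
    void setQueued(String clusterId, int queued)
    {
        queuedByCluster.put(clusterId, queued);
    }

    // Atomically bump the count for one cluster and return the new value.
    int incrementQueued(String clusterId)
    {
        return queuedByCluster.merge(clusterId, 1, Integer::sum);
    }

    // Immutable copy of the current counts.
    Map<String, Integer> snapshot()
    {
        return Map.copyOf(queuedByCluster);
    }

    public static void main(String[] args)
            throws InterruptedException
    {
        ClusterQueryCounts counts = new ClusterQueryCounts();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    counts.incrementQueued("cluster-a");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(counts.snapshot());
    }
}
```

Single-key operations such as `put` and `merge` are atomic on their own. A compound action that reads several keys and then updates one of them still needs its own coordination, which may be why `selectBackend` stays `synchronized` in this PR.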

Tests:

  • Update unit tests to initialize backend and history managers, use updateClusterStats API, and adapt assertions to the new clusterStats map structure

Chores:

  • Remove error_prone_annotations dependency from pom.xml

@cla-bot cla-bot bot added the cla-signed label Jul 31, 2025
 * request object. Default implementation comes here.
 */
-public abstract class RoutingManager
+public interface RoutingManager

Can we add Javadoc for this interface and its methods, since this becomes our primary interface for all RoutingManager implementations?

Member

It's a good idea to have an interface to enforce the rules.

Member Author

Added javadoc

Comment on lines 90 to 92
CacheBuilder.newBuilder()
.maximumSize(10000)
.expireAfterAccess(30, TimeUnit.MINUTES)

nit: you can extract this into a separate builder and reuse it three times while building the caches.

for example

    private final CacheBuilder<Object, Object> builder = CacheBuilder.newBuilder()
            .maximumSize(10000)
            .expireAfterAccess(30, TimeUnit.MINUTES);

    queryIdBackendCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findBackendForUnknownQueryId(queryId);
                }
            });
    queryIdRoutingGroupCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findRoutingGroupForUnknownQueryId(queryId);
                }
            });
    queryIdExternalUrlCache = builder.build(
            new CacheLoader<>()
            {
                @Override
                public String load(String queryId)
                {
                    return findExternalUrlForUnknownQueryId(queryId);
                }
            });
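
The same de-duplication idea can be pushed one step further with a helper that takes only the loader function. The sketch below is a JDK-only stand-in: `SimpleLoadingCache` is a hypothetical class built on `ConcurrentHashMap.computeIfAbsent`, and unlike Guava's `CacheBuilder` it has no size bound or expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// JDK-only stand-in for a loading cache: no size bound or expiry.
class SimpleLoadingCache
{
    private final Map<String, String> values = new ConcurrentHashMap<>();
    private final Function<String, String> loader;

    SimpleLoadingCache(Function<String, String> loader)
    {
        this.loader = loader;
    }

    // Return the cached value, invoking the loader on a miss.
    String get(String key)
    {
        return values.computeIfAbsent(key, loader);
    }
}

class QueryIdCaches
{
    // One factory method replaces repeated copies of the same construction code.
    static SimpleLoadingCache buildCache(Function<String, String> loader)
    {
        return new SimpleLoadingCache(loader);
    }

    public static void main(String[] args)
    {
        // The loaders here are placeholders for the real lookup methods.
        SimpleLoadingCache backendCache = buildCache(queryId -> "backend-for-" + queryId);
        SimpleLoadingCache routingGroupCache = buildCache(queryId -> "group-for-" + queryId);
        System.out.println(backendCache.get("q1"));
        System.out.println(routingGroupCache.get("q1"));
    }
}
```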

Member Author

Done

Comment on lines 300 to 303
TrinoStatus status = backendToStatus.get(backendId);
if (status == null) {
    return true;
}

Suggested change
-TrinoStatus status = backendToStatus.get(backendId);
-if (status == null) {
-    return true;
-}
+TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
+return status != TrinoStatus.HEALTHY;

Member Author

Done

Comment on lines 127 to 113
public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
{
    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends();
    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
    return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
}

/**
 * Performs routing to a given cluster group. This falls back to a default backend, if no scheduled
 * backend is found.
 */
@Override
public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
{
    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveBackends(routingGroup);
    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
    return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
}

Since we are removing the unhealthy backends from the candidate list, you can use a stream with a lambda to filter out the unwanted entries. It also simplifies the predicate that identifies healthy backends.

For example:

Suggested change
-public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
-{
-    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends();
-    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
-    return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
-}
-
-/**
- * Performs routing to a given cluster group. This falls back to a default backend, if no scheduled
- * backend is found.
- */
-@Override
-public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
-{
-    List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveBackends(routingGroup);
-    backends.removeIf(backend -> isBackendNotHealthy(backend.getName()));
-    return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
-}
+public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user) {
+    var backends = gatewayBackendManager.getActiveDefaultBackends()
+            .stream()
+            .filter(backend -> isBackendHealthy(backend.getName()))
+            .toList();
+    return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
+}
+
+@Override
+public ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user) {
+    var backends = gatewayBackendManager.getActiveBackends(routingGroup)
+            .stream()
+            .filter(backend -> isBackendHealthy(backend.getName()))
+            .toList();
+    return selectBackend(backends, user).orElseGet(() -> provideDefaultBackendConfiguration(user));
+}
+
+private boolean isBackendHealthy(String backendId) {
+    TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
+    return status == TrinoStatus.HEALTHY;
+}
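
The filtering approach suggested above can be tried in isolation. This is a minimal sketch with simplified stand-ins: backends are plain strings, and the nested `Status` enum is a hypothetical substitute for `TrinoStatus`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

class HealthFilterSketch
{
    // Simplified stand-in for the gateway's TrinoStatus enum.
    enum Status { HEALTHY, UNHEALTHY, UNKNOWN }

    private final Map<String, Status> backendToStatus = new ConcurrentHashMap<>();

    void updateHealth(String backendId, Status status)
    {
        backendToStatus.put(backendId, status);
    }

    // A backend with no recorded status is treated as UNKNOWN, hence not healthy.
    boolean isBackendHealthy(String backendId)
    {
        return backendToStatus.getOrDefault(backendId, Status.UNKNOWN) == Status.HEALTHY;
    }

    // Builds a new list instead of mutating the input, so immutable inputs are safe.
    List<String> healthyOnly(List<String> backends)
    {
        return backends.stream()
                .filter(this::isBackendHealthy)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        HealthFilterSketch sketch = new HealthFilterSketch();
        sketch.updateHealth("trino-1", Status.HEALTHY);
        sketch.updateHealth("trino-2", Status.UNHEALTHY);
        // "trino-3" has no recorded status and is filtered out as well.
        System.out.println(sketch.healthyOnly(List.of("trino-1", "trino-2", "trino-3")));
    }
}
```

One practical difference from `removeIf` is that the stream version never modifies the list returned by the backend manager, so it also works if that list is immutable.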

Member Author

Done

* This class performs health check, stats counts for each backend and provides a backend given
* request object. Default implementation comes here.
*/
public abstract class BaseRoutingManager

Suggestion: IMO, for ease of reading you may want to group methods by visibility and order them as follows:

1/ constructor
2/ abstract methods
3/ public method
4/ protected - package
5/ private

I see the abstract methods as an interface for subclasses so it is easier to discover them by future maintainers.

Member Author

Done

return routingGroup;
}

protected void updateBackEndHealth(List<ClusterStats> stats)

Can we delete this method?

It gets confusing, as we have 3 methods that update the state in the cache.
You can move this method's logic into the interface method public void updateClusterStats. This would simplify things and avoid overloading the interface method.
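
One way to read this suggestion is sketched below: the bulk `updateClusterStats` loops over the stats and delegates to the single-backend `updateBackEndHealth`, so no separate list-based overload is needed. The `Stats` record and `Status` enum are simplified, hypothetical stand-ins for `ClusterStats` and `TrinoStatus`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StatsUpdateSketch
{
    // Simplified stand-in for TrinoStatus.
    enum Status { HEALTHY, UNHEALTHY, UNKNOWN }

    // Simplified stand-in for ClusterStats: only the fields this sketch needs.
    record Stats(String clusterId, Status status) {}

    private final Map<String, Status> backendToStatus = new ConcurrentHashMap<>();

    // Single-backend update, for example when a backend is marked unhealthy by hand.
    void updateBackEndHealth(String backendId, Status status)
    {
        backendToStatus.put(backendId, status);
    }

    // Bulk update from a health-check round; delegates per entry.
    void updateClusterStats(List<Stats> stats)
    {
        for (Stats stat : stats) {
            updateBackEndHealth(stat.clusterId(), stat.status());
        }
    }

    Status statusOf(String backendId)
    {
        return backendToStatus.getOrDefault(backendId, Status.UNKNOWN);
    }

    public static void main(String[] args)
    {
        StatsUpdateSketch sketch = new StatsUpdateSketch();
        sketch.updateClusterStats(List.of(
                new Stats("trino-1", Status.HEALTHY),
                new Stats("trino-2", Status.UNHEALTHY)));
        System.out.println(sketch.statusOf("trino-1"));
        System.out.println(sketch.statusOf("trino-3"));
    }
}
```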

Member Author

Done

Comment on lines +255 to +230
if (entry.getValue().isDone()) {
    int responseCode = entry.getValue().get();
    if (responseCode == 200) {
        log.info("Found query [%s] on backend [%s]", queryId, entry.getKey());
        setBackendForQueryId(queryId, entry.getKey());
        return entry.getKey();
    }
}

Q: Curious, why did you add the condition check for future.isDone()?

Member Author

Most of this code already exists - https://github.com/trinodb/trino-gateway/blob/main/gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
I renamed the file from RoutingManager to BaseRoutingManager and created a new interface file named RoutingManager. So git, instead of marking it as a file rename, treated everything in the file as new changes.
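
The thread does not say why the existing code checks `isDone()`. One plausible reading is that it keeps the scan from blocking on probes that are still running, since `Future.get()` waits for completion. The sketch below shows that behaviour; `FutureScanSketch` and `firstCompletedOk` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class FutureScanSketch
{
    // Returns the first backend whose probe has already completed with HTTP 200.
    // Probes that are still running are skipped rather than waited on.
    static Optional<String> firstCompletedOk(Map<String, Future<Integer>> probes)
            throws ExecutionException, InterruptedException
    {
        for (Map.Entry<String, Future<Integer>> entry : probes.entrySet()) {
            if (entry.getValue().isDone()) {
                // get() returns immediately here because the future is done.
                int responseCode = entry.getValue().get();
                if (responseCode == 200) {
                    return Optional.of(entry.getKey());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args)
            throws Exception
    {
        Map<String, Future<Integer>> probes = new LinkedHashMap<>();
        probes.put("still-running", new CompletableFuture<>());
        probes.put("not-found", CompletableFuture.completedFuture(404));
        probes.put("found", CompletableFuture.completedFuture(200));
        System.out.println(firstCompletedOk(probes));
    }
}
```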

@vishalya (Member) left a comment

Can we separate this PR into 2 different ones?
(1) Just fix the bug and add the minimal interface needed, and the fixed tests.
(2) Rest of the refactoring

@maswin maswin force-pushed the routing branch 2 times, most recently from 8853b9b to 7cdbcbf Compare August 15, 2025 19:47
@maswin (Member Author) commented Aug 15, 2025

> Can we separate out this PR into 2 different ones. (1) Just fix the bug and add the minimal interface needed, and the fixed tests. (2) Rest of the refactoring

I tried splitting it into 2 commits, but it feels a bit complicated, as the refactoring itself took care of the bug.
The changes are relatively small, but git for some reason treats BaseRoutingManager as a new file rather than a rename of the previous RoutingManager class (probably because there is a new interface with that name).

@vishalya (Member)

I'm concerned that having a default implementation in the BaseRoutingManager base class makes our routing logic brittle. We saw this when the interface change broke the query-count-based router.

A better approach would be to move the default logic into its own class, say DefaultRoutingManager. Then, concrete classes could use composition (a has-a relationship) to include this default behavior instead of inheriting it directly. This will decouple our concrete routers from the base class implementation, preventing similar breaks in the future.

On a related note, could you detail what new runtime tests are being added to catch these kinds of integration failures?
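
For comparison, the composition approach proposed here might look roughly like the following. This is only a sketch: `DefaultRoutingLogic` and `ComposedRouter` are hypothetical names, backends are plain strings, and the selection strategy is a placeholder.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical default logic, packaged as a standalone class rather than a base class.
final class DefaultRoutingLogic
{
    private final List<String> defaultBackends;

    DefaultRoutingLogic(List<String> defaultBackends)
    {
        this.defaultBackends = List.copyOf(defaultBackends);
    }

    // Applies the caller's selection strategy, falling back to the default backends.
    String route(
            List<String> candidates,
            String user,
            BiFunction<List<String>, String, Optional<String>> selector)
    {
        return selector.apply(candidates, user)
                .or(() -> selector.apply(defaultBackends, user))
                .orElseThrow(() -> new IllegalStateException("No active backends found"));
    }
}

// A concrete router that has-a DefaultRoutingLogic instead of extending a base class.
final class ComposedRouter
{
    private final DefaultRoutingLogic defaults;

    ComposedRouter(DefaultRoutingLogic defaults)
    {
        this.defaults = defaults;
    }

    String provideBackend(List<String> candidates, String user)
    {
        return defaults.route(candidates, user, this::selectBackend);
    }

    // Deterministic pick keyed on the user, purely for illustration.
    private Optional<String> selectBackend(List<String> backends, String user)
    {
        if (backends.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(backends.get(Math.floorMod(user.hashCode(), backends.size())));
    }

    public static void main(String[] args)
    {
        ComposedRouter router = new ComposedRouter(new DefaultRoutingLogic(List.of("default-1")));
        // No candidates in the requested group, so the default backend is used.
        System.out.println(router.provideBackend(List.of(), "alice"));
    }
}
```

The trade-off is that every router must forward each interface method to the delegate explicitly, which is more code than inheriting it but makes the dependency on the default logic visible.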

@maswin (Member Author) commented Aug 16, 2025

> I'm concerned that having a default implementation in the BaseRoutingManager base class makes our routing logic brittle. We saw this when the interface change broke the query-count-based router.
>
> A better approach would be to move the default logic into its own class, say DefaultRoutingManager. Then, concrete classes could use composition (a has-a relationship) to include this default behavior instead of inheriting it directly. This will decouple our concrete routers from the base class implementation, preventing similar breaks in the future.
>
> On a related note, could you detail what new runtime tests are being added to catch these kinds of integration failures?

Composition might make things very complicated. An interface with an abstract base implementation is a common pattern and should be OK. For instance, in Trino there is a TrinoCatalog interface with an AbstractTrinoCatalog implementation that has the common methods implemented.

The primary problem I see is that the interface is bloated and can be made leaner. There should be only 4 methods:

void updateBackEndHealth(String backendId, TrinoStatus value); // When user marks a backend unhealthy
void updateClusterStats(List<ClusterStats> stats); // Update based on JMX metrics
ProxyBackendConfiguration getBackendConfiguration(String routingGroup, String user); // Get for the first time and if not in local cache
ProxyBackendConfiguration setBackendConfiguration(String routingGroup, String user); // Set if not in local cache

Maintaining 3 separate caches and setting each one separately makes no sense, as they all point to the same data. One cache holding all the data together should be enough, exposing just one method to get and one to set the backend configuration.

This should make the interface less confusing to override and implement.

If this new lean interface sounds good, I can make the changes.
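
A single-cache layout along these lines could be sketched as follows. `QueryRouting` is a hypothetical value type that bundles what the three caches hold separately, and the plain `ConcurrentHashMap` is an unbounded stand-in for the size-limited, expiring cache the gateway actually uses.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class QueryRoutingCacheSketch
{
    // Hypothetical value type bundling backend, routing group, and external URL.
    record QueryRouting(String backend, String routingGroup, String externalUrl) {}

    // Unbounded map used only for illustration; no eviction or expiry.
    private final Map<String, QueryRouting> byQueryId = new ConcurrentHashMap<>();

    // One write covers all three pieces of data for a query.
    void setRoutingForQueryId(String queryId, QueryRouting routing)
    {
        byQueryId.put(queryId, routing);
    }

    // One lookup returns everything known about where the query was routed.
    Optional<QueryRouting> findRoutingForQueryId(String queryId)
    {
        return Optional.ofNullable(byQueryId.get(queryId));
    }

    public static void main(String[] args)
    {
        QueryRoutingCacheSketch cache = new QueryRoutingCacheSketch();
        cache.setRoutingForQueryId("q1", new QueryRouting("http://trino-1:8080", "adhoc", "https://trino-1.example.com"));
        System.out.println(cache.findRoutingForQueryId("q1").map(QueryRouting::routingGroup));
        System.out.println(cache.findRoutingForQueryId("q2").isPresent());
    }
}
```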

@Chaho12 Chaho12 self-requested a review August 20, 2025 13:27
@maswin maswin force-pushed the routing branch 2 times, most recently from 70113c2 to ed4c772 Compare August 22, 2025 20:07
return externalUrl;
}

private static LoadingCache<String, String> buildCache(Function<String, String> loader)
Member

does this have to be a static method?

Member Author

fixed

@andythsu (Member) commented Sep 8, 2025

this is definitely a good improvement to our routing logic. Thanks for the java doc.

@andythsu (Member) commented Oct 1, 2025

@maswin Do you still have bandwidth to work on this?

@maswin (Member Author) commented Oct 2, 2025

> @maswin Do you still have bandwidth to work on this?

Sorry about the delay. Just addressed the review comments

@jalpan-randeri left a comment

Thank you for making this change.

int backendId = Math.abs(RANDOM.nextInt()) % backends.size();
return backends.get(backendId);
}
void updateBackEndHealth(String backendId, TrinoStatus value);
Member

Can we rename this to updateClusterHealth, since we want to refer to Trino clusters as "cluster" instead of "backend"?

Member Author

We use "backend" in a number of places, including the UI code, e.g.:
ProxyBackendConfiguration
BackendStateManager
GatewayBackendManager

Isn't it better to change them all at once in a separate commit?

}
return externalUrl;
}
void setBackendForQueryId(String queryId, String backend);
Member

setClusterForQueryId

// Fallback on first active backend if queryId mapping not found.
return gatewayBackendManager.getActiveBackends(defaultRoutingGroup).get(0).getProxyTo();
}
String findBackendForQueryId(String queryId);
Member

findClusterForQueryId

@oneonestar (Member)

@sourcery-ai review

@sourcery-ai bot commented Oct 15, 2025

Reviewer's Guide

This PR refactors the routing framework by splitting the monolithic RoutingManager into a RoutingManager interface and a BaseRoutingManager implementation, fixes a bug in QueryCountBasedRouter by overriding the correct routing selection method and replacing a synchronized list with ConcurrentHashMap, renames and unifies stats and health‐update methods, and updates all affected tests to align with the new API.

Sequence diagram for backend health and stats update flow

sequenceDiagram
    participant HealthCheckObserver
    participant RoutingManager
    HealthCheckObserver->>RoutingManager: updateClusterStats(List<ClusterStats>)
    RoutingManager->>RoutingManager: updateBackEndHealth(clusterId, trinoStatus) (for each ClusterStats)
    RoutingManager->>RoutingManager: update internal backendToStatus map

Sequence diagram for backend selection via provideBackendConfiguration

sequenceDiagram
    participant Client
    participant RoutingManager
    participant BaseRoutingManager
    participant QueryCountBasedRouter
    Client->>RoutingManager: provideBackendConfiguration(routingGroup, user)
    RoutingManager->>BaseRoutingManager: provideBackendConfiguration(routingGroup, user)
    BaseRoutingManager->>QueryCountBasedRouter: selectBackend(backends, user)
    QueryCountBasedRouter-->>BaseRoutingManager: Optional<ProxyBackendConfiguration>
    BaseRoutingManager-->>Client: ProxyBackendConfiguration

Class diagram for refactored RoutingManager hierarchy

classDiagram
    class RoutingManager {
        <<interface>>
        +void updateBackEndHealth(String backendId, TrinoStatus value)
        +void updateClusterStats(List<ClusterStats> stats)
        +void setBackendForQueryId(String queryId, String backend)
        +void setRoutingGroupForQueryId(String queryId, String routingGroup)
        +String findBackendForQueryId(String queryId)
        +String findExternalUrlForQueryId(String queryId)
        +String findRoutingGroupForQueryId(String queryId)
        +ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
    }
    class BaseRoutingManager {
        -GatewayBackendManager gatewayBackendManager
        -ConcurrentHashMap<String, TrinoStatus> backendToStatus
        -String defaultRoutingGroup
        -QueryHistoryManager queryHistoryManager
        -LoadingCache<String, String> queryIdBackendCache
        -LoadingCache<String, String> queryIdRoutingGroupCache
        -LoadingCache<String, String> queryIdExternalUrlCache
        +ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
        +ProxyBackendConfiguration provideBackendConfiguration(String routingGroup, String user)
        +void updateBackEndHealth(String backendId, TrinoStatus value)
        +void updateClusterStats(List<ClusterStats> stats)
        +void setBackendForQueryId(String queryId, String backend)
        +void setRoutingGroupForQueryId(String queryId, String routingGroup)
        +String findBackendForQueryId(String queryId)
        +String findExternalUrlForQueryId(String queryId)
        +String findRoutingGroupForQueryId(String queryId)
        +abstract Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
    }
    class StochasticRoutingManager {
        +Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
    }
    class QueryCountBasedRouter {
        -ConcurrentHashMap<String, LocalStats> clusterStats
        +synchronized Map<String, LocalStats> clusterStats()
        +synchronized void updateClusterStats(List<ClusterStats> stats)
        #synchronized Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
    }
    RoutingManager <|.. BaseRoutingManager
    BaseRoutingManager <|-- StochasticRoutingManager
    BaseRoutingManager <|-- QueryCountBasedRouter

Class diagram for QueryCountBasedRouter stats refactor

classDiagram
    class QueryCountBasedRouter {
        -ConcurrentHashMap<String, LocalStats> clusterStats
        +synchronized Map<String, LocalStats> clusterStats()
        +synchronized void updateClusterStats(List<ClusterStats> stats)
        #synchronized Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
    }
    class LocalStats {
        +LocalStats(ClusterStats stats)
        +int runningQueryCount()
        +int queuedQueryCount()
        +String routingGroup()
        +TrinoStatus trinoStatus()
        +ProxyBackendConfiguration backendConfiguration()
    }
    QueryCountBasedRouter "1" *-- "*" LocalStats

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Extract RoutingManager interface and introduce BaseRoutingManager | • Defined RoutingManager interface to declare public routing API<br>• Created BaseRoutingManager to house shared logic (caches, health/stats updates, query lookup)<br>• Updated StochasticRoutingManager and QueryCountBasedRouter to extend BaseRoutingManager<br>• Removed internal helper methods from original RoutingManager class | RoutingManager.java<br>BaseRoutingManager.java<br>StochasticRoutingManager.java<br>QueryCountBasedRouter.java<br>HealthCheckObserver.java<br>pom.xml |
| Fix QueryCountBasedRouter override and improve concurrency | • Overrode selectBackend/provideBackendConfiguration in QueryCountBasedRouter to invoke correct base logic<br>• Replaced @GuardedBy synchronized List with ConcurrentHashMap for clusterStats<br>• Adjusted getClusterToRoute logic to use the new storage and override method signature | QueryCountBasedRouter.java |
| Rename and unify stats/health update methods | • Renamed updateBackEndStats to updateClusterStats across interface and implementations<br>• Standardized updateBackEndHealth signature in RoutingManager interface<br>• Modified HealthCheckObserver to call updateClusterStats | RoutingManager.java<br>BaseRoutingManager.java<br>HealthCheckObserver.java |
| Update tests to reflect refactoring | • Reworked TestQueryCountBasedRouter to initialize backendManager and historyManager, use updateClusterStats<br>• Adjusted cluster IDs in tests to include routingGroup suffix<br>• Changed TestRoutingManagerExternalUrlCache to use StochasticRoutingManager and updated constructor calls | TestQueryCountBasedRouter.java<br>TestRoutingManagerExternalUrlCache.java |

Possibly linked issues

  • #Gateway refuses to route to non-adhoc group if adhoc group is unhealthy: The PR fixes a bug in QueryCountBasedRouter where it failed to correctly select a healthy backend, leading to routing failures as described in the issue.

@sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java:232` </location>
<code_context>
+    // queries
     @Override
-    public synchronized void updateBackEndStats(List<ClusterStats> stats)
+    protected synchronized Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
     {
-        clusterStats = stats.stream().map(a -> new LocalStats(a)).collect(Collectors.toList());
</code_context>

<issue_to_address>
**suggestion:** selectBackend implementation does not handle missing clusterStats for a backend gracefully.

Consider adding logging or error handling for backends missing from clusterStats to help identify potential monitoring or initialization issues.

Suggested implementation:

```java
    @Override
    protected synchronized Optional<ProxyBackendConfiguration> selectBackend(List<ProxyBackendConfiguration> backends, String user)
    {
        // Log missing clusterStats for backends
        backends.stream()
                .filter(backend -> !clusterStats.containsKey(backend.getName()))
                .forEach(backend -> {
                    logger.warn("Missing clusterStats for backend: {}", backend.getName());
                });

        Optional<ProxyBackendConfiguration> cluster = backends.stream()
                .filter(backend -> clusterStats.containsKey(backend.getName()))
                .min((a, b) -> compareStats(clusterStats.get(a.getName()), clusterStats.get(b.getName()), user));
        cluster.ifPresent(c -> updateLocalStats(clusterStats.get(c.getName()), user));
        return cluster;
    }
}

```

Make sure that a suitable logger (e.g., `private static final Logger logger = LoggerFactory.getLogger(QueryCountBasedRouter.class);`) is defined in the class if it is not already present.
If you use a different logging framework, adjust the logging statement accordingly.
</issue_to_address>
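
A standalone sketch of the idea in Comment 1 follows: candidates without stats are reported separately and excluded from selection, and the least-loaded remaining backend wins. The comparator (fewest queued, then fewest running) is an assumption made for illustration; the actual `compareStats` also takes the user into account. `LeastQueuedSelector` and its `Stats` record are hypothetical names.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

class LeastQueuedSelector
{
    // Simplified stand-in for LocalStats.
    record Stats(int runningQueryCount, int queuedQueryCount) {}

    private final Map<String, Stats> clusterStats = new ConcurrentHashMap<>();

    void update(String backend, Stats stats)
    {
        clusterStats.put(backend, stats);
    }

    // Candidates that have no stats yet; a caller could log these as warnings.
    List<String> missingStats(List<String> backends)
    {
        return backends.stream()
                .filter(backend -> !clusterStats.containsKey(backend))
                .collect(Collectors.toList());
    }

    // synchronized because compare-then-pick over several keys is a compound action.
    synchronized Optional<String> selectBackend(List<String> backends)
    {
        return backends.stream()
                .filter(clusterStats::containsKey)
                .min(Comparator
                        .comparingInt((String b) -> clusterStats.get(b).queuedQueryCount())
                        .thenComparingInt(b -> clusterStats.get(b).runningQueryCount()));
    }

    public static void main(String[] args)
    {
        LeastQueuedSelector selector = new LeastQueuedSelector();
        selector.update("trino-1", new Stats(5, 2));
        selector.update("trino-2", new Stats(9, 0));
        List<String> candidates = List.of("trino-1", "trino-2", "trino-3");
        System.out.println("missing stats: " + selector.missingStats(candidates));
        System.out.println("selected: " + selector.selectBackend(candidates));
    }
}
```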

### Comment 2
<location> `gateway-ha/src/main/java/io/trino/gateway/ha/router/BaseRoutingManager.java:52` </location>
<code_context>
+        implements RoutingManager
+{
+    private static final Logger log = Logger.get(BaseRoutingManager.class);
+    private final ExecutorService executorService = Executors.newFixedThreadPool(5);
+    private final GatewayBackendManager gatewayBackendManager;
+    private final ConcurrentHashMap<String, TrinoStatus> backendToStatus;
</code_context>

<issue_to_address>
**issue (bug_risk):** ExecutorService is never shut down, which may lead to resource leaks.

Consider adding a shutdown method or integrating ExecutorService lifecycle management to prevent resource leaks when RoutingManager instances are disposed.
</issue_to_address>
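
A minimal sketch of the lifecycle handling suggested in Comment 2 is shown below. `ProbeRunner` is a hypothetical class, and making it `AutoCloseable` is one option among several; how the gateway would actually hook shutdown into its lifecycle is not covered in this thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical owner of a thread pool that releases it explicitly.
class ProbeRunner
        implements AutoCloseable
{
    private final ExecutorService executorService = Executors.newFixedThreadPool(5);

    // Submits a task that simply returns the given code, standing in for an HTTP probe.
    Future<Integer> submitProbe(int simulatedResponseCode)
    {
        return executorService.submit(() -> simulatedResponseCode);
    }

    boolean isShutdown()
    {
        return executorService.isShutdown();
    }

    @Override
    public void close()
    {
        // Stop accepting new tasks, then give running ones a short grace period.
        executorService.shutdown();
        try {
            if (!executorService.awaitTermination(5, TimeUnit.SECONDS)) {
                executorService.shutdownNow();
            }
        }
        catch (InterruptedException e) {
            executorService.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args)
            throws Exception
    {
        // try-with-resources guarantees close() runs, which shuts the pool down.
        try (ProbeRunner runner = new ProbeRunner()) {
            System.out.println(runner.submitProbe(200).get());
        }
    }
}
```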

### Comment 3
<location> `gateway-ha/src/main/java/io/trino/gateway/ha/router/BaseRoutingManager.java:278` </location>
<code_context>
+                        });
+    }
+
+    private boolean isBackendHealthy(String backendId)
+    {
+        TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
</code_context>

<issue_to_address>
**suggestion:** isBackendHealthy defaults to TrinoStatus.UNKNOWN, which may mask unhealthy backends.

Since UNKNOWN is used as the default when a backend's status is missing, please confirm if UNKNOWN should be considered unhealthy. Also, consider adding logging to highlight when backend health is not tracked.

Suggested implementation:

```java
    private boolean isBackendHealthy(String backendId)
    {
        TrinoStatus status = backendToStatus.getOrDefault(backendId, TrinoStatus.UNKNOWN);
        if (status == TrinoStatus.UNKNOWN) {
            // Log a warning when backend health is not tracked
            log.warn("Backend health for '%s' is UNKNOWN and not tracked.", backendId);
            return false;
        }
        return status == TrinoStatus.HEALTHY;
    }
}

```

BaseRoutingManager already declares an airlift logger (`private static final Logger log = Logger.get(BaseRoutingManager.class);`), which takes `String.format`-style `%s` placeholders, so no additional logger declaration is needed.
</issue_to_address>

@vishalya (Member) commented Oct 21, 2025

sourcery-ai has 2 good suggestions. Otherwise LGTM, don't forget to rebase with the main branch.

@maswin (Member Author) commented Oct 23, 2025

> sourcery-ai has 2 good suggestions. Otherwise LGTM, don't forget to rebase with the main branch.

Done

6 participants