sanketkedia
Contributor

@sanketkedia sanketkedia commented Sep 5, 2025

Description of changes

  • Improvements & Bug fixes
    • This PR adds client-side retries to the Python client for retryable errors returned by the FE
  • New functionality
    • ...

Test plan

Tested in Tilt.

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

None. Need to send out a notice to upgrade clients

Observability plan

In tilt and staging

Documentation Changes

None

Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sanketkedia sanketkedia mentioned this pull request Sep 5, 2025
1 task

github-actions bot commented Sep 5, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@sanketkedia sanketkedia marked this pull request as ready for review September 5, 2025 00:19
Contributor

propel-code-bot bot commented Sep 5, 2025

Add and Standardize Client-Side Retries for Python and JavaScript Clients

This PR implements configurable client-side retry logic for both the Python (chromadb.api.fastapi.FastAPI, chromadb.api.async_fastapi.AsyncFastAPI) and JavaScript clients of ChromaDB. The retry system is designed to automatically retry appropriate requests (e.g., on network/transient server errors such as HTTP 502/503/504 status codes or network timeouts) with exponential backoff and optional jitter. The retry configuration (e.g., max attempts, min/max delay, factor, jitter) is made configurable in both languages via client arguments and settings. Support for retries is extended throughout all critical client paths, and the JS SDK, admin-client, and related infrastructure are updated accordingly.

Key Changes

• Introduced a retry configuration model (RetryConfig) for both Python (config.py) and JavaScript (retry.ts) with parameters for factor, minDelay, maxDelay, maxAttempts, and jitter.
• Added exponential backoff with jitter (using tenacity for Python sync/async and custom logic in JS) to all HTTP API requests in the Python and JS clients.
• Implemented retryable error/status detection for common transient failures (e.g., HTTP 502, 503, 504 and select network exceptions).
• Wired retry configuration into Python client settings, and made it injectable in JS/TS clients (including both ChromaClient and AdminClient).
• Provided a mechanism to disable retries by setting the retry config to null.
• Refactored the underlying fetch/request logic to fully support retries and avoid mutation pitfalls on repeated attempts.
• Updated client and utility defaults to enable retries out-of-the-box.
• Added supporting documentation and in-code comments to clarify retry usage and default behaviors.
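
To make the backoff behavior described in this summary concrete, here is a minimal Python sketch of how a delay schedule could be derived from such a configuration. The `RetryConfig` dataclass, `backoff_delay`, and `schedule` are hypothetical stand-ins, not the PR's code. The `min_delay` and `max_delay` defaults match the values quoted in the `chromadb/config.py` review comment in this thread; the `factor` and `max_attempts` defaults are illustrative assumptions, and the jitter shown (uniform between zero and the capped delay) is one common scheme, not necessarily the one either client uses.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetryConfig:
    """Hypothetical stand-in; field names follow the summary, defaults are illustrative."""

    factor: float = 2.0
    min_delay: float = 1.0
    max_delay: float = 5.0
    max_attempts: int = 5
    jitter: bool = True


def backoff_delay(config: RetryConfig, attempt: int, rng: random.Random) -> float:
    """Return the delay in seconds to wait after failed attempt number `attempt` (1-based)."""
    # Exponential growth: min_delay, min_delay * factor, min_delay * factor^2, ...
    raw = config.min_delay * (config.factor ** (attempt - 1))
    capped = min(raw, config.max_delay)
    if config.jitter:
        # Assumed jitter scheme: pick uniformly between 0 and the capped delay.
        return rng.uniform(0.0, capped)
    return capped


def schedule(config: RetryConfig, rng: Optional[random.Random] = None) -> List[float]:
    """Delays between attempts; max_attempts attempts means max_attempts - 1 waits."""
    if rng is None:
        rng = random.Random()
    return [backoff_delay(config, n, rng) for n in range(1, config.max_attempts)]
```

With jitter disabled and these illustrative defaults, `schedule(RetryConfig(jitter=False))` yields delays of 1, 2, 4, and 5 seconds: the fourth delay would be 8 seconds uncapped, but `max_delay` clamps it.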

Affected Areas

chromadb/api/fastapi.py
chromadb/api/async_fastapi.py
chromadb/config.py
clients/new-js/packages/chromadb/src/chroma-fetch.ts
clients/new-js/packages/chromadb/src/chroma-client.ts
clients/new-js/packages/chromadb/src/admin-client.ts
clients/new-js/packages/chromadb/src/utils.ts
clients/new-js/packages/chromadb/src/retry.ts

This summary was automatically generated by @propel-code-bot

Contributor Author

sanketkedia commented Sep 5, 2025

2 Jobs Failed:

PR checks / all-required-pr-checks-passed failed on "Decide whether the needed jobs succeeded or failed"
[...]
}
EOM
)"
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
  GITHUB_REPO_NAME: chroma-core/chroma
  PYTHONPATH: /home/runner/_work/_actions/re-actors/alls-green/release/v1/src
# ❌ Some of the required to succeed jobs failed 😢😢😢

📝 Job statuses:
📝 python-tests → ✓ success [required to succeed or be skipped]
📝 python-vulnerability-scan → ✓ success [required to succeed or be skipped]
📝 javascript-client-tests → ✓ success [required to succeed or be skipped]
📝 rust-tests → ❌ failure [required to succeed or be skipped]
📝 go-tests → ✓ success [required to succeed or be skipped]
📝 lint → ✓ success [required to succeed]
📝 check-helm-version-bump → ⬜ skipped [required to succeed or be skipped]
📝 delete-helm-comment → ✓ success [required to succeed or be skipped]
Error: Process completed with exit code 1.
PR checks / Rust tests / Integration test ci_k8s_integration_slow 1 failed on "Unknown Step"

Summary: 1 successful workflow, 1 failed workflow

Last updated: 2025-09-25 18:41:50 UTC

Contributor

@jairad26 jairad26 left a comment


Let's add the same retry logic for JS as well.

Contributor

@codetheweb codetheweb left a comment


Agreed, we should do this for JS as well.

retry=retry_if_exception_type(is_retryable_exception),
before_sleep=before_sleep_log(logger, logging.INFO),
reraise=True
)
Contributor


probably ok for now but maybe good to make this configurable in the future?

@jairad26 jairad26 force-pushed the 09-02-_enh_consolidate_retries branch from b614a69 to b381e79 Compare September 23, 2025 20:06
@jairad26 jairad26 force-pushed the 09-04-_enh_client_side_retries branch from 7690f2c to ef8c70f Compare September 23, 2025 20:06
Comment on lines 136 to 141
def _request_with_retry():
    # If the request has json in kwargs, use orjson to serialize it,
    # remove it from kwargs, and add it to the content parameter
    # This is because httpx uses a slower json serializer
    if "json" in kwargs:
        data = orjson.dumps(kwargs.pop("json"), option=orjson.OPT_SERIALIZE_NUMPY)
Contributor


[CriticalError]

Critical bug: The kwargs dictionary is being mutated inside the retry function, which will cause issues on subsequent retry attempts. When kwargs.pop("json") is called on the first attempt, the "json" key is removed from kwargs permanently. On retry attempts, the JSON data will be missing, leading to malformed requests.

Suggested change
- def _request_with_retry():
-     # If the request has json in kwargs, use orjson to serialize it,
-     # remove it from kwargs, and add it to the content parameter
-     # This is because httpx uses a slower json serializer
-     if "json" in kwargs:
-         data = orjson.dumps(kwargs.pop("json"), option=orjson.OPT_SERIALIZE_NUMPY)
+ def _request_with_retry():
+     # Create a copy of kwargs to avoid mutation across retries
+     request_kwargs = kwargs.copy()
+     # If the request has json in kwargs, use orjson to serialize it,
+     # remove it from kwargs, and add it to the content parameter
+     # This is because httpx uses a slower json serializer
+     if "json" in request_kwargs:
+         data = orjson.dumps(request_kwargs.pop("json"), option=orjson.OPT_SERIALIZE_NUMPY)
+         request_kwargs["content"] = data
+
+     # Unlike requests, httpx does not automatically escape the path
+     escaped_path = urllib.parse.quote(path, safe="/", encoding=None, errors=None)
+     url = self._api_url + escaped_path
+
+     response = self._session.request(method, url, **cast(Any, request_kwargs))
+     BaseHTTPClient._raise_chroma_error(response)
+     return orjson.loads(response.text)


File: chromadb/api/fastapi.py
Line: 141
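
The mutation concern raised in this comment is easiest to see in isolation. The sketch below uses a hypothetical retry loop (`call_with_retries`) and a hypothetical sender factory (`make_sender`); neither is Chroma or tenacity code. It covers only the case where a value popped from the shared dict is not written back. If the closure also stores the serialized body back into the same `kwargs` (as a `kwargs["content"] = data` assignment would), a later attempt would still find it, so whether a retry actually loses the body depends on that detail.

```python
from typing import Any, Callable, Dict, List, Optional


def call_with_retries(fn: Callable[[], Any], max_attempts: int = 3) -> Any:
    """Tiny stand-in for a retry loop: re-invoke fn while it raises ConnectionError."""
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    assert last_error is not None
    raise last_error


def make_sender(
    kwargs: Dict[str, Any], sent: List[Any], copy_kwargs: bool
) -> Callable[[], Any]:
    """Build a closure that records the body it would send; its first call fails."""
    state = {"calls": 0}

    def _send() -> Any:
        # Copying gives every attempt its own dict, so pop() cannot leak across retries.
        request_kwargs = dict(kwargs) if copy_kwargs else kwargs
        body = request_kwargs.pop("json", None)
        sent.append(body)
        state["calls"] += 1
        if state["calls"] == 1:
            raise ConnectionError("simulated transient failure")
        return body

    return _send


buggy_sent: List[Any] = []
call_with_retries(make_sender({"json": {"ids": ["a"]}}, buggy_sent, copy_kwargs=False))

fixed_sent: List[Any] = []
call_with_retries(make_sender({"json": {"ids": ["a"]}}, fixed_sent, copy_kwargs=True))
```

After this runs, `buggy_sent` is `[{"ids": ["a"]}, None]` (the retry saw no body), while `fixed_sent` holds the same body for both attempts.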

Comment on lines 67 to 88
def is_retryable_exception(exception: BaseException) -> bool:
    if isinstance(exception, (
        httpx.ConnectError,
        httpx.ConnectTimeout,
        httpx.ReadTimeout,
        httpx.WriteTimeout,
        httpx.PoolTimeout,
        httpx.NetworkError,
        httpx.RemoteProtocolError,
    )):
        return True

    if isinstance(exception, httpx.HTTPStatusError):
        # Retry on server errors that might be temporary
        return exception.response.status_code in [502, 503, 504]

    return False
Contributor


[BestPractice]

To make this new retry functionality more robust and flexible, I have a couple of suggestions:

  1. Testing: It would be beneficial to add unit tests for is_retryable_exception to cover the various exception types and status codes. This would help prevent regressions in the retry logic.

  2. Configurability: The retry parameters (e.g., number of attempts, backoff factor) are currently hardcoded. It might be beneficial to make these configurable via Settings. This would allow users to tune the retry behavior for their specific network conditions.


File: chromadb/api/fastapi.py
Line: 83
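
As a rough illustration of the testing suggestion, a table-driven test could look like the sketch below. So that it runs without `httpx` installed, it uses hypothetical stand-in exception classes (`FakeTransportError`, `FakeStatusError`) and a simplified predicate (`is_retryable`) that only mirrors the shape of `is_retryable_exception`. A real test would import the actual function and construct real `httpx` exceptions instead.

```python
from typing import List, Tuple

# Status codes treated as transient in the snippet under review.
RETRYABLE_STATUS_CODES = frozenset({502, 503, 504})


class FakeTransportError(Exception):
    """Hypothetical stand-in for network-level errors such as connect or read timeouts."""


class FakeStatusError(Exception):
    """Hypothetical stand-in for an HTTP status error; carries only the status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """Simplified predicate with the same structure as the function under review."""
    if isinstance(exc, FakeTransportError):
        return True
    if isinstance(exc, FakeStatusError):
        return exc.status_code in RETRYABLE_STATUS_CODES
    return False


def test_is_retryable_classification() -> None:
    """pytest-style test: one (exception, expected) pair per case."""
    cases: List[Tuple[BaseException, bool]] = [
        (FakeTransportError("connection refused"), True),
        (FakeStatusError(502), True),
        (FakeStatusError(503), True),
        (FakeStatusError(504), True),
        (FakeStatusError(500), False),  # generic server error: not in the retry list
        (FakeStatusError(404), False),  # client errors are not transient
        (ValueError("bad input"), False),  # unrelated exceptions are never retried
    ]
    for exc, expected in cases:
        assert is_retryable(exc) is expected, repr(exc)
```

Keeping the cases in a single table makes it cheap to add a row whenever the set of retryable status codes or exception types changes.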

@jairad26 jairad26 force-pushed the 09-04-_enh_client_side_retries branch from ef8c70f to 65487ad Compare September 23, 2025 22:26
@jairad26 jairad26 force-pushed the 09-02-_enh_consolidate_retries branch from b381e79 to 128ad9d Compare September 23, 2025 22:26
@jairad26 jairad26 force-pushed the 09-04-_enh_client_side_retries branch from 65487ad to 93023f0 Compare September 23, 2025 23:44
Comment on lines 130 to +186
  def _make_request(self, method: str, path: str, **kwargs: Dict[str, Any]) -> Any:
-     # If the request has json in kwargs, use orjson to serialize it,
-     # remove it from kwargs, and add it to the content parameter
-     # This is because httpx uses a slower json serializer
-     if "json" in kwargs:
-         data = orjson.dumps(kwargs.pop("json"), option=orjson.OPT_SERIALIZE_NUMPY)
-         kwargs["content"] = data
-
-     # Unlike requests, httpx does not automatically escape the path
-     escaped_path = urllib.parse.quote(path, safe="/", encoding=None, errors=None)
-     url = self._api_url + escaped_path
-
-     response = self._session.request(method, url, **cast(Any, kwargs))
-     BaseHTTPClient._raise_chroma_error(response)
-     return orjson.loads(response.text)
+     def _send_request() -> Any:
+         # If the request has json in kwargs, use orjson to serialize it,
+         # remove it from kwargs, and add it to the content parameter
+         # This is because httpx uses a slower json serializer
+         if "json" in kwargs:
+             data = orjson.dumps(
+                 kwargs.pop("json"), option=orjson.OPT_SERIALIZE_NUMPY
+             )
+             kwargs["content"] = data
+
+         # Unlike requests, httpx does not automatically escape the path
+         escaped_path = urllib.parse.quote(
+             path, safe="/", encoding=None, errors=None
+         )
+         url = self._api_url + escaped_path
+
+         response = self._session.request(method, url, **cast(Any, kwargs))
+         BaseHTTPClient._raise_chroma_error(response)
+         return orjson.loads(response.text)
+
+     retry_config = self._settings.retry_config
+
+     if retry_config is None:
+         return _send_request()
+
+     min_delay = max(float(retry_config.min_delay), 0.0)
+     max_delay = max(float(retry_config.max_delay), min_delay)
+     multiplier = max(min_delay, 1e-3)
+     exp_base = retry_config.factor if retry_config.factor > 0 else 2.0
+
+     wait_args = {
+         "multiplier": multiplier,
+         "min": min_delay,
+         "max": max_delay,
+         "exp_base": exp_base,
+     }
+
+     wait_strategy = (
+         wait_random_exponential(**wait_args)
+         if retry_config.jitter
+         else wait_exponential(**wait_args)
+     )
+
+     retrying = Retrying(
+         stop=stop_after_attempt(retry_config.max_attempts),
+         wait=wait_strategy,
+         retry=retry_if_exception(is_retryable_exception),
+         before_sleep=before_sleep_log(logger, logging.INFO),
+         reraise=True,
+     )
+
+     try:
+         return retrying(_send_request)
+     except RetryError as e:
+         # Re-raise the last exception that caused the retry to fail
+         raise e.last_attempt.exception() from None
Contributor


[BestPractice]

[CodeDuplication] The retry logic implemented in _make_request is nearly identical to the logic in chromadb/api/async_fastapi.py's _make_request method (lines 153-209). This introduces code duplication, making future maintenance harder.

Consider refactoring the common retry configuration and execution logic into a shared helper function. This would centralize the retry strategy and reduce redundancy. For example, a helper could build the Retrying or AsyncRetrying object based on a flag.


File: chromadb/api/fastapi.py
Line: 186
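
One possible shape for the shared helper suggested in this comment is sketched below. It factors out only the clamping and normalization of the delay parameters, which is the part of the quoted diff that does not depend on sync versus async execution. `RetryConfig` here is a hypothetical stand-in dataclass and `build_wait_args` is a hypothetical name; the `factor` default is an illustrative assumption. The sync and async `_make_request` methods would each pass the returned dict to their own wait strategy.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RetryConfig:
    """Hypothetical stand-in for the settings model; only the fields used here."""

    factor: float = 2.0
    min_delay: float = 1.0
    max_delay: float = 5.0


def build_wait_args(retry_config: RetryConfig) -> Dict[str, float]:
    """Normalize a retry config into keyword arguments for an exponential wait.

    Applies the same clamping as the quoted diff, so that the sync and async
    request paths cannot drift apart.
    """
    # Negative delays make no sense; clamp to zero.
    min_delay = max(float(retry_config.min_delay), 0.0)
    # The upper bound can never be below the lower bound.
    max_delay = max(float(retry_config.max_delay), min_delay)
    return {
        # Keep the multiplier strictly positive so the backoff can still grow.
        "multiplier": max(min_delay, 1e-3),
        "min": min_delay,
        "max": max_delay,
        # Fall back to doubling when the configured factor is not positive.
        "exp_base": retry_config.factor if retry_config.factor > 0 else 2.0,
    }
```

A helper like this is also straightforward to unit test on its own, since it is a pure function of the configuration.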

@jairad26 jairad26 force-pushed the 09-04-_enh_client_side_retries branch from 93023f0 to 7df8178 Compare September 23, 2025 23:54
@jairad26 jairad26 force-pushed the 09-04-_enh_client_side_retries branch from 7df8178 to e80460b Compare September 25, 2025 18:06
@jairad26 jairad26 force-pushed the 09-02-_enh_consolidate_retries branch from 128ad9d to e24519c Compare September 25, 2025 18:06
Comment on lines +103 to +104
min_delay: int = 1
max_delay: int = 5
Contributor


[BestPractice]

For consistency with the JavaScript client and to allow for more granular control over backoff timing, consider changing min_delay and max_delay to float type. The implementation already casts these values to float, so this change would make the model's type hint more accurate.

Suggested change
- min_delay: int = 1
- max_delay: int = 5
+ min_delay: float = 1.0
+ max_delay: float = 5.0


File: chromadb/config.py
Line: 104


3 participants