
Conversation

sanketkedia (Contributor) commented Sep 4, 2025

Description of changes

  • Improvements & Bug fixes
    • Before this PR, we retried requests to query nodes both in the frontend and in the executor. This PR consolidates retries at the executor: the executor has state about the memberlist, so it makes sense for it to retry. In the future we can make it more intelligent, e.g. by skipping a retry to the same query node, and keeping retries in the executor also opens up opportunities for hedging.
    • In some cases a retry needs to invalidate the FE cache, e.g. to get the latest information about collections and segments. The FE does this by passing a callback (closure) to the executor, called replan_closure, which gives the FE a way to prepare a fresh plan for the query on a retry.
    • The executor retries with exponential backoff when it gets unavailability or backoff errors from downstream. It retries with zero delay on a NotFound error, which means it has to update the segment info and retry; see the sketch after this list.
    • Also adds retry metrics at the executor.
    • In addition, we were missing retries from the FE to the sysdb for all the sysdb CRUD RPCs, such as create collection, get collection, and list collections. This PR also adds these retries with exponential backoff.
  • New functionality
    • ...
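As a rough illustration of the consolidated executor retry loop described above, here is a minimal sketch. The send/replan closures, type names, attempt bound, error types, and backoff constants are all hypothetical stand-ins (the real code uses `Box<dyn ChromaError>` and records retry metrics), not the PR's actual implementation:

```rust
use std::{future::Future, time::Duration};

// Hypothetical stand-ins for the real plan and result types.
#[derive(Clone)]
struct Count;       // a count query plan
struct CountResult; // its result

// Executor-side retry loop: replan on NotFound, back off on transient errors.
async fn count_with_replan<R, RFut, S, SFut>(
    mut plan: Count,
    replan: R, // FE-supplied closure: invalidate cache, build a fresh plan
    send: S,   // the gRPC call to a query node
    max_attempts: usize,
) -> Result<CountResult, tonic::Status>
where
    R: Fn(tonic::Code) -> RFut,
    RFut: Future<Output = Result<Count, tonic::Status>>,
    S: Fn(Count) -> SFut,
    SFut: Future<Output = Result<CountResult, tonic::Status>>,
{
    let mut delay = Duration::from_millis(100);
    for _ in 0..max_attempts {
        match send(plan.clone()).await {
            Ok(result) => return Ok(result),
            // Stale collection/segment info: rebuild the plan via the FE's
            // closure and retry immediately (zero delay).
            Err(status) if status.code() == tonic::Code::NotFound => {
                plan = replan(status.code()).await?;
            }
            // Downstream unavailable or asking us to back off: retry with
            // exponential backoff.
            Err(status)
                if matches!(
                    status.code(),
                    tonic::Code::Unavailable | tonic::Code::ResourceExhausted
                ) =>
            {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            // Anything else is not retryable.
            Err(status) => return Err(status),
        }
    }
    Err(tonic::Status::unavailable("retries exhausted"))
}
```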

Test plan

  • Tests pass locally with pytest for Python, yarn test for JS, and cargo test for Rust

Migration plan

None

Observability plan

Observed in local tilt. Will observe it in staging too.

Documentation Changes

None

github-actions bot commented Sep 4, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality? (Readability, Modularity, Intuitiveness)

sanketkedia (Contributor, Author) commented Sep 4, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sanketkedia sanketkedia changed the title [ENH]: consolidate retries [ENH]: Retries everywhere Sep 4, 2025
@sanketkedia sanketkedia marked this pull request as ready for review September 4, 2025 22:48
propel-code-bot bot (Contributor) commented Sep 4, 2025

Consolidated and Enhanced Retry Logic Across Executor and System Database Operations

This PR overhauls the retry handling in the query execution path and system database operations. Retries that were previously scattered across the frontend and executor have been consolidated to occur only at the executor, which now handles all retry logic with improved awareness of the memberlist and state. The executor now supports callbacks (replan closures) from the frontend to allow cache invalidation and query plan reconstruction upon retry, and tracks retry metrics. In addition, system database (sysdb) CRUD requests have gained robust exponential backoff retry handling. Error propagation, metrics, configuration, and test coverage were updated as required.

Key Changes

• Centralized all query retries at the executor level; removed redundant frontend retries.
• Added support for a replan_closure callback passed from the frontend, used for cache invalidation and fresh plan generation on retry.
• Executor retry logic now distinguishes error types (e.g., NotFound results in immediate retry with FE cache invalidation, Unavailable/ResourceExhausted/Aborted/DeadlineExceeded/Unknown use exponential backoff).
• Integrated metrics tracking for retries at various executor operations using OpenTelemetry counters.
• Added exponential backoff retries for all sysdb CRUD RPCs (create, get, list, etc.); see the sketch after this list.
• Refactored code for improved error propagation with ChromaError codes, and made error trait implementations more granular.
• Refactored operator and plan error to trace appropriate error codes and propagation.
• Upgraded backon crate from v1.3.0 to v1.5.2 and updated usages accordingly.
• Updated and simplified FE and sysdb provider configs (removed custom sysdb retry policy).
• Refactored executor API and frontend-operator interfaces to support required callback closure signatures.
• Expanded and adjusted tests for new error cases and retry logic.
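To make the sysdb retry pattern concrete, here is a minimal backon-based sketch. The wrapper name (`with_backoff`), error type, retry predicate, and bounds are assumptions for illustration, not the PR's exact code:

```rust
use std::{
    future::Future,
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
};

use backon::{ExponentialBuilder, Retryable};

// Wrap a sysdb RPC (e.g. a create_tenant call) in exponential backoff and
// count how many retries were needed.
async fn with_backoff<T, F, Fut>(call_sysdb: F) -> Result<T, tonic::Status>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, tonic::Status>>,
{
    let retries = Arc::new(AtomicUsize::new(0));
    let retries_for_notify = retries.clone();
    let result = call_sysdb
        .retry(ExponentialBuilder::default().with_max_times(5))
        .when(|status: &tonic::Status| {
            // Only transient failures are worth retrying.
            matches!(
                status.code(),
                tonic::Code::Unavailable | tonic::Code::Unknown
            )
        })
        .notify(move |status, delay| {
            retries_for_notify.fetch_add(1, Ordering::Relaxed);
            tracing::info!("sysdb call failed with {status:?}; retrying in {delay:?}");
        })
        .await;
    // The PR records this count on a per-operation OpenTelemetry counter.
    tracing::debug!("retries used: {}", retries.load(Ordering::Relaxed));
    result
}
```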

Affected Areas

• Query executor logic (distributed and local)
• Frontend operator and service APIs
• Sysdb CRUD provider and cache layer
• Error propagation (operator.rs/plan.rs/log.rs)
• OpenTelemetry metrics for retries
• Configuration structures (frontend, executor, sysdb, cache)
• Dependency versions (Cargo.toml/Cargo.lock)

This summary was automatically generated by @propel-code-bot

@sanketkedia sanketkedia force-pushed the 09-02-_enh_consolidate_retries branch from d9d5eba to 73e64da on September 4, 2025 23:31
```rust
        self.retryable_count(request).await
    }

    fn is_retryable(code: tonic::Code) -> bool {
```
Contributor

[BestPractice]

The error condition mapping in is_retryable is inconsistent with the actual error codes returned by the system. The method only retries on Unavailable and Unknown errors, but the code shows that log operations return Unavailable errors (line 1426), while other operations may return different error codes. This could lead to missing legitimate retry opportunities.

```rust
fn is_retryable(code: tonic::Code) -> bool {
    matches!(
        code,
        tonic::Code::Unavailable
            | tonic::Code::Unknown
            | tonic::Code::DeadlineExceeded
            | tonic::Code::Aborted
    )
}
```
File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1737

```diff
 }
 
-    Ok(DeleteCollectionRecordsResponse {})
+    Ok(records)
 }
 
 pub async fn delete(
     &mut self,
     request: DeleteCollectionRecordsRequest,
```
Contributor

[CriticalError]

The retry logic in the `delete` method separates record fetching from log pushing, but doesn't handle the case where records change between retries. If the first operation succeeds but the second fails and retries, the same records will be deleted multiple times, potentially causing data inconsistency.

```rust
// Consider fetching records within the retry loop or implementing idempotency checks
pub async fn delete(
    &mut self,
    request: DeleteCollectionRecordsRequest,
) -> Result<DeleteCollectionRecordsResponse, DeleteCollectionRecordsError> {
    let retries = Arc::new(AtomicUsize::new(0));
    let retryable_operation = || {
        let mut self_clone = self.clone();
        let request_clone = request.clone();
        async move {
            let records = self_clone
                .retryable_get_records_to_delete(request_clone.clone())
                .await?;
            self_clone
                .retryable_push_delete_logs(
                    request_clone.tenant_id,
                    request_clone.database_name,
                    request_clone.collection_id,
                    records,
                )
                .await
        }
    };
    // ... rest of retry logic
}
```
File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1610

Contributor Author

Deletes are only retried when the log is unavailable or has asked us to back off, so there is no double write.

Comment on lines +165 to +173

```rust
pub async fn count<F, Fut>(
    &mut self,
    plan: Count,
    replan_closure: F,
) -> Result<CountResult, ExecutorError>
where
    F: Fn(tonic::Code) -> Fut,
    Fut: Future<Output = Result<Count, Box<dyn ChromaError>>>,
{
```
Contributor

[BestPractice]

The new retry logic with replanning is a great improvement and aligns well with the PR description. However, the implementation is duplicated across count, get, knn, and search. This could be refactored into a single generic private method to improve maintainability and reduce code duplication.

A helper function could encapsulate the retry pattern. Here's a potential signature to illustrate the idea:

```rust
async fn execute_with_retry<Plan, PlanProto, Res, ReplanFuture, GrpcFuture>(
    &self,
    initial_plan: Plan,
    clients: &[QueryClient],
    replan_closure: impl Fn(tonic::Code) -> ReplanFuture,
    grpc_call: impl Fn(QueryClient, PlanProto) -> GrpcFuture,
    metric_counter: &Counter<u64>,
) -> Result<Res, ExecutorError>
where
    Plan: Clone + TryInto<PlanProto>, // and other traits
    ReplanFuture: Future<Output = Result<Plan, Box<dyn ChromaError>>>,
    GrpcFuture: Future<Output = Result<tonic::Response<Res>, tonic::Status>>,
    Res: 'static, // and other traits
{
    // ... shared retry logic here ...
}
```

This would make each of the public methods (`count`, `get`, etc.) a simple call to this helper with the appropriate closures and types.

File: rust/frontend/src/executor/distributed.rs
Line: 173

Collaborator

I agree with this

Comment on lines 312 to 314

```rust
    &mut self,
    CreateTenantRequest { name, .. }: CreateTenantRequest,
) -> Result<CreateTenantResponse, CreateTenantError> {
```
Contributor

[BestPractice]

It's great to see robust retry logic being added to the sysdb calls. I've noticed that the retry pattern (creating a closure, calling .retry(), .when(), .notify(), and updating metrics) is repeated for many of the sysdb methods (create_tenant, get_tenant, update_tenant, create_database, etc.).

To reduce this boilerplate, you could introduce a helper function or a macro. A helper function might look something like this:

```rust
async fn with_sysdb_retry<
    Fut: Future<Output = Result<T, E>>,
    T,
    E: ChromaError + 'static,
>(
    &self,
    operation: impl Fn() -> Fut,
    metric_counter: &Counter<u64>,
    log_message: &str,
) -> Result<T, E> {
    let retry_count = Arc::new(AtomicUsize::new(0));
    let result = operation
        .retry(self.retry_policy)
        .when(|e: &E| Self::is_retryable(e.code().into()))
        .notify(|e, _| {
            retry_count.fetch_add(1, Ordering::Relaxed);
            tracing::info!("{} failed with error {:?}. Retrying", log_message, e);
        })
        .await?;
    metric_counter.add(retry_count.load(Ordering::Relaxed) as u64, &[]);
    Ok(result)
}
```

This would simplify methods like create_tenant to a much more concise form, improving readability and maintainability.

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 314

@sanketkedia sanketkedia mentioned this pull request Sep 5, 2025
Comment on lines +123 to +145

```rust
let list_db_retries_counter = meter.u64_counter("list_database_retries").build();
let create_db_retries_counter = meter.u64_counter("create_database_retries").build();
let get_db_retries_counter = meter.u64_counter("get_database_retries").build();
let delete_db_retries_counter = meter.u64_counter("delete_database_retries").build();
let list_collections_retries_counter = meter.u64_counter("list_collections_retries").build();
let count_collections_retries_counter = meter.u64_counter("count_collections_retries").build();
let get_collection_retries_counter = meter.u64_counter("get_collection_retries").build();
let get_collection_by_crn_retries_counter = meter.u64_counter("get_collection_by_crn_retries").build();
let get_tenant_retries_counter = meter.u64_counter("get_tenant_retries").build();
let create_collection_retries_counter = meter.u64_counter("create_collection_retries").build();
let update_collection_retries_counter = meter.u64_counter("update_collection_retries").build();
let delete_collection_retries_counter = meter.u64_counter("delete_collection_retries").build();
let create_tenant_retries_counter = meter.u64_counter("create_tenant_retries").build();
let update_tenant_retries_counter = meter.u64_counter("update_tenant_retries").build();
let get_collection_with_segments_counter = meter.u64_counter("get_collection_with_segments_retries").build();
```
Contributor

might be better if there is a single retries metric in this case and we distinguish with labels instead
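For illustration, a single labeled counter might look like the sketch below; the meter, metric, and label names are hypothetical, not from the PR:

```rust
use opentelemetry::{global, KeyValue};

fn main() {
    let meter = global::meter("chroma.frontend");
    // One counter for all sysdb retries; the operation becomes a label.
    let retries = meter.u64_counter("sysdb_request_retries").build();

    // At each retry site:
    retries.add(1, &[KeyValue::new("operation", "create_tenant")]);
    retries.add(1, &[KeyValue::new("operation", "get_collection")]);
}
```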

```diff
@@ -279,14 +312,56 @@ impl ServiceBasedFrontend {
         &mut self,
         CreateTenantRequest { name, .. }: CreateTenantRequest,
     ) -> Result<CreateTenantResponse, CreateTenantError> {
-        self.sysdb_client.create_tenant(name).await
+        let retry_count = Arc::new(AtomicUsize::new(0));
```
Contributor

nit: do we need this new var? could we just clone self.metrics.create_tenant_retries_counter and pass it to the closure?
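A minimal sketch of this suggestion, assuming backon and an OpenTelemetry counter; the function and parameter names are illustrative:

```rust
use std::future::Future;

use backon::{ExponentialBuilder, Retryable};
use opentelemetry::metrics::Counter;

// Counters are cheaply clonable handles, so the notify closure can own its
// own copy instead of the method threading an Arc<AtomicUsize> through.
async fn retry_with_counter<T, F, Fut>(
    op: F,
    counter: Counter<u64>, // e.g. a clone of create_tenant_retries_counter
) -> Result<T, tonic::Status>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, tonic::Status>>,
{
    op.retry(ExponentialBuilder::default())
        .when(|status: &tonic::Status| status.code() == tonic::Code::Unavailable)
        .notify(move |status, _| {
            counter.add(1, &[]); // record the retry directly on the metric
            tracing::info!("request failed with {status:?}; retrying");
        })
        .await
}
```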

Comment on lines +61 to +69

```rust
fork_retries_counter: Counter<u64>,
delete_retries_counter: Counter<u64>,
count_retries_counter: Counter<u64>,
query_retries_counter: Counter<u64>,
search_retries_counter: Counter<u64>,
get_retries_counter: Counter<u64>,
add_retries_counter: Counter<u64>,
update_retries_counter: Counter<u64>,
upsert_retries_counter: Counter<u64>,
```
Contributor

similar comment, should this be one metric with labels?
