Async usage tracking for users far from limits #13427
[ENHANCEMENT] Usage-tracker: Improved write path performance by tracking series asynchronously for tenants that are far from their series limits. Tenants close to their limits continue to be tracked synchronously to enforce limits strictly. #13427
✅ Bugbot reviewed your changes and found no bugs!
- Add GetTenantsCloseToLimit gRPC method to usage tracker proto
- Add config fields for percentage and absolute thresholds
- Implement tenant list calculation in trackerStore.updateLimits()
- Add getTenantsCloseToLimit() method to retrieve the list
- Update all test files with new constructor signature

- Add handler that returns tenants close to limit for a partition
- Support random partition selection when partition is -1 or not specified
- Handler delegates to trackerStore.getTenantsCloseToLimit()

- Add config for polling interval (default 1 second)
- Implement service lifecycle with starting/running/stopping
- Add cache for tenants close to limits using sync.Map
- Add metrics for cache size and last update timestamp
- Implement updateTenantsCloseToLimitCache() to poll random partition
- Add isTenantCloseToLimit() helper method
- Prefer instances in same zone when polling

- Add isTenantCloseToLimit() to usageTrackerGenericClient interface
- Modify prePushMaxSeriesLimitMiddleware to use async tracking for tenants far from limits
- For tenants close to limit: track synchronously (current behavior)
- For tenants far from limit: spawn goroutine and wait in cleanup function
- Update test mocks to implement new interface method
- Async tracking ensures eventual consistency without blocking write path

- Export IsTenantCloseToLimit method (was lowercase)
- Update all call sites to use uppercase method name
- Add mock expectations for IsTenantCloseToLimit in tests
- Fix partition count access in client polling code
- All usage tracker and distributor tests now pass
- Add -usage-tracker-client.tenants-close-to-limit-cache-startup-retries flag (default: 3)
- Populate cache at startup with configurable retries to avoid starting with empty cache
- Add cortex_usage_tracker_client_tenants_close_to_limit_update_failures_total metric
- Track all failure paths in updateTenantsCloseToLimitCache()
- Update CHANGELOG with new flag and metric

- Rename proto RPC from GetTenantsCloseToLimit to GetUsersCloseToLimit
- Rename proto message fields: tenant_ids -> user_ids
- Update config flags:
  - usage-tracker.tenant-close-to-limit-* -> usage-tracker.user-close-to-limit-*
  - usage-tracker-client.tenants-close-to-limit-* -> usage-tracker-client.users-close-to-limit-*
- Update metrics:
  - cortex_usage_tracker_client_tenants_close_to_limit_* -> cortex_usage_tracker_client_users_close_to_limit_*
- Rename internal variables, methods, and comments accordingly
- Update CHANGELOG to reflect new naming

Add new section explaining the async tracking feature for users far from their limits:
- User proximity detection with percentage and absolute thresholds
- GetUsersCloseToLimit gRPC API
- Client-side caching with polling mechanism
- Conditional tracking in distributor middleware
- Exported metrics
- Trade-offs and considerations

- Remove user_close_to_limit_absolute_threshold config option
- Simplify updateLimits logic to only use percentage threshold
- Update newTrackerStore signature to remove absolute threshold parameter
- Update all test calls to newTrackerStore
- Update CHANGELOG to reflect single threshold
- Update documentation to remove absolute threshold references

A user is now considered close to their limit only if: series >= (localSeriesLimit * percentageThreshold / 100)
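The single-threshold rule from the last commit message can be sketched as a small predicate. This is only an illustration of the stated formula: the function name, the unsigned integer types, and the treatment of a zero limit are assumptions, not the code in `trackerStore.updateLimits()`.

```go
package main

import "fmt"

// isCloseToLimit applies the rule quoted in the commit message:
// series >= localSeriesLimit * percentageThreshold / 100.
// The name and parameter types are hypothetical, chosen for this sketch.
func isCloseToLimit(series, localSeriesLimit, percentageThreshold uint64) bool {
	if localSeriesLimit == 0 {
		// Assumption: a zero limit is treated here as "no limit",
		// so the user is never considered close to it.
		return false
	}
	return series >= localSeriesLimit*percentageThreshold/100
}

func main() {
	// With a local limit of 1000 series and a 90% threshold,
	// the cut-off is 900 series.
	fmt.Println(isCloseToLimit(899, 1000, 90))
	fmt.Println(isCloseToLimit(900, 1000, 90))
}
```

With integer arithmetic the cut-off is rounded down, so small limits reach the threshold slightly earlier than an exact percentage would suggest.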
Signed-off-by: Oleg Zaytsev <[email protected]>
…ToLimit method Signed-off-by: Oleg Zaytsev <[email protected]>
tcard
left a comment
A few non-blockers, but looks good as far as I can tell 👍
tacole02
left a comment
Changelog LGTM
Co-authored-by: Toni Cárdenas <[email protected]>
The backport to
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-13427-to-r368-async-usage-tracker origin/r368-async-usage-tracker
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x d3994edf5109dd5bb8e66afb199796553fdf028c
# Push it to GitHub
git push --set-upstream origin backport-13427-to-r368-async-usage-tracker
git switch main
# Remove the local backport branch
git branch -D backport-13427-to-r368-async-usage-tracker
Then, create a pull request where the
This implements asynchronous calls to Usage-Tracker series tracking for users who are far enough from their limits.

The Usage-Tracker gRPC service now has a `GetUsersCloseToLimit` method, which returns a sorted list of tenants that are currently close to their series limit. Each distributor replica continuously polls this endpoint (at a configurable interval, every 1 second by default) from a random partition, assuming that series are distributed uniformly across partitions, to keep an up-to-date view of this list.

This feature should allow us to introduce usage-tracker in the write path without disrupting most users, who are not hitting their limits.

New flags:
- `-usage-tracker.user-close-to-limit-percentage-threshold` defines the percentage of the series limit at which a user is considered close to the limit, 90% by default.
- `distributor.usage-tracker-client.min-series-limit-for-async-tracking`, if not zero, specifies the minimum series limit a tenant must have to be eligible for async tracking.
- `distributor.max-time-to-wait-for-async-tracking-response-after-ingestion` configures the maximum amount of time the usage-tracker client is allowed to wait for asynchronous tracking requests to complete once the ingestion call has finished.

New metrics:
- `cortex_usage_tracker_client_users_close_to_limit_count`: length of the cache in each distributor.
- `cortex_usage_tracker_client_users_close_to_limit_last_update_timestamp_seconds`: timestamp of the last cache update in each distributor; can be used to alert on cache update issues.
- `cortex_usage_tracker_client_users_close_to_limit_update_failures_total`: number of cache update failures in each distributor; can be used to get a full overview of cache update failures in a cluster.
- `cortex_distributor_async_usage_tracker_calls_total`: number of usage-tracker calls that were asynchronous, labeled by `user`.
- `cortex_distributor_async_usage_tracker_calls_with_rejected_series_total`: number of usage-tracker calls that were asynchronous yet rejected some series, labeled by `user`.

- [x] Tests updated
- [x] CHANGELOG entry added
- [ ] Documentation updated

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Makes usage-tracker calls asynchronous for tenants far from series limits, powered by a new GetUsersCloseToLimit API, client-side polling/cache, and related configs/metrics.
>
> - **Distributor**:
>   - Adds async series tracking path gated by `usageTrackerClient.CanTrackAsync(user)`; runs tracking in background with bounded post-ingest wait/cancel.
>   - New metrics: `cortex_distributor_async_usage_tracker_calls_total`, `..._with_rejected_series_total`.
>   - Refactors series filtering into `filterOutRejectedSeries()`; wires client with limits; updates tests.
> - **Usage-Tracker Service**:
>   - Tracks and exposes sorted users close to limit (threshold-based) in `trackerStore`; maintains `sortedTenants`; updates on schedule/cleanup; test hooks.
>   - New config: `usage-tracker.user-close-to-limit-percentage-threshold`.
>   - Adds gRPC: `GetUsersCloseToLimit` and server impl; exempted from auth.
> - **Usage-Tracker Client**:
>   - New startup loop and periodic polling of a random partition for users-close-to-limit; caches list; exposes `CanTrackAsync(user)` (checks cache + min series limit).
>   - New configs: poll interval, startup retries, max async wait after ingestion, min series limit for async.
>   - New metrics: cache count, last update time, update failures.
>   - gRPC client interceptors skip auth header for `GetUsersCloseToLimit`.
> - **Proto/API**:
>   - Extends `usagetracker.proto` and generated code with `GetUsersCloseToLimit` request/response.
> - **Docs/CHANGELOG**:
>   - Documents distributor interaction and async mode; CHANGELOG entry added.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 8709490. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>
(cherry picked from commit d3994ed)
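The description says `CanTrackAsync(user)` checks the cached list of users close to their limit plus the minimum series limit. A minimal sketch of that decision follows. The type, field, and method names are hypothetical, and it uses a mutex-guarded map rather than whatever structure the real client uses (an early commit mentions `sync.Map`).

```go
package main

import (
	"fmt"
	"sync"
)

// usersCloseToLimitCache is a hypothetical stand-in for the distributor-side
// cache that the polling loop refreshes.
type usersCloseToLimitCache struct {
	mu    sync.RWMutex
	users map[string]struct{}

	// minSeriesLimitForAsync mirrors the min-series-limit-for-async-tracking
	// flag: zero disables the check.
	minSeriesLimitForAsync int
	// seriesLimit is an assumed lookup of a user's configured series limit.
	seriesLimit func(userID string) int
}

// replace swaps the cached set for the list returned by the latest poll.
func (c *usersCloseToLimitCache) replace(userIDs []string) {
	users := make(map[string]struct{}, len(userIDs))
	for _, id := range userIDs {
		users[id] = struct{}{}
	}
	c.mu.Lock()
	c.users = users
	c.mu.Unlock()
}

// canTrackAsync reports whether a user may be tracked asynchronously:
// its limit must reach the configured minimum (when set) and the user
// must not be in the cached close-to-limit set.
func (c *usersCloseToLimitCache) canTrackAsync(userID string) bool {
	if c.minSeriesLimitForAsync > 0 && c.seriesLimit(userID) < c.minSeriesLimitForAsync {
		return false
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, isClose := c.users[userID]
	return !isClose
}

func main() {
	limits := map[string]int{"small": 500, "big": 1_000_000, "busy": 1_000_000}
	cache := &usersCloseToLimitCache{
		minSeriesLimitForAsync: 1000,
		seriesLimit:            func(userID string) int { return limits[userID] },
	}
	cache.replace([]string{"busy"}) // pretend the last poll returned "busy"
	for _, u := range []string{"small", "big", "busy"} {
		fmt.Println(u, cache.canTrackAsync(u))
	}
}
```

Replacing the whole set on each poll, instead of mutating it in place, keeps readers from ever observing a half-updated list.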
Backporting #13427

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Enables async series tracking for tenants far from limits by adding a users-close-to-limit RPC and client cache, integrating it in the distributor, and adding related configs, metrics, and tests.
>
> - **Distributor**:
>   - Add async path for usage tracking when `CanTrackAsync(user)` is true; waits briefly post-ingestion and cancels if slow.
>   - New metrics: `cortex_distributor_async_usage_tracker_calls_total` and `_with_rejected_series_total`.
>   - Refactor series rejection filtering into `filterOutRejectedSeries()`.
>   - Wire updated client ctor (passes limits) and metric cleanup.
> - **Usage-tracker Service**:
>   - New RPC `GetUsersCloseToLimit` and auth bypass for it; add proto/messages.
>   - Track and expose sorted users-close-to-limit based on new threshold; maintain `sortedTenants` and periodic `updateLimits()` logic.
>   - Config: `usage-tracker.user-close-to-limit-percentage-threshold`.
>   - Partition handler: test hook to force limit updates.
> - **Usage-tracker Client**:
>   - Background poll of a random partition for users-close-to-limit; cache + metrics (`*_users_close_to_limit_*`).
>   - New `CanTrackAsync(user)`; constructor now requires limits provider.
>   - Config: poll interval, startup retries, max wait after ingestion, min series limit for async.
>   - gRPC instrumentation excludes auth header for new method.
> - **Other**:
>   - Docs: distributor/usage-tracker interaction updated for async path.
>   - Tests updated/added across distributor, tracker, client; load generator adapted.
>   - CHANGELOG and go.mod (opentracing deps) updated.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit a892452. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>