Conversation

@colega
Contributor

@colega colega commented Nov 8, 2025

Summary

This implements asynchronous calls to Usage-Tracker series tracking for users who are far enough from their limits.

The Usage-Tracker gRPC service now has a GetUsersCloseToLimit method, which returns a sorted list of tenants that are currently close to their series limit. Each distributor replica continuously polls this endpoint (at a configurable interval, every 1 second by default) on a random partition, assuming that series distribution across partitions is uniform, to keep an up-to-date view of this list.
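
As a rough illustration of the polling described above, here is a minimal sketch of a client-side cache that is refreshed from a random partition. It is not the code from this PR: `fetchFunc`, `closeToLimitCache`, and `errUnavailable` are hypothetical names, the fetch function stands in for the GetUsersCloseToLimit call (context and request options are omitted), and keeping the previous list on a failed poll is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// errUnavailable is a hypothetical error used to simulate a failed poll.
var errUnavailable = errors.New("usage-tracker unavailable")

// fetchFunc stands in for the GetUsersCloseToLimit call against one partition.
// It is an assumed signature, not the generated gRPC client.
type fetchFunc func(partition int) ([]string, error)

// closeToLimitCache holds the most recently fetched set of users close to their limit.
type closeToLimitCache struct {
	mu    sync.RWMutex
	users map[string]struct{}
}

// refresh polls one random partition and, on success, replaces the cached set.
// On failure the previous set is kept (an assumption of this sketch).
// partitions must be greater than zero.
func (c *closeToLimitCache) refresh(fetch fetchFunc, partitions int) error {
	users, err := fetch(rand.Intn(partitions))
	if err != nil {
		return err
	}
	set := make(map[string]struct{}, len(users))
	for _, u := range users {
		set[u] = struct{}{}
	}
	c.mu.Lock()
	c.users = set
	c.mu.Unlock()
	return nil
}

// contains reports whether the user was in the last successfully fetched list.
func (c *closeToLimitCache) contains(user string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	_, ok := c.users[user]
	return ok
}

// run refreshes the cache on every tick until ctx is cancelled.
func (c *closeToLimitCache) run(ctx context.Context, fetch fetchFunc, partitions int, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = c.refresh(fetch, partitions) // a real client would count failures in a metric
		}
	}
}

func main() {
	cache := &closeToLimitCache{}
	fetch := func(partition int) ([]string, error) {
		return []string{"tenant-a", "tenant-b"}, nil
	}
	if err := cache.refresh(fetch, 16); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	fmt.Println(cache.contains("tenant-a"), cache.contains("tenant-z"))

	// A failed poll leaves the previous list in place.
	failing := func(partition int) ([]string, error) { return nil, errUnavailable }
	fmt.Println(cache.refresh(failing, 16), cache.contains("tenant-a"))
}
```

In a distributor-like process, `run` would be started in its own goroutine with the configured polling interval, and the write path would only call `contains`.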

This feature should allow us to introduce usage-tracker in the write path without disrupting most users, who are not hitting their limits.

New flags:

  • -usage-tracker.user-close-to-limit-percentage-threshold defines the percentage of the series limit at which a user is considered close to the limit, 90% by default.
  • distributor.usage-tracker-client.min-series-limit-for-async-tracking, if not zero, specifies the minimum series limit a tenant must have to be eligible for async tracking.
  • distributor.max-time-to-wait-for-async-tracking-response-after-ingestion configures the maximum amount of time usage-tracker-client is allowed to wait for tracking requests to complete once the asynchronous ingestion call is complete.
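
To make the eligibility rule implied by these flags concrete, here is a small sketch of how a CanTrackAsync-style check could combine the close-to-limit list with the minimum series limit. The function name, parameter types, and the use of a map for the list are assumptions made for illustration; tenants without a series limit are not modeled.

```go
package main

import "fmt"

// canTrackAsync sketches the eligibility decision: a user may be tracked
// asynchronously only if it is not in the close-to-limit set and, when
// minSeriesLimit is non-zero, its series limit is at least minSeriesLimit.
func canTrackAsync(user string, seriesLimit, minSeriesLimit int, closeToLimit map[string]struct{}) bool {
	if _, isClose := closeToLimit[user]; isClose {
		return false // close to the limit: keep tracking synchronously
	}
	if minSeriesLimit > 0 && seriesLimit < minSeriesLimit {
		return false // limit too small to risk overshooting it asynchronously
	}
	return true
}

func main() {
	closeToLimit := map[string]struct{}{"tenant-near": {}}

	fmt.Println(canTrackAsync("tenant-near", 1_000_000, 0, closeToLimit))
	fmt.Println(canTrackAsync("tenant-far", 1_000_000, 0, closeToLimit))
	fmt.Println(canTrackAsync("tenant-far", 500, 1000, closeToLimit))
}
```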

New metrics:

  • cortex_usage_tracker_client_users_close_to_limit_count: length of the cache in each distributor.
  • cortex_usage_tracker_client_users_close_to_limit_last_update_timestamp_seconds: timestamp of the last cache update in each distributor; can be used to alert on cache update issues.
  • cortex_usage_tracker_client_users_close_to_limit_update_failures_total: number of cache update failures in each distributor; can be used to get a full overview of cache update failures in a cluster.
  • cortex_distributor_async_usage_tracker_calls_total: number of usage-tracker calls that were asynchronous, labeled by user.
  • cortex_distributor_async_usage_tracker_calls_with_rejected_series_total: number of usage-tracker calls that were asynchronous yet rejected some series, labeled by user.

Checklist

  • Tests updated
  • CHANGELOG entry added
  • Documentation updated

Note

Makes usage-tracker calls asynchronous for tenants far from series limits, powered by a new GetUsersCloseToLimit API, client-side polling/cache, and related configs/metrics.

  • Distributor:
    • Adds async series tracking path gated by usageTrackerClient.CanTrackAsync(user); runs tracking in background with bounded post-ingest wait/cancel.
    • New metrics: cortex_distributor_async_usage_tracker_calls_total, ..._with_rejected_series_total.
    • Refactors series filtering into filterOutRejectedSeries(); wires client with limits; updates tests.
  • Usage-Tracker Service:
    • Tracks and exposes sorted users close to limit (threshold-based) in trackerStore; maintains sortedTenants; updates on schedule/cleanup; test hooks.
    • New config: usage-tracker.user-close-to-limit-percentage-threshold.
    • Adds gRPC: GetUsersCloseToLimit and server impl; exempted from auth.
  • Usage-Tracker Client:
    • New startup loop and periodic polling of a random partition for users-close-to-limit; caches list; exposes CanTrackAsync(user) (checks cache + min series limit).
    • New configs: poll interval, startup retries, max async wait after ingestion, min series limit for async.
    • New metrics: cache count, last update time, update failures.
    • gRPC client interceptors skip auth header for GetUsersCloseToLimit.
  • Proto/API:
    • Extends usagetracker.proto and generated code with GetUsersCloseToLimit request/response.
  • Docs/CHANGELOG:
    • Documents distributor interaction and async mode; CHANGELOG entry added.
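
The "bounded post-ingest wait/cancel" behaviour summarized above can be sketched as follows. This is an illustrative pattern under assumptions, not the distributor code: `trackAsync` and `demo` are hypothetical names, and the tracking call is represented by a plain function that takes a context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// trackAsync starts track in the background and returns a wait function.
// Calling wait(maxWait) after ingestion blocks until tracking finishes or
// maxWait elapses; either way the tracking context is cancelled on return.
// The result reports whether tracking finished (with or without error) in time.
func trackAsync(parent context.Context, track func(ctx context.Context) error) (wait func(maxWait time.Duration) bool) {
	ctx, cancel := context.WithCancel(parent)
	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- track(ctx) }()

	return func(maxWait time.Duration) bool {
		defer cancel()
		timer := time.NewTimer(maxWait)
		defer timer.Stop()
		select {
		case <-done:
			return true
		case <-timer.C:
			return false // the deferred cancel aborts the in-flight call
		}
	}
}

// demo runs one fast and one slow tracking call through trackAsync.
func demo() (fastDone, slowDone bool) {
	fast := func(ctx context.Context) error { return nil }
	slow := func(ctx context.Context) error {
		<-ctx.Done() // simulates a call that only returns once cancelled
		return ctx.Err()
	}
	fastDone = trackAsync(context.Background(), fast)(time.Second)
	slowDone = trackAsync(context.Background(), slow)(10 * time.Millisecond)
	return fastDone, slowDone
}

func main() {
	fmt.Println(demo())
}
```

The wait function is meant to be called from the request's cleanup step, so that ingestion itself is never delayed by the tracking call.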

Written by Cursor Bugbot for commit 8709490. This will update automatically on new commits. Configure here.

@colega colega requested a review from a team as a code owner November 8, 2025 17:54
@colega colega marked this pull request as draft November 8, 2025 17:56
colega added a commit that referenced this pull request Nov 8, 2025
[ENHANCEMENT] Usage-tracker: Improved write path performance by tracking
series asynchronously for tenants that are far from their series limits.
Tenants close to their limits continue to be tracked synchronously to
enforce limits strictly. #13427
@colega colega changed the title DON'T REVIEW! Async usage tracking for tenants far from limits DON'T REVIEW! Async usage tracking for users far from limits Nov 11, 2025
@colega colega changed the title DON'T REVIEW! Async usage tracking for users far from limits Async usage tracking for users far from limits Nov 12, 2025
@colega
Contributor Author

colega commented Nov 12, 2025

bugbot run

colega added a commit that referenced this pull request Nov 12, 2025
[ENHANCEMENT] Usage-tracker: Improved write path performance by tracking
series asynchronously for tenants that are far from their series limits.
Tenants close to their limits continue to be tracked synchronously to
enforce limits strictly. #13427
@colega colega force-pushed the async-usage-tracking-far-from-limits branch from cb1e62f to 71d54e2 Compare November 12, 2025 10:58
@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no bugs!


@colega colega marked this pull request as ready for review November 12, 2025 11:02
- Add GetTenantsCloseToLimit gRPC method to usage tracker proto
- Add config fields for percentage and absolute thresholds
- Implement tenant list calculation in trackerStore.updateLimits()
- Add getTenantsCloseToLimit() method to retrieve the list
- Update all test files with new constructor signature
- Add handler that returns tenants close to limit for a partition
- Support random partition selection when partition is -1 or not specified
- Handler delegates to trackerStore.getTenantsCloseToLimit()
- Add config for polling interval (default 1 second)
- Implement service lifecycle with starting/running/stopping
- Add cache for tenants close to limits using sync.Map
- Add metrics for cache size and last update timestamp
- Implement updateTenantsCloseToLimitCache() to poll random partition
- Add isTenantCloseToLimit() helper method
- Prefer instances in same zone when polling
- Add isTenantCloseToLimit() to usageTrackerGenericClient interface
- Modify prePushMaxSeriesLimitMiddleware to use async tracking for tenants far from limits
- For tenants close to limit: track synchronously (current behavior)
- For tenants far from limit: spawn goroutine and wait in cleanup function
- Update test mocks to implement new interface method
- Async tracking ensures eventual consistency without blocking write path
- Export IsTenantCloseToLimit method (was lowercase)
- Update all call sites to use uppercase method name
- Add mock expectations for IsTenantCloseToLimit in tests
- Fix partition count access in client polling code
- All usage tracker and distributor tests now pass
[ENHANCEMENT] Usage-tracker: Improved write path performance by tracking
series asynchronously for tenants that are far from their series limits.
Tenants close to their limits continue to be tracked synchronously to
enforce limits strictly. #13427
- Add -usage-tracker-client.tenants-close-to-limit-cache-startup-retries flag (default: 3)
- Populate cache at startup with configurable retries to avoid starting with empty cache
- Add cortex_usage_tracker_client_tenants_close_to_limit_update_failures_total metric
- Track all failure paths in updateTenantsCloseToLimitCache()
- Update CHANGELOG with new flag and metric
- Rename proto RPC from GetTenantsCloseToLimit to GetUsersCloseToLimit
- Rename proto message fields: tenant_ids -> user_ids
- Update config flags:
  - usage-tracker.tenant-close-to-limit-* -> usage-tracker.user-close-to-limit-*
  - usage-tracker-client.tenants-close-to-limit-* -> usage-tracker-client.users-close-to-limit-*
- Update metrics:
  - cortex_usage_tracker_client_tenants_close_to_limit_* -> cortex_usage_tracker_client_users_close_to_limit_*
- Rename internal variables, methods, and comments accordingly
- Update CHANGELOG to reflect new naming
Add new section explaining the async tracking feature for users
far from their limits:
- User proximity detection with percentage and absolute thresholds
- GetUsersCloseToLimit gRPC API
- Client-side caching with polling mechanism
- Conditional tracking in distributor middleware
- Exported metrics
- Trade-offs and considerations
- Remove user_close_to_limit_absolute_threshold config option
- Simplify updateLimits logic to only use percentage threshold
- Update newTrackerStore signature to remove absolute threshold parameter
- Update all test calls to newTrackerStore
- Update CHANGELOG to reflect single threshold
- Update documentation to remove absolute threshold references

A user is now considered close to their limit only if:
  series >= (localSeriesLimit * percentageThreshold / 100)
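
The rule above can be written as a small function. The sketch below assumes integer series counts and a floating-point percentage; the actual types in trackerStore may differ, and a zero limit is not given any special treatment here.

```go
package main

import "fmt"

// isCloseToLimit applies the rule from the commit message:
// series >= localSeriesLimit * percentageThreshold / 100.
func isCloseToLimit(series, localSeriesLimit uint64, percentageThreshold float64) bool {
	threshold := float64(localSeriesLimit) * percentageThreshold / 100
	return float64(series) >= threshold
}

func main() {
	// With a local limit of 1000 series and the default 90% threshold,
	// the cut-off is 900 series.
	fmt.Println(isCloseToLimit(899, 1000, 90)) // false
	fmt.Println(isCloseToLimit(900, 1000, 90)) // true
}
```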
Signed-off-by: Oleg Zaytsev <[email protected]>
@colega colega force-pushed the async-usage-tracking-far-from-limits branch from 8ce6edf to 32a3572 Compare November 12, 2025 14:18
Signed-off-by: Oleg Zaytsev <[email protected]>
Contributor

@tcard tcard left a comment

A few non-blockers, but looks good as far as I can tell 👍

Contributor

@tacole02 tacole02 left a comment

Changelog LGTM

colega and others added 3 commits November 13, 2025 12:55
@colega colega requested a review from stevesg as a code owner November 13, 2025 12:59
@colega colega merged commit d3994ed into main Nov 13, 2025
39 checks passed
@colega colega deleted the async-usage-tracking-far-from-limits branch November 13, 2025 14:45
@mimir-github-bot
Contributor

The backport to r368-async-usage-tracker failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-13427-to-r368-async-usage-tracker origin/r368-async-usage-tracker
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x d3994edf5109dd5bb8e66afb199796553fdf028c
# Push it to GitHub
git push --set-upstream origin backport-13427-to-r368-async-usage-tracker
git switch main
# Remove the local backport branch
git branch -D backport-13427-to-r368-async-usage-tracker

Then, create a pull request where the base branch is r368-async-usage-tracker and the compare/head branch is backport-13427-to-r368-async-usage-tracker.

colega added a commit that referenced this pull request Nov 13, 2025
This implements asynchronous calls to Usage-Tracker series tracking for
users who are far enough from their limits.

Usage-Tracker gRPC service now has a `GetUsersCloseToLimit` method,
which returns a sorted list of tenants who are currently close to their
series limit. Each distributor replica performs a continuous polling
(configurable, every 1 second by default) of this endpoint from a random
partition (assuming that series distribution across partitions is
uniform) to keep an up-to-date view of this list.

This feature should allow us to introduce usage-tracker in the write
path without the disruption of most of the users, who are not hitting
the limits.

New flags:
- `-usage-tracker.user-close-to-limit-percentage-threshold` defines the
percentage of what's considered being close to the limits, 90% by
default.
-
`distributor.usage-tracker-client.min-series-limit-for-async-tracking`,
if not zero, specifies the minimum amount of series limit for a tenant
to be eligible for async tracking.
-
`distributor.max-time-to-wait-for-async-tracking-response-after-ingestion`
configures the maximum amount of time usage-tracker-client is allowed to
wait for tracking requests to complete once asynchronous ingestion call
is complete.

New metrics:
- `cortex_usage_tracker_client_users_close_to_limit_count`: length of
the cache of each distributor
-
`cortex_usage_tracker_client_users_close_to_limit_last_update_timestamp_seconds`:
timestamp of last cache update in each distributor, can be used to alert
on cache updating issues
-
`cortex_usage_tracker_client_users_close_to_limit_update_failures_total`:
number of failures of cache update in each distributor, can be used to
get a full overview of cache updating failures in a cluster
- `cortex_distributor_async_usage_tracker_calls_total` number of
usage-tracker calls that were asynchronous, labeled by `user`.
-
`cortex_distributor_async_usage_tracker_calls_with_rejected_series_total`
number of usage-tracker calls that were asynchronous yet rejected some
series, labeled by `user`.

- [x] Tests updated
- [x] CHANGELOG entry added
- [ ] Documentation updated

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Makes usage-tracker calls asynchronous for tenants far from series
limits, powered by a new GetUsersCloseToLimit API, client-side
polling/cache, and related configs/metrics.
>
> - **Distributor**:
> - Adds async series tracking path gated by
`usageTrackerClient.CanTrackAsync(user)`; runs tracking in background
with bounded post-ingest wait/cancel.
> - New metrics: `cortex_distributor_async_usage_tracker_calls_total`,
`..._with_rejected_series_total`.
> - Refactors series filtering into `filterOutRejectedSeries()`; wires
client with limits; updates tests.
> - **Usage-Tracker Service**:
> - Tracks and exposes sorted users close to limit (threshold-based) in
`trackerStore`; maintains `sortedTenants`; updates on schedule/cleanup;
test hooks.
> - New config:
`usage-tracker.user-close-to-limit-percentage-threshold`.
> - Adds gRPC: `GetUsersCloseToLimit` and server impl; exempted from
auth.
> - **Usage-Tracker Client**:
> - New startup loop and periodic polling of a random partition for
users-close-to-limit; caches list; exposes `CanTrackAsync(user)` (checks
cache + min series limit).
> - New configs: poll interval, startup retries, max async wait after
ingestion, min series limit for async.
>   - New metrics: cache count, last update time, update failures.
> - gRPC client interceptors skip auth header for
`GetUsersCloseToLimit`.
> - **Proto/API**:
> - Extends `usagetracker.proto` and generated code with
`GetUsersCloseToLimit` request/response.
> - **Docs/CHANGELOG**:
> - Documents distributor interaction and async mode; CHANGELOG entry
added.
>
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
8709490. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>
(cherry picked from commit d3994ed)
colega added a commit that referenced this pull request Nov 13, 2025
Backporting #13427 

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Enables async series tracking for tenants far from limits by adding a
users-close-to-limit RPC and client cache, integrating it in the
distributor, and adding related configs, metrics, and tests.
> 
> - **Distributor**:
> - Add async path for usage tracking when `CanTrackAsync(user)` is
true; waits briefly post-ingestion and cancels if slow.
> - New metrics: `cortex_distributor_async_usage_tracker_calls_total`
and `_with_rejected_series_total`.
> - Refactor series rejection filtering into
`filterOutRejectedSeries()`.
>   - Wire updated client ctor (passes limits) and metric cleanup.
> - **Usage-tracker Service**:
> - New RPC `GetUsersCloseToLimit` and auth bypass for it; add
proto/messages.
> - Track and expose sorted users-close-to-limit based on new threshold;
maintain `sortedTenants` and periodic `updateLimits()` logic.
>   - Config: `usage-tracker.user-close-to-limit-percentage-threshold`.
>   - Partition handler: test hook to force limit updates.
> - **Usage-tracker Client**:
> - Background poll of a random partition for users-close-to-limit;
cache + metrics (`*_users_close_to_limit_*`).
> - New `CanTrackAsync(user)`; constructor now requires limits provider.
> - Config: poll interval, startup retries, max wait after ingestion,
min series limit for async.
>   - gRPC instrumentation excludes auth header for new method.
> - **Other**:
> - Docs: distributor/usage-tracker interaction updated for async path.
> - Tests updated/added across distributor, tracker, client; load
generator adapted.
>   - CHANGELOG and go.mod (opentracing deps) updated.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
a892452. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>
sherinabr pushed a commit that referenced this pull request Nov 13, 2025
Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>
sherinabr pushed a commit that referenced this pull request Nov 13, 2025
Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Toni Cárdenas <[email protected]>
3 participants