-
Notifications
You must be signed in to change notification settings - Fork 686
feat: kv router should route to available instances #4225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: PeaBrane <[email protected]>
WalkthroughThis refactoring replaces Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Poem
Pre-merge checks❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/llm/src/discovery/model_manager.rs (1)
58-60: Update map key comment for clarity
kv_choosersis now keyed byendpoint.path(), so the “component service_name” note is stale. Tweaking the comment will save future head-scratching.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
lib/bindings/c/src/lib.rs(2 hunks)lib/bindings/python/rust/llm/kv.rs(1 hunks)lib/llm/src/discovery/model_manager.rs(3 hunks)lib/llm/src/discovery/watcher.rs(1 hunks)lib/llm/src/entrypoint/input/common.rs(2 hunks)lib/llm/src/entrypoint/input/grpc.rs(1 hunks)lib/llm/src/entrypoint/input/http.rs(1 hunks)lib/llm/src/kv_router.rs(5 hunks)lib/llm/src/kv_router/prefill_router.rs(2 hunks)lib/llm/src/kv_router/scheduler.rs(5 hunks)lib/runtime/src/component/client.rs(6 hunks)
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/indexer.rs:0-0
Timestamp: 2025-09-17T20:55:06.333Z
Learning: When PeaBrane encounters a complex implementation issue that would significantly expand PR scope (like the remove_worker_sender method in lib/llm/src/kv_router/indexer.rs that required thread-safe map updates and proper shard targeting), they prefer to remove the problematic implementation entirely rather than rush a partial fix, deferring the proper solution to a future PR.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3184
File: docs/architecture/kv_cache_routing.md:70-73
Timestamp: 2025-09-23T20:08:37.105Z
Learning: PeaBrane prefers to keep documentation diagrams simplified to avoid visual overload, even when this means sacrificing some technical precision for the sake of clarity and comprehension. They prioritize pedagogical effectiveness over exhaustive technical detail in architectural diagrams.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3095
File: lib/llm/src/kv_router/subscriber.rs:200-223
Timestamp: 2025-09-17T20:55:41.416Z
Learning: In the dynamo codebase, PeaBrane prefers to maintain consistency with existing etcd key parsing patterns (like splitting on '/' and parsing the last segment) rather than introducing more robust parsing approaches, even when the current approach might be brittle, to keep the codebase aligned and avoid divergent patterns.
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.
Applied to files:
lib/bindings/c/src/lib.rslib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/http.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.
Applied to files:
lib/llm/src/discovery/watcher.rslib/llm/src/discovery/model_manager.rs
📚 Learning: 2025-08-29T10:08:18.434Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2756
File: lib/bindings/python/rust/llm/kv.rs:401-436
Timestamp: 2025-08-29T10:08:18.434Z
Learning: In the Python KvIndexer bindings (lib/bindings/python/rust/llm/kv.rs), the hardcoded reset_states=true parameter passed to start_kv_router_background is intentional behavior, not an oversight that needs to be made configurable.
Applied to files:
lib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/common.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.
Applied to files:
lib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/common.rslib/llm/src/kv_router/prefill_router.rs
📚 Learning: 2025-05-30T06:38:09.630Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
Applied to files:
lib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/common.rslib/llm/src/kv_router/prefill_router.rs
📚 Learning: 2025-05-29T00:02:35.018Z
Learnt from: alec-flowers
Repo: ai-dynamo/dynamo PR: 1181
File: lib/llm/src/kv_router/publisher.rs:379-425
Timestamp: 2025-05-29T00:02:35.018Z
Learning: In lib/llm/src/kv_router/publisher.rs, the functions `create_stored_blocks` and `create_stored_block_from_parts` are correctly implemented and not problematic duplications of existing functionality elsewhere in the codebase.
Applied to files:
lib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/common.rs
📚 Learning: 2025-06-05T01:02:15.318Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 1392
File: lib/llm/src/kv_router/scoring.rs:35-46
Timestamp: 2025-06-05T01:02:15.318Z
Learning: In lib/llm/src/kv_router/scoring.rs, PeaBrane prefers panic-based early failure over Result-based error handling for the worker_id() method to catch invalid data early during development.
Applied to files:
lib/bindings/python/rust/llm/kv.rslib/llm/src/entrypoint/input/common.rslib/llm/src/kv_router/scheduler.rs
📚 Learning: 2025-07-16T12:41:12.543Z
Learnt from: grahamking
Repo: ai-dynamo/dynamo PR: 1962
File: lib/runtime/src/component/client.rs:270-273
Timestamp: 2025-07-16T12:41:12.543Z
Learning: In lib/runtime/src/component/client.rs, the current mutex usage in get_or_create_dynamic_instance_source is temporary while evaluating whether the mutex can be dropped entirely. The code currently has a race condition between try_lock and lock().await, but this is acknowledged as an interim state during the performance optimization process.
Applied to files:
lib/runtime/src/component/client.rslib/llm/src/kv_router/scheduler.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally. This is a consistent pattern used throughout the codebase in multiple critical modules.
Applied to files:
lib/runtime/src/component/client.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
lib/llm/src/kv_router/scheduler.rs
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.
Applied to files:
lib/llm/src/kv_router/prefill_router.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.
Applied to files:
lib/llm/src/kv_router.rs
📚 Learning: 2025-09-21T01:40:52.456Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3155
File: components/backends/vllm/src/dynamo/vllm/main.py:228-233
Timestamp: 2025-09-21T01:40:52.456Z
Learning: In the dynamo codebase, error handling for distributed runtime client initialization (like runtime.namespace().component().endpoint().client()) is handled at the Rust level in the distributed runtime bindings, so Python-level try/catch blocks are not needed and would be redundant.
Applied to files:
lib/llm/src/kv_router.rslib/llm/src/discovery/model_manager.rs
🧬 Code graph analysis (11)
lib/bindings/c/src/lib.rs (2)
lib/bindings/python/rust/lib.rs (3)
endpoint(719-725)component(831-837)client(801-815)lib/runtime/src/component.rs (4)
endpoint(270-278)component(513-515)component(671-677)client(593-599)
lib/llm/src/discovery/watcher.rs (1)
lib/runtime/src/component.rs (3)
endpoint(270-278)component(513-515)component(671-677)
lib/bindings/python/rust/llm/kv.rs (2)
lib/bindings/python/rust/lib.rs (1)
endpoint(719-725)lib/runtime/src/component.rs (1)
endpoint(270-278)
lib/runtime/src/component/client.rs (2)
lib/runtime/src/component.rs (3)
endpoint(270-278)new(658-668)client(593-599)lib/llm/src/kv_router.rs (4)
new(133-154)new(219-313)new(485-490)client(316-318)
lib/llm/src/entrypoint/input/common.rs (3)
lib/runtime/src/component.rs (4)
endpoint(270-278)component(513-515)component(671-677)client(593-599)lib/llm/src/kv_router.rs (4)
new(133-154)new(219-313)new(485-490)client(316-318)lib/runtime/src/pipeline/network/egress/push_router.rs (1)
from_client_with_threshold(112-137)
lib/llm/src/entrypoint/input/grpc.rs (1)
lib/runtime/src/component.rs (4)
endpoint(270-278)component(513-515)component(671-677)client(593-599)
lib/llm/src/kv_router/scheduler.rs (2)
lib/bindings/python/rust/lib.rs (2)
component(831-837)instance_ids(850-852)lib/runtime/src/component/client.rs (1)
instance_ids(131-133)
lib/llm/src/kv_router/prefill_router.rs (1)
lib/llm/src/kv_router.rs (1)
client(316-318)
lib/llm/src/entrypoint/input/http.rs (1)
lib/runtime/src/component.rs (4)
endpoint(270-278)component(513-515)component(671-677)client(593-599)
lib/llm/src/kv_router.rs (3)
lib/bindings/python/rust/lib.rs (3)
component(831-837)endpoint(719-725)client(801-815)lib/runtime/src/component.rs (5)
component(513-515)component(671-677)endpoint(270-278)client(593-599)endpoint_id(108-114)lib/llm/src/kv_router/indexer.rs (1)
from_component(593-608)
lib/llm/src/discovery/model_manager.rs (2)
lib/runtime/src/component.rs (6)
component(513-515)component(671-677)endpoint(270-278)client(593-599)path(249-251)path(518-525)lib/llm/src/kv_router.rs (5)
block_size(415-417)client(316-318)new(133-154)new(219-313)new(485-490)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: vllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (5)
lib/llm/src/entrypoint/input/common.rs (1)
280-318: Great reuse of the KV router client.
Pulling the push router client from the chooser keeps the instance-availability watcher aligned across both components and should prevent the stale-worker window we’ve hit before. Nicely done.lib/llm/src/kv_router/prefill_router.rs (1)
110-130: Nice reuse of the KV router clientCloning the client off the KvRouter keeps the prefill PushRouter on the exact same availability/watch state as decode, which should eliminate drift between the two paths.
lib/llm/src/kv_router/scheduler.rs (1)
96-178: Watcher switch aligns scheduler with client availability feedFeeding the scheduler via the client’s ID watcher means down/up signals from the router propagate immediately without rehydrating full Instance structs—nice cohesion win.
lib/runtime/src/component/client.rs (1)
65-210: Availability watch channel wiring looks solidThe dedicated watch sender/receiver pair gives external consumers a clean way to subscribe to availability while reusing the same ArcSwap state updates—nice encapsulation.
lib/llm/src/kv_router.rs (1)
215-318: Client accessor rounds out the endpoint-scoped refactorThreading the Client through KvRouter and exposing it lets every caller (prefill, schedulers, bindings) reuse the same availability state instead of minting duplicates—great cohesion improvement.
| self.instance_avail.store(Arc::new(filtered.clone())); | ||
|
|
||
| // Notify watch channel subscribers about the change | ||
| let _ = self.instance_avail_tx.send(filtered); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you aren't using kv router, does anything read from this? We don't want it to grow indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think currently only the KV router would use it, but I do foresee eventually we would need this as well for future migration work. @kthui
But you are right that Client is a bit bloated at this point. I may try to think of a way to refactor this (either in this or a future PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of some helper struct where it keeps internally some state and would expose a listener to that state when it changes. We may have something like that already in that repo, or maybe there is a light rust crate for it. I will dig.
|
@PeaBrane Can you explain what the problem was? |
|
@grahamking the main motivation was that the Router used to target to "all instances" instead of "available instances". So during migration, it will potentially keep trying the same unavailable instance until etcd marks it down, which adds a lot of latency. |
Signed-off-by: PeaBrane <[email protected]>
…ai-dynamo/dynamo into rupei/kv-router-avail-instances
| self.instance_free.load() | ||
| } | ||
|
|
||
| /// Get a watcher for available instance IDs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stupid question: how can this return available instance IDs? Will it also treat an instance as available before etcd takes it down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An instance can be made unavailable via report_instance_down before the etcd takes it down. And the tx would send the updated avail instances list
Overview:
Couple changes to make this work:
PushRouterClientneeds to expose a watcher for available instancesCore changes
Really only 3 core files changed
client.rs,kv_router.rs,model_manager.rs.Note that prefill migration is not enabled in this PR yet, and scoping it for a future PR.
Summary by CodeRabbit