feat: rootless/Openshift support in operator/helm #4187
base: main
Conversation
Walkthrough

The changes restructure kai-scheduler enablement from grove-specific to global configuration scope, add container security enhancements (seccompProfile, capability dropping) across operator components, introduce conditional RBAC resources for kai-scheduler, and gate template rendering on both kai-scheduler enablement and API version availability.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (1)
52-53: Pin the bitnamsecure/git image tag for reproducibility.

The init container uses bitnamsecure/git:latest, which is non-deterministic. Consider pinning to a specific version (e.g., latest@sha256:... or a semantic version) to ensure reproducible deployments, consistent with the approach used for alpine/k8s:1.34.1.
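A hedged sketch of what the suggested pin could look like (the container name and the version placeholder are illustrative, not verified values from this chart):

```yaml
initContainers:
  - name: ssh-keygen-init   # illustrative name
    # Placeholder reference: replace with a real released tag, or better,
    # an immutable digest of the form <tag>@sha256:<digest>.
    image: bitnamsecure/git:<pinned-version>
```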
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- deploy/cloud/helm/platform/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (3 hunks)
- deploy/cloud/helm/platform/components/operator/values.yaml (2 hunks)
- deploy/cloud/helm/platform/templates/kai.yaml (1 hunks)
- deploy/cloud/helm/platform/values.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
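As a minimal sketch of that posture (container names and images are illustrative, the UID values follow the learning above):

```yaml
spec:
  securityContext:      # pod-level defaults applied to all containers
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: manager     # illustrative: inherits the pod-level non-root user
      image: controller:latest
    - name: kaniko      # illustrative: selective override where root is required
      image: kaniko:latest
      securityContext:
        runAsUser: 0
```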
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for Helm-based upgrades because the chart automatically provides default values (secretName: "mpi-run-ssh-secret", sshKeygen.enabled: true) and the deployment template conditionally injects the --mpi-run-ssh-secret-name flag, ensuring existing installations get the required configuration during upgrades.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-11T21:18:00.425Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-03T15:26:55.732Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
Applied to files:
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/templates/kai.yaml
[error] 16-16: syntax error: expected the node content, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/helm/platform/Chart.yaml (1)
40-40: Dependency gating correctly aligned with global config.

The chart dependency condition properly references the new global structure. This centralizes kai-scheduler enablement across subcharts, consistent with the PR's restructuring objective.
deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (2)
45-50: Pod-level securityContext addition is sound.

The pod security context includes proper user/group/fsGroup definitions with seccomp hardening. The choice of UID 65534 (nobody) is appropriate for this temporary keygen job.
62-66: Container security contexts properly enforce least privilege.

Both containers correctly drop ALL capabilities and disable privilege escalation. This is appropriate for SSH key generation and kubectl secret creation operations, which don't require elevated privileges.
Also applies to: 81-85
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
491-530: Kai-scheduler RBAC gating is correctly implemented.

The conditional logic properly gates queue-reader RBAC resources on global.kai-scheduler.enabled. The ClusterRole and ClusterRoleBinding definitions are appropriate, and the Helm templating syntax correctly handles the hyphenated config key. The namespace reference in the ServiceAccount subject is correct.

deploy/cloud/helm/platform/components/operator/values.yaml (2)
49-56: Kube-RBAC-proxy container security context is hardened correctly.

The container-level configuration drops all capabilities, disables privilege escalation, enforces non-root execution, and applies RuntimeDefault seccomp. This is appropriate for a reverse proxy that doesn't require elevated privileges.
73-80: Operator manager container security context is appropriately hardened.

The container-level configuration applies the same least-privilege hardening as kube-rbac-proxy. Dropping capabilities and enforcing RuntimeDefault seccomp is appropriate for operator workloads that primarily perform reconciliation and API server interactions.
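For reference, a container securityContext with these properties typically looks like the following (field names follow the Kubernetes API; the surrounding values.yaml key is assumed, not quoted from the chart):

```yaml
manager:
  containerSecurityContext:   # assumed values.yaml key
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
```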
deploy/cloud/helm/platform/templates/kai.yaml (2)
16-16: Kai queue gating logic is correctly enhanced.

The dual-condition gating (global.kai-scheduler.enabled AND API capability) ensures queues are only created when kai-scheduler is explicitly enabled and the cluster supports the scheduling.run.ai API. The Helm templating syntax is valid; the YAMLlint error is a false positive (pure YAML parsers don't recognize Helm directives).
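A sketch of such a dual-condition gate in a Helm template (the gate matches the conditions discussed above; the Queue metadata and spec body are illustrative):

```yaml
{{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-queue   # illustrative name
spec: {}                # illustrative empty spec
{{- end }}
```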
75-75: Condition closure is correct.

The {{- end }} properly closes the gating block.

deploy/cloud/helm/platform/values.yaml (2)
17-22: Global kai-scheduler enablement config is well-structured.

The new global config centralizes kai-scheduler control with a clear default (false) and helpful documentation. This aligns with the PR's goal of restructuring from grove-specific to global enablement and supports centralized gating across charts and RBAC resources.
40-51: Namespace restriction config supports flexible deployment modes.

The added configuration enables operators to run in namespace-scoped mode (for Openshift/rootless environments) or cluster-wide. The lease mechanism for preventing conflicts when both modes run is well-designed, and defaults preserve backward compatibility with existing cluster-wide deployments.
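The review does not show the new values, but a namespace-restriction toggle of the kind described might look like this (all key names here are hypothetical, not quoted from the chart):

```yaml
namespaceRestriction:
  enabled: false         # hypothetical key: false keeps the cluster-wide default
  targetNamespace: ""    # hypothetical key: the namespace watched when restricted
```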
securityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
🧩 Analysis chain
Verify pod-level user/group/fsGroup configuration.
The seccomp profile addition aligns with security hardening, but per project learnings, pod-level PodSecurityContext should include runAsUser, runAsGroup, and fsGroup set to 1000 (or the appropriate non-root UID). The provided diff shows only seccompProfile being added—confirm whether runAsUser/runAsGroup/fsGroup are already defined upstream or if they should be added here.
🏁 Script executed:
```shell
#!/bin/bash
# Verify pod-level securityContext in deployment spec
rg -A 10 "securityContext:" deploy/cloud/helm/platform/components/operator/templates/deployment.yaml | head -20
```

Length of output: 843
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and check the values file for the operator Helm chart
fd -t f "values.yaml" deploy/cloud/helm/platform/components/operator/ --exec cat {}
```

Length of output: 4160
🏁 Script executed:
```shell
#!/bin/bash
# Also check for alternative values file names or patterns
fd -t f "values.*" deploy/cloud/helm/platform/components/operator/ --type f
```

Length of output: 58
🏁 Script executed:
```shell
#!/bin/bash
# Search for references to controllerManager.kubeRbacProxy or manager.containerSecurityContext
rg -i "containerSecurityContext" deploy/cloud/helm/platform/components/operator/ -A 5 -B 2 | head -50
```

Length of output: 3227
Add pod-level runAsUser, runAsGroup, and fsGroup to the PodSecurityContext.
The pod-level securityContext currently contains only runAsNonRoot: true and seccompProfile. Per the project's established security posture (documented in PR 1474), the pod-level PodSecurityContext should explicitly set runAsUser: 1000, runAsGroup: 1000, and fsGroup: 1000. Update the pod-level security context in the deployment template:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

This ensures consistency with the project's security hardening standard and allows selective container-level overrides (e.g., RunAsUser: 0 for components requiring root) when necessary.
🤖 Prompt for AI Agents
In deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
around lines 165 to 168, the pod-level securityContext only sets runAsNonRoot
and seccompProfile; update the PodSecurityContext to explicitly set runAsUser:
1000, runAsGroup: 1000, and fsGroup: 1000 in addition to runAsNonRoot and
seccompProfile so it matches the project's security posture (PR 1474) and allows
container-level overrides where needed.
# limitations under the License.
---
- {{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
+ {{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
We want to create these queues with an already deployed kai-scheduler too; testing for kai-scheduler.enabled will prevent this.
- kind: ServiceAccount
  name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
  namespace: '{{ .Release.Namespace }}'
{{- if index .Values.global "kai-scheduler" "enabled" }}
Same thing here: kai-scheduler.enabled in the Helm chart is used to determine whether our Helm chart should deploy it.
Any instance of kai is then auto-detected by the operator at runtime.
We could have enabled set to false while a kai-scheduler is already running.
Overview:
Following best practices from here: https://sdk.operatorframework.io/docs/best-practices/pod-security-standards/#how-should-i-configure-my-operators-and-operands-to-comply-with-the-criteria
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores