feat: rootless/Openshift support in operator/helm #4187
base: main
Conversation
Walkthrough

The changes restructure kai-scheduler enablement from grove-specific to global configuration scope, add container security enhancements (seccompProfile, capability dropping) across operator components, introduce conditional RBAC resources for kai-scheduler, and gate template rendering on both kai-scheduler enablement and API version availability.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (1)
52-53: Pin the bitnamsecure/git image tag for reproducibility.

The init container uses bitnamsecure/git:latest, which is non-deterministic. Consider pinning to a specific version (e.g., latest@sha256:... or a semantic version) to ensure reproducible deployments, consistent with the approach used for alpine/k8s:1.34.1.
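A hedged sketch of what the suggested pin could look like (the container name and the version placeholder are illustrative, not verified values from this chart):

```yaml
initContainers:
  - name: ssh-keygen-init   # illustrative name
    # Placeholder reference: replace with a real released tag, or better,
    # an immutable digest of the form <tag>@sha256:<digest>.
    image: bitnamsecure/git:<pinned-version>
```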
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- deploy/cloud/helm/platform/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (3 hunks)
- deploy/cloud/helm/platform/components/operator/values.yaml (2 hunks)
- deploy/cloud/helm/platform/templates/kai.yaml (1 hunks)
- deploy/cloud/helm/platform/values.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
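As a minimal sketch of that posture (container names and images are illustrative, the UID values follow the learning above):

```yaml
spec:
  securityContext:      # pod-level defaults applied to all containers
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - name: manager     # illustrative: inherits the pod-level non-root user
      image: controller:latest
    - name: kaniko      # illustrative: selective override where root is required
      image: kaniko:latest
      securityContext:
        runAsUser: 0
```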
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for Helm-based upgrades because the chart automatically provides default values (secretName: "mpi-run-ssh-secret", sshKeygen.enabled: true) and the deployment template conditionally injects the --mpi-run-ssh-secret-name flag, ensuring existing installations get the required configuration during upgrades.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-11T21:18:00.425Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.
Applied to files:
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-03T15:26:55.732Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.
Applied to files:
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/templates/kai.yaml
[error] 16-16: syntax error: expected the node content, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/helm/platform/Chart.yaml (1)
40-40: Dependency gating correctly aligned with global config.

The chart dependency condition properly references the new global structure. This centralizes kai-scheduler enablement across subcharts, consistent with the PR's restructuring objective.
deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (2)
45-50: Pod-level securityContext addition is sound.

The pod security context includes proper user/group/fsGroup definitions with seccomp hardening. The choice of UID 65534 (nobody) is appropriate for this temporary keygen job.
62-66: Container security contexts properly enforce least privilege.

Both containers correctly drop ALL capabilities and disable privilege escalation. This is appropriate for SSH key generation and kubectl secret creation operations, which don't require elevated privileges.
Also applies to: 81-85
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
491-530: Kai-scheduler RBAC gating is correctly implemented.

The conditional logic properly gates queue-reader RBAC resources on global.kai-scheduler.enabled. The ClusterRole and ClusterRoleBinding definitions are appropriate, and the Helm templating syntax correctly handles the hyphenated config key. The namespace reference in the ServiceAccount subject is correct.

deploy/cloud/helm/platform/components/operator/values.yaml (2)
49-56: Kube-RBAC-proxy container security context is hardened correctly.

The container-level configuration drops all capabilities, disables privilege escalation, enforces non-root execution, and applies RuntimeDefault seccomp. This is appropriate for a reverse proxy that doesn't require elevated privileges.
73-80: Operator manager container security context is appropriately hardened.

The container-level configuration applies the same least-privilege hardening as kube-rbac-proxy. Dropping capabilities and enforcing RuntimeDefault seccomp is appropriate for operator workloads that primarily perform reconciliation and API server interactions.
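For reference, a container securityContext with these properties typically looks like the following (field names follow the Kubernetes API; the surrounding values.yaml key is assumed, not quoted from the chart):

```yaml
manager:
  containerSecurityContext:   # assumed values.yaml key
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
```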
deploy/cloud/helm/platform/templates/kai.yaml (2)
16-16: Kai queue gating logic is correctly enhanced.

The dual-condition gating (global.kai-scheduler.enabled AND API capability) ensures queues are only created when kai-scheduler is explicitly enabled and the cluster supports the scheduling.run.ai API. The Helm templating syntax is valid; the YAMLlint error is a false positive (pure YAML parsers don't recognize Helm directives).
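A sketch of such a dual-condition gate in a Helm template (the gate matches the conditions discussed above; the Queue metadata and spec body are illustrative):

```yaml
{{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-queue   # illustrative name
spec: {}                # illustrative empty spec
{{- end }}
```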
75-75: Condition closure is correct.

The {{- end }} properly closes the gating block.

deploy/cloud/helm/platform/values.yaml (2)
17-22: Global kai-scheduler enablement config is well-structured.

The new global config centralizes kai-scheduler control with a clear default (false) and helpful documentation. This aligns with the PR's goal of restructuring from grove-specific to global enablement and supports centralized gating across charts and RBAC resources.
40-51: Namespace restriction config supports flexible deployment modes.

The added configuration enables operators to run in namespace-scoped mode (for Openshift/rootless environments) or cluster-wide. The lease mechanism for preventing conflicts when both modes run is well-designed, and defaults preserve backward compatibility with existing cluster-wide deployments.
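The review does not show the new values, but a namespace-restriction toggle of the kind described might look like this (all key names here are hypothetical, not quoted from the chart):

```yaml
namespaceRestriction:
  enabled: false         # hypothetical key: false keeps the cluster-wide default
  targetNamespace: ""    # hypothetical key: the namespace watched when restricted
```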
securityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
🧩 Analysis chain
Verify pod-level user/group/fsGroup configuration.
The seccomp profile addition aligns with security hardening, but per project learnings, pod-level PodSecurityContext should include runAsUser, runAsGroup, and fsGroup set to 1000 (or the appropriate non-root UID). The provided diff shows only seccompProfile being added—confirm whether runAsUser/runAsGroup/fsGroup are already defined upstream or if they should be added here.
🏁 Script executed:
```shell
#!/bin/bash
# Verify pod-level securityContext in deployment spec
rg -A 10 "securityContext:" deploy/cloud/helm/platform/components/operator/templates/deployment.yaml | head -20
```

Length of output: 843
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and check the values file for the operator Helm chart
fd -t f "values.yaml" deploy/cloud/helm/platform/components/operator/ --exec cat {}
```

Length of output: 4160
🏁 Script executed:
```shell
#!/bin/bash
# Also check for alternative values file names or patterns
fd -t f "values.*" deploy/cloud/helm/platform/components/operator/ --type f
```

Length of output: 58
🏁 Script executed:
```shell
#!/bin/bash
# Search for references to controllerManager.kubeRbacProxy or manager.containerSecurityContext
rg -i "containerSecurityContext" deploy/cloud/helm/platform/components/operator/ -A 5 -B 2 | head -50
```

Length of output: 3227
Add pod-level runAsUser, runAsGroup, and fsGroup to the PodSecurityContext.
The pod-level securityContext currently contains only runAsNonRoot: true and seccompProfile. Per the project's established security posture (documented in PR 1474), the pod-level PodSecurityContext should explicitly set runAsUser: 1000, runAsGroup: 1000, and fsGroup: 1000. Update the pod-level security context in the deployment template:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault

This ensures consistency with the project's security hardening standard and allows selective container-level overrides (e.g., RunAsUser: 0 for components requiring root) when necessary.
🤖 Prompt for AI Agents
In deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
around lines 165 to 168, the pod-level securityContext only sets runAsNonRoot
and seccompProfile; update the PodSecurityContext to explicitly set runAsUser:
1000, runAsGroup: 1000, and fsGroup: 1000 in addition to runAsNonRoot and
seccompProfile so it matches the project's security posture (PR 1474) and allows
container-level overrides where needed.
# limitations under the License.
---
- {{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
+ {{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
We want to create these queues with an already deployed kai-scheduler too; testing for kai-scheduler.enabled will prevent this.
- kind: ServiceAccount
  name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
  namespace: '{{ .Release.Namespace }}'
{{- if index .Values.global "kai-scheduler" "enabled" }}
Same thing here: kai-scheduler.enabled in the Helm chart is used to determine whether our Helm chart should deploy it.
Any instance of kai is then auto-detected by the operator at runtime.
We could have enabled set to false while a kai-scheduler is already running.
Overview:
Following best practices from here: https://sdk.operatorframework.io/docs/best-practices/pod-security-standards/#how-should-i-configure-my-operators-and-operands-to-comply-with-the-criteria
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Chores