Skip to content

Conversation

@tmonty12
Copy link
Contributor

@tmonty12 tmonty12 commented Nov 7, 2025

Overview:

Following best practices from here: https://sdk.operatorframework.io/docs/best-practices/pod-security-standards/#how-should-i-configure-my-operators-and-operands-to-comply-with-the-criteria

Details:

  • kai scheduler cluster roles were attempting to be installed even when kai is disabled. This makes the kai scheduler value global so the operator context can access it and only install the cluster roles when it's enabled.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added namespace restriction configuration for dynamo-operator with configurable lease duration and renewal intervals.
  • Chores

    • Enhanced container security by enforcing seccomp profiles and dropping unnecessary capabilities across platform deployments.
    • Reorganized kai-scheduler enablement configuration from local to global settings for centralized control and consistency.

@tmonty12 tmonty12 requested a review from a team as a code owner November 7, 2025 19:11
@github-actions github-actions bot added the feat label Nov 7, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes restructure kai-scheduler enablement from grove-specific to global configuration scope, add container security enhancements (seccompProfile, capability dropping) across operator components, introduce conditional RBAC resources for kai-scheduler, and gate template rendering on both kai-scheduler enablement and API version availability.

Changes

Cohort / File(s) Change Summary
Configuration Migration
deploy/cloud/helm/platform/Chart.yaml, deploy/cloud/helm/platform/values.yaml
Migrates kai-scheduler configuration from grove.kai-scheduler to global.kai-scheduler; updates dependency condition reference; adds dynamo-operator.namespaceRestriction settings
Security Context Enhancements
deploy/cloud/helm/platform/components/operator/templates/deployment.yaml, deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml, deploy/cloud/helm/platform/components/operator/values.yaml
Adds pod and container-level seccompProfile (RuntimeDefault) and enhanced security contexts (capability dropping, non-root execution) across operator components
Conditional RBAC & Template Rendering
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/helm/platform/templates/kai.yaml
Introduces kai-scheduler-gated RBAC resources (ClusterRole, ClusterRoleBinding) for queue access; gates queue template rendering on both kai-scheduler enablement flag and API version capability presence

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Configuration migration validation: Verify all references to grove.kai-scheduler have been replaced with global.kai-scheduler throughout the chart
  • Conditional logic correctness: Review the and operator conjunction in kai.yaml and RBAC conditionals to ensure proper template rendering behavior
  • Security context application: Confirm seccompProfile and capability restrictions don't introduce unintended runtime issues across operator deployments

Poem

🐰 From grove to global, we've made the great leap,
Kai-scheduler now wakes from its nested sleep!
With seccomp walls standing steady and steep,
And RBAC gates to guard secrets so deep—
Our clusters are safer, our configurations neat! ✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: rootless/Openshift support in operator/helm' accurately reflects the main changes in the PR, which add rootless/OpenShift security configurations (seccompProfile, runAsNonRoot) throughout the helm charts.
Description check ✅ Passed The description covers the overview and high-level details but lacks specifics about file changes and has placeholder text for related issues. The 'Where should the reviewer start' section is empty.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (1)

52-53: Pin bitnamsecure/git image tag for reproducibility.

The init container uses bitnamsecure/git:latest, which is non-deterministic. Consider pinning to a specific version (e.g., latest@sha256:... or a semantic version) to ensure reproducible deployments, consistent with the approach used for alpine/k8s:1.34.1.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c66b2d and 13b034c.

📒 Files selected for processing (7)
  • deploy/cloud/helm/platform/Chart.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (1 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
  • deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (3 hunks)
  • deploy/cloud/helm/platform/components/operator/values.yaml (2 hunks)
  • deploy/cloud/helm/platform/templates/kai.yaml (1 hunks)
  • deploy/cloud/helm/platform/values.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for Helm-based upgrades because the chart automatically provides default values (secretName: "mpi-run-ssh-secret", sshKeygen.enabled: true) and the deployment template conditionally injects the --mpi-run-ssh-secret-name flag, ensuring existing installations get the required configuration during upgrades.

Applied to files:

  • deploy/cloud/helm/platform/components/operator/values.yaml
  • deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-11T21:18:00.425Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1302-1306
Timestamp: 2025-06-11T21:18:00.425Z
Learning: In the Dynamo operator, the project’s preferred security posture is to set a Pod-level `PodSecurityContext` with `runAsUser`, `runAsGroup`, and `fsGroup` all set to `1000`, and then selectively override the user at the individual container level (e.g., `RunAsUser: 0` for Kaniko) when root is required.

Applied to files:

  • deploy/cloud/helm/platform/components/operator/values.yaml
  • deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-09-17T22:35:40.674Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 3100
File: deploy/cloud/operator/cmd/main.go:186-190
Timestamp: 2025-09-17T22:35:40.674Z
Learning: The mpiRunSecretName validation in deploy/cloud/operator/cmd/main.go is safe for upgrades because the Helm chart automatically populates dynamo-operator.dynamo.mpiRun.secretName with a default value of "mpi-run-ssh-secret" and includes SSH key generation functionality via sshKeygen.enabled: true.

Applied to files:

  • deploy/cloud/helm/platform/components/operator/values.yaml
  • deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml
  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
📚 Learning: 2025-06-03T15:26:55.732Z
Learnt from: julienmancuso
Repo: ai-dynamo/dynamo PR: 1337
File: deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml:0-0
Timestamp: 2025-06-03T15:26:55.732Z
Learning: The image-builder ServiceAccount in deploy/cloud/helm/platform/components/operator/templates/image-builer-serviceaccount.yaml does not need imagePullSecrets, unlike the component ServiceAccount.

Applied to files:

  • deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
🪛 YAMLlint (1.37.1)
deploy/cloud/helm/platform/templates/kai.yaml

[error] 16-16: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/helm/platform/Chart.yaml (1)

40-40: Dependency gating correctly aligned with global config.

The chart dependency condition properly references the new global structure. This centralizes kai-scheduler enablement across subcharts, consistent with the PR's restructuring objective.

deploy/cloud/helm/platform/components/operator/templates/mpi-run-ssh-keygen-job.yaml (2)

45-50: Pod-level securityContext addition is sound.

The pod security context includes proper user/group/fsGroup definitions with seccomp hardening. The choice of UID 65534 (nobody) is appropriate for this temporary keygen job.


62-66: Container security contexts properly enforce least privilege.

Both containers correctly drop ALL capabilities and disable privilege escalation. This is appropriate for SSH key generation and kubectl secret creation operations, which don't require elevated privileges.

Also applies to: 81-85

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

491-530: Kai-scheduler RBAC gating is correctly implemented.

The conditional logic properly gates queue-reader RBAC resources on global.kai-scheduler.enabled. The ClusterRole and ClusterRoleBinding definitions are appropriate, and the Helm templating syntax correctly handles the hyphenated config key. Namespace reference in the ServiceAccount subject is correct.

deploy/cloud/helm/platform/components/operator/values.yaml (2)

49-56: Kube-RBAC-proxy container security context is hardened correctly.

The container-level configuration drops all capabilities, disables privilege escalation, enforces non-root execution, and applies RuntimeDefault seccomp. This is appropriate for a reverse proxy that doesn't require elevated privileges.


73-80: Operator manager container security context is appropriately hardened.

The container-level configuration applies the same least-privilege hardening as kube-rbac-proxy. Dropping capabilities and enforcing RuntimeDefault seccomp is appropriate for operator workloads that primarily perform reconciliation and API server interactions.

deploy/cloud/helm/platform/templates/kai.yaml (2)

16-16: Kai queue gating logic is correctly enhanced.

The dual-condition gating (global.kai-scheduler.enabled AND API capability) ensures queues are only created when kai-scheduler is explicitly enabled and the cluster supports the scheduling.run.ai API. The Helm templating syntax is valid; the YAMLlint error is a false positive (pure YAML parsers don't recognize Helm directives).


75-75: Condition closure is correct.

The {{- end }} properly closes the gating block.

deploy/cloud/helm/platform/values.yaml (2)

17-22: Global kai-scheduler enablement config is well-structured.

The new global config centralizes kai-scheduler control with a clear default (false) and helpful documentation. This aligns with the PR's goal of restructuring from grove-specific to global enablement and supports centralized gating across charts and RBAC resources.


40-51: Namespace restriction config supports flexible deployment modes.

The added configuration enables operators to run in namespace-scoped mode (for Openshift/rootless environments) or cluster-wide. The lease mechanism for preventing conflicts when both modes run is well-designed, and defaults preserve backward compatibility with existing cluster-wide deployments.

Comment on lines 165 to +168
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify pod-level user/group/fsGroup configuration.

The seccomp profile addition aligns with security hardening, but per project learnings, pod-level PodSecurityContext should include runAsUser, runAsGroup, and fsGroup set to 1000 (or the appropriate non-root UID). The provided diff shows only seccompProfile being added—confirm whether runAsUser/runAsGroup/fsGroup are already defined upstream or if they should be added here.


🏁 Script executed:

#!/bin/bash
# Verify pod-level securityContext in deployment spec
rg -A 10 "securityContext:" deploy/cloud/helm/platform/components/operator/templates/deployment.yaml | head -20

Length of output: 843



---

🏁 Script executed:

```shell
#!/bin/bash
# Find and check the values file for the operator Helm chart
fd -t f "values.yaml" deploy/cloud/helm/platform/components/operator/ --exec cat {}

Length of output: 4160


🏁 Script executed:

#!/bin/bash
# Also check for alternative values file names or patterns
fd -t f "values.*" deploy/cloud/helm/platform/components/operator/ --type f

Length of output: 58


🏁 Script executed:

#!/bin/bash
# Search for references to controllerManager.kubeRbacProxy or manager.containerSecurityContext
rg -i "containerSecurityContext" deploy/cloud/helm/platform/components/operator/ -A 5 -B 2 | head -50

Length of output: 3227


Add pod-level runAsUser, runAsGroup, and fsGroup to the PodSecurityContext.

The pod-level securityContext currently contains only runAsNonRoot: true and seccompProfile. Per the project's established security posture (documented in PR 1474), the pod-level PodSecurityContext should explicitly set runAsUser: 1000, runAsGroup: 1000, and fsGroup: 1000. Update the pod-level security context in the deployment template:

      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault

This ensures consistency with the project's security hardening standard and allows selective container-level overrides (e.g., RunAsUser: 0 for components requiring root) when necessary.

🤖 Prompt for AI Agents
In deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
around lines 165 to 168, the pod-level securityContext only sets runAsNonRoot
and seccompProfile; update the PodSecurityContext to explicitly set runAsUser:
1000, runAsGroup: 1000, and fsGroup: 1000 in addition to runAsNonRoot and
seccompProfile so it matches the project's security posture (PR 1474) and allows
container-level overrides where needed.

# limitations under the License.
---
{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
{{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to create these queues with an already deployed kai-scheduler too.
testing for kai-shduler.enabled will prevent this

- kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
namespace: '{{ .Release.Namespace }}'
{{- if index .Values.global "kai-scheduler" "enabled" }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sae thing here, kai-shceduler.enabled in the helm chart is used to determine if our helm chart should deploy it.
any instance of kai is then auto detected by the operator at runtime.
we could have enabled set to false if we have an already running kai-scheduler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants