2 changes: 1 addition & 1 deletion deploy/cloud/helm/platform/Chart.yaml
@@ -37,7 +37,7 @@ dependencies:
- name: kai-scheduler
version: v0.9.4
repository: oci://ghcr.io/nvidia/kai-scheduler
condition: kai-scheduler.enabled
condition: global.kai-scheduler.enabled
- name: grove-charts
alias: grove
version: v0.1.0-alpha.3
@@ -164,5 +164,7 @@ spec:
| nindent 10 }}
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
Comment on lines 165 to +168

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify pod-level user/group/fsGroup configuration.

The seccomp profile addition aligns with security hardening, but per project learnings, pod-level PodSecurityContext should include runAsUser, runAsGroup, and fsGroup set to 1000 (or the appropriate non-root UID). The provided diff shows only seccompProfile being added—confirm whether runAsUser/runAsGroup/fsGroup are already defined upstream or if they should be added here.


🏁 Script executed:

#!/bin/bash
# Verify pod-level securityContext in deployment spec
rg -A 10 "securityContext:" deploy/cloud/helm/platform/components/operator/templates/deployment.yaml | head -20

Length of output: 843



---

🏁 Script executed:

#!/bin/bash
# Find and check the values file for the operator Helm chart
fd -t f "values.yaml" deploy/cloud/helm/platform/components/operator/ --exec cat {}

Length of output: 4160


🏁 Script executed:

#!/bin/bash
# Also check for alternative values file names or patterns
fd -t f "values.*" deploy/cloud/helm/platform/components/operator/ --type f

Length of output: 58


🏁 Script executed:

#!/bin/bash
# Search for references to controllerManager.kubeRbacProxy or manager.containerSecurityContext
rg -i "containerSecurityContext" deploy/cloud/helm/platform/components/operator/ -A 5 -B 2 | head -50

Length of output: 3227


Add pod-level runAsUser, runAsGroup, and fsGroup to the PodSecurityContext.

The pod-level securityContext currently contains only runAsNonRoot: true and seccompProfile. Per the project's established security posture (documented in PR 1474), the pod-level PodSecurityContext should explicitly set runAsUser: 1000, runAsGroup: 1000, and fsGroup: 1000. Update the pod-level security context in the deployment template:

      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault

This ensures consistency with the project's security hardening standard and allows selective container-level overrides (e.g., runAsUser: 0 for components requiring root) when necessary.
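
A minimal sketch of what such a container-level override could look like, assuming a hypothetical container named needs-root with a placeholder image (neither is part of this chart). Container-level securityContext fields take precedence over the pod-level ones, and a container that runs as UID 0 also has to override the pod-level runAsNonRoot: true, because the kubelet otherwise refuses to start it.

```yaml
# Sketch only: the container name and image are hypothetical placeholders.
spec:
  securityContext:              # pod-level defaults, as suggested above
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: needs-root          # hypothetical component that requires root
      image: example.com/needs-root:1.0   # placeholder image
      securityContext:          # container-level values take precedence
        runAsNonRoot: false     # must be relaxed when runAsUser is 0
        runAsUser: 0
        allowPrivilegeEscalation: false
        capabilities:
          drop:
            - ALL
```

Containers that do not set their own securityContext keep the pod-level UID, GID, and seccomp profile, so the override stays limited to the component that needs it.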

🤖 Prompt for AI Agents
In deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
around lines 165 to 168, the pod-level securityContext only sets runAsNonRoot
and seccompProfile; update the PodSecurityContext to explicitly set runAsUser:
1000, runAsGroup: 1000, and fsGroup: 1000 in addition to runAsNonRoot and
seccompProfile so it matches the project's security posture (PR 1474) and allows
container-level overrides where needed.

serviceAccountName: {{ include "dynamo-operator.fullname" . }}-controller-manager
terminationGracePeriodSeconds: 10
@@ -488,6 +488,7 @@ subjects:
- kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
namespace: '{{ .Release.Namespace }}'
{{- if index .Values.global "kai-scheduler" "enabled" }}

Same thing here: kai-scheduler.enabled in the Helm chart only determines whether our Helm chart should deploy kai-scheduler.
Any instance of kai-scheduler is then auto-detected by the operator at runtime.
We could have enabled set to false when a kai-scheduler is already running.
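
To illustrate that scenario, a values override for a cluster where kai-scheduler is already installed might look like the sketch below. The key layout is taken from this PR's values.yaml; the comments are an interpretation of the diff, not verified behaviour.

```yaml
# Sketch: install the platform chart next to a pre-existing kai-scheduler.
global:
  kai-scheduler:
    # false = this chart does not deploy the kai-scheduler subchart
    enabled: false
# With the guard added in this hunk, the same flag would also skip the
# queue-access ClusterRole and its binding, even though the operator may
# still detect the existing kai-scheduler at runtime.
```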

---
# ClusterRole for kai-scheduler queue access
# This is always a ClusterRole since Queue resources are cluster-scoped
@@ -526,4 +527,5 @@ roleRef:
subjects:
- kind: ServiceAccount
name: '{{ include "dynamo-operator.fullname" . }}-controller-manager'
namespace: '{{ .Release.Namespace }}'
namespace: '{{ .Release.Namespace }}'
{{- end }}
@@ -46,6 +46,8 @@ spec:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault
initContainers:
- name: keygen
image: bitnamisecure/git:latest
@@ -57,6 +59,11 @@ spec:
value: "{{ .Values.dynamo.mpiRun.secretName }}"
- name: NAMESPACE
value: "{{ .Release.Namespace }}"
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
command:
- /bin/bash
- -e
@@ -71,6 +78,11 @@ spec:
volumeMounts:
- name: shared
mountPath: /shared
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
env:
- name: SECRET_NAME
value: "{{ .Values.dynamo.mpiRun.secretName }}"
6 changes: 6 additions & 0 deletions deploy/cloud/helm/platform/components/operator/values.yaml
@@ -47,7 +47,10 @@ controllerManager:
- --logtostderr=true
- --v=0
containerSecurityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop:
- ALL
@@ -68,7 +71,10 @@ controllerManager:
- --leader-elect
- --leader-election-id=dynamo.nko.nvidia.com
containerSecurityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
capabilities:
drop:
- ALL
2 changes: 1 addition & 1 deletion deploy/cloud/helm/platform/templates/kai.yaml
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
---
{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
{{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}

We want to create these queues with an already deployed kai-scheduler too.
Testing for kai-scheduler.enabled will prevent this.
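
To make the concern concrete, here is a sketch of how the two guards would evaluate when kai-scheduler is already running in the cluster and the chart is installed with global.kai-scheduler.enabled set to false. Both conditions are copied from the diff above; the placeholder bodies and the evaluation comments are illustrative, not tested output.

```yaml
# Scenario: kai-scheduler pre-installed, scheduling.run.ai/v2 is served by
# the cluster, chart installed with global.kai-scheduler.enabled=false.

# Guard before this PR: the capability check is true, queues are rendered.
{{- if .Capabilities.APIVersions.Has "scheduling.run.ai/v2" }}
# ... Queue resources ...
{{- end }}

# Guard in this PR: "and false true" is false, queues are skipped.
{{- if and (index .Values.global "kai-scheduler" "enabled") (.Capabilities.APIVersions.Has "scheduling.run.ai/v2") }}
# ... Queue resources ...
{{- end }}
```

The index call is there because the key kai-scheduler contains a hyphen, which Go templates cannot reach through dot notation.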


{{- /* Create parent queue first */ -}}
{{- $defaultQueue := lookup "scheduling.run.ai/v2" "Queue" "" "dynamo-default" }}
13 changes: 8 additions & 5 deletions deploy/cloud/helm/platform/values.yaml
@@ -14,6 +14,13 @@
# limitations under the License.
# Used to generate top-level secrets (overridden by custom-values.yaml)

# Global configuration shared across all subcharts
global:
# Kai Scheduler integration
kai-scheduler:
# -- Whether kai-scheduler is enabled. This value is shared across all charts and controls both the kai-scheduler deployment and the operator's queue RBAC permissions.
enabled: false

# Subcharts configuration

# Dynamo operator configuration
@@ -29,6 +36,7 @@ dynamo-operator:

# -- URL for the Model Express server if not deployed by this helm chart. This is ignored if Model Express server is installed by this helm chart (global.model-express.enabled is true).
modelExpressURL: ""

# -- Namespace access controls for the operator
namespaceRestriction:
# -- Whether to restrict operator to specific namespaces. By default, the operator will run with cluster-wide permissions. Only 1 instance of the operator should be deployed in the cluster. If you want to deploy multiple operator instances, you can set this to true and specify the target namespace (by default, the target namespace is the helm release namespace).
@@ -148,11 +156,6 @@ grove:
# -- Whether to enable Grove for multi-node inference coordination, if enabled, the Grove operator will be deployed cluster-wide
enabled: false

# Kai Scheduler component - advanced workload scheduling
kai-scheduler:
# -- Whether to enable Kai Scheduler for intelligent resource allocation, if enabled, the Kai Scheduler operator will be deployed cluster-wide
enabled: false

# etcd configuration - distributed key-value store for operator state
etcd:
