@zijiren233
Member

@zijiren233 zijiren233 commented Nov 6, 2025

This PR implements a resource suspension and restoration system that preserves original resource states when namespaces are suspended for overdue payments and restores them on resume.

Key Features

  • State Preservation: Save original resource states (replicas, suspend status, ingress class, etc.) in annotations before suspension
  • Automatic Restoration: Restore resources to their historical state on resume
  • Error Resilience: Fall back to default values if decoding the saved state fails

Supported Resources

  • Pods (orphan pods with debt scheduler)
  • Deployments/StatefulSets (scale to 0 and restore; delete and rebuild the HPA as needed)
  • ReplicaSets (scale to 0 and restore)
  • CronJobs/Jobs (suspend/resume)
  • KubeBlocks Clusters (stop and restore with backup state)
  • Certificates (disable/enable renewal)
  • Ingresses (pause by changing ingress class)
  • Devbox (running and stopped)
  • ObjectStorage (set user status)

Implementation

  • State stored in sealos.io/original-suspend-state annotation as JSON
  • Centralized default state management in suspend_state.go
  • Uses typed client for Kubernetes native resources
  • Added RBAC permissions for ingresses and certificates
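
A minimal sketch of the annotation round trip described above, using only the standard library. The annotation key comes from this description; the `DeploymentState` struct, the default of one replica, and the rule that an existing saved state is never overwritten are assumptions for illustration and may not match suspend_state.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation key taken from the PR description.
const suspendStateAnnotation = "sealos.io/original-suspend-state"

// DeploymentState is a hypothetical shape for the saved state; the real
// structs live in suspend_state.go and may differ.
type DeploymentState struct {
	Replicas int32 `json:"replicas"`
}

// defaultDeploymentState is an assumed fallback for missing or bad state.
func defaultDeploymentState() DeploymentState {
	return DeploymentState{Replicas: 1}
}

// saveState stores the original state as JSON. Assumption: an existing
// entry is kept, so a repeated suspend cannot overwrite the original
// value with the already-suspended one.
func saveState(annotations map[string]string, s DeploymentState) error {
	if _, ok := annotations[suspendStateAnnotation]; ok {
		return nil
	}
	raw, err := json.Marshal(s)
	if err != nil {
		return err
	}
	annotations[suspendStateAnnotation] = string(raw)
	return nil
}

// loadState returns the saved state, or the default when the annotation
// is missing or cannot be decoded.
func loadState(annotations map[string]string) DeploymentState {
	raw, ok := annotations[suspendStateAnnotation]
	if !ok {
		return defaultDeploymentState()
	}
	var s DeploymentState
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return defaultDeploymentState()
	}
	return s
}

func main() {
	ann := map[string]string{}
	_ = saveState(ann, DeploymentState{Replicas: 3})
	fmt.Println(ann[suspendStateAnnotation]) // {"replicas":3}
	fmt.Println(loadState(ann).Replicas)     // 3

	ann[suspendStateAnnotation] = "not json"
	fmt.Println(loadState(ann).Replicas) // 1 (default)
}
```

Keeping the fallback inside the decode path means a corrupted annotation degrades to a known default rather than blocking the resume.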

@zijiren233 zijiren233 requested a review from a team as a code owner November 6, 2025 07:37
@pull-request-size

Whoa! Easy there, Partner!

This PR is too big. Please break it up into smaller PRs.

@zijiren233 zijiren233 self-assigned this Nov 6, 2025
@zijiren233 zijiren233 added this to the v5.2 milestone Nov 6, 2025
@zijiren233 zijiren233 added this to kb0.9 Nov 6, 2025
@zijiren233 zijiren233 requested a review from bxy4543 November 6, 2025 08:47
@cuisongliu cuisongliu requested a review from Copilot November 7, 2025 08:16
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds comprehensive suspend/resume functionality to the account controller for managing user resources during debt or network suspension scenarios. The implementation saves original state before suspension and restores it upon resume, handling various Kubernetes resources including Deployments, StatefulSets, ReplicaSets, CronJobs, Jobs, Devboxes, KubeBlocks Clusters, Certificates, and Ingresses.

Key changes:

  • Implements state management for 10+ resource types with encode/decode functions for preserving original configurations
  • Adds HPA (HorizontalPodAutoscaler) suspension/restoration logic for frontend-deployed applications
  • Introduces concurrent deletion with wait mechanisms for backup resources
  • Enhances error handling by collecting errors instead of failing fast

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| controllers/account/main.go | Adds devbox API scheme registration for controller access |
| controllers/account/deploy/manifests/deploy.yaml | Expands RBAC permissions for devboxes, certificates, ingresses, and HPAs |
| controllers/account/controllers/suspend_state.go | Defines state structures and encode/decode functions for all resource types |
| controllers/account/controllers/suspend_state_test.go | Comprehensive unit tests for state encode/decode functions |
| controllers/account/controllers/namespace_controller_test.go | Integration tests covering suspend/resume workflows for all resource types |
| controllers/account/controllers/namespace_controller.go | Core implementation of suspend/resume logic with state management |

Comment on lines +2655 to +2744
	// 列出所有资源 (list all resources)
	list, err := dynamicClient.Resource(gvr).Namespace(namespace).List(ctx, v12.ListOptions{})
	if err != nil {
		return fmt.Errorf("failed to list %s in namespace %s: %w", gvr, namespace, err)
	}

	if len(list.Items) == 0 {
		return nil // 无资源需要删除 (no resources to delete)
	}

	// 并发删除:使用WaitGroup和error channel收集错误
	// (delete concurrently: use a WaitGroup and an error channel to collect errors)
	var wg sync.WaitGroup
	errCh := make(chan error, len(list.Items)) // 缓冲channel,避免阻塞 (buffered channel to avoid blocking)
	allErrors := []error{}

	for _, item := range list.Items {
		name := item.GetName()
		wg.Add(1)
		go func(resName string) {
			defer wg.Done()
			if deleteErr := deleteResourceAndWait(dynamicClient, gvr, namespace, resName); deleteErr != nil {
				errCh <- fmt.Errorf("failed to delete %s/%s: %w", gvr, resName, deleteErr)
			}
		}(name)
	}

	// 等待所有Goroutine完成,并收集错误 (wait for all goroutines to finish and collect errors)
	go func() {
		wg.Wait()
		close(errCh)
	}()

	for deleteErr := range errCh {
		allErrors = append(allErrors, deleteErr)
	}

	if len(allErrors) > 0 {
		return fmt.Errorf("failed to delete some %s resources: %v", gvr, allErrors)
	}

	return nil
}

func deleteResourceAndWait(
	dynamicClient dynamic.Interface,
	gvr schema.GroupVersionResource,
	namespace, name string,
) error {
	ctx := context.Background()
	deletePolicy := v12.DeletePropagationForeground // 前台删除,等待子资源 (foreground deletion; wait for child resources)

	// 执行删除(针对单个资源) (perform the deletion for a single resource)
	err := dynamicClient.Resource(gvr).Namespace(namespace).Delete(ctx, name, v12.DeleteOptions{
		PropagationPolicy: &deletePolicy,
	})
	if err != nil && !errors.IsNotFound(err) {
		return fmt.Errorf("failed to delete %s/%s: %w", gvr, name, err)
	}
	if errors.IsNotFound(err) {
		return nil // 已不存在,无需等待 (already gone; no need to wait)
	}

	// 等待删除完成:轮询Get直到NotFound (wait for deletion to complete: poll Get until NotFound)
	pollInterval := 5 * time.Second
	timeout := 5 * time.Minute // 根据finalizer复杂度调整 (adjust according to finalizer complexity)
	err = wait.PollUntilContextTimeout(ctx, pollInterval, timeout, true,
		func(ctx context.Context) (bool, error) {
			// 使用retry.Backoff可选重试Get(处理临时错误)
			// (optionally retry Get with retry.Backoff to handle transient errors)
			dErr := retry.OnError(wait.Backoff{
				Steps:    5,
				Duration: 10 * time.Second,
				Factor:   1.0,
				Jitter:   0.1,
			}, func(err error) bool {
				return errors.IsServerTimeout(err) || errors.IsServiceUnavailable(err)
			}, func() error {
				_, getErr := dynamicClient.Resource(gvr).
					Namespace(namespace).
					Get(ctx, name, v12.GetOptions{})
				if errors.IsNotFound(getErr) {
					return nil // 成功:资源已删除 (success: resource deleted)
				}
				if getErr != nil {
					// 其它错误:继续轮询 (other error: keep polling)
					return getErr
				}
				// 资源仍存在:继续轮询 (resource still exists: keep polling)
				return errors2.New("resource still exists")
			})
			return dErr == nil, dErr

Copilot AI Nov 7, 2025

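Chinese comments should be translated to English for consistency with the rest of the codebase. Found comments like "列出所有资源" (list all resources), "无资源需要删除" (no resources to delete), "并发删除:使用WaitGroup和error channel收集错误" (delete concurrently: use a WaitGroup and an error channel to collect errors), "缓冲channel,避免阻塞" (buffered channel to avoid blocking), "等待所有Goroutine完成,并收集错误" (wait for all goroutines to finish and collect errors), "前台删除,等待子资源" (foreground deletion; wait for child resources), "执行删除(针对单个资源)" (perform the deletion for a single resource), "已不存在,无需等待" (already gone; no need to wait), "等待删除完成:轮询Get直到NotFound" (wait for deletion to complete: poll Get until NotFound), "使用retry.Backoff可选重试Get(处理临时错误)" (optionally retry Get with retry.Backoff to handle transient errors), "成功:资源已删除" (success: resource deleted), "其它错误:继续轮询" (other error: keep polling), "资源仍存在:继续轮询" (resource still exists: keep polling), and "根据finalizer复杂度调整" (adjust according to finalizer complexity).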

Copilot uses AI. Check for mistakes.
	gvr schema.GroupVersionResource,
	namespace string,
) error {
	ctx := context.Background()

Copilot AI Nov 7, 2025

Context cancellation is not propagated. The function accepts a ctx parameter but creates a new context.Background() instead of using it. This means any cancellation or timeout from the caller will be ignored, potentially causing operations to run longer than expected.

Comment on lines +2699 to +2703
	dynamicClient dynamic.Interface,
	gvr schema.GroupVersionResource,
	namespace, name string,
) error {
	ctx := context.Background()

Copilot AI Nov 7, 2025

Context cancellation is not propagated. The function creates a new context.Background() instead of accepting and using a context from the caller. This means cancellation or timeouts cannot be properly handled.

Suggested change
-	dynamicClient dynamic.Interface,
-	gvr schema.GroupVersionResource,
-	namespace, name string,
-) error {
-	ctx := context.Background()
+	ctx context.Context,
+	dynamicClient dynamic.Interface,
+	gvr schema.GroupVersionResource,
+	namespace, name string,
+) error {
