@trashhalo

Summary

This PR introduces the CancelHealthCheckOnNewRevision feature flag to prevent kustomize-controller from getting stuck waiting for health check timeouts when new source revisions containing potential fixes are available.

Problem

Currently, when a Kustomization fails health checks (e.g., due to a bad deployment), the controller waits for the full timeout duration (typically 30 seconds) before processing any new revisions. This means that even if a fix is pushed immediately after the failing commit, users must wait for the full timeout before the fix is applied.

Solution

  • New opt-in feature flag: CancelHealthCheckOnNewRevision (default: false)
  • Revision monitoring: During health checks, monitor for new source revisions every 5 seconds
  • Early cancellation: Cancel ongoing health checks when new revisions are detected
  • Immediate processing: Process new revisions immediately after cancellation instead of waiting for timeout

Behavior Change

Before (existing behavior):

[Bad commit] → Health check starts → Wait 30s → Timeout → Process new commit
Total time: ~30+ seconds

After (with feature enabled):

[Bad commit] → Health check starts → [New commit arrives] → Cancel health check (~5s) → Process new commit immediately  
Total time: ~5-10 seconds

Implementation Details

  1. Feature flag integration: Uses the existing feature gate system
  2. Context-based cancellation: Creates cancellable contexts for health checks
  3. Background monitoring: Goroutine monitors source revisions during health checks
  4. Graceful cancellation: Proper cleanup and error handling when cancelling
  5. Backward compatibility: Preserves existing behavior when feature is disabled

Testing

Comprehensive test coverage includes:

  1. TestKustomizationReconciler_CancelHealthCheckOnNewRevision:

    • Verifies feature works when enabled
    • Demonstrates ~5 second cancellation vs 30 second timeout
    • Confirms immediate processing of new revisions
  2. TestKustomizationReconciler_NoHealthCheckCancellation_WhenFeatureDisabled:

    • Verifies original behavior when feature is disabled
    • Demonstrates full 30 second timeout behavior
    • Ensures backward compatibility

Usage

Enable the feature by starting kustomize-controller with:

--feature-gates=CancelHealthCheckOnNewRevision=true
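To illustrate how a `Name=bool` value like the one above can be turned into a default-off gate, here is a small standard-library sketch. It is an assumption-laden illustration only: `parseFeatureGates` and `defaultGates` are hypothetical helpers, not the feature gate package the controller actually uses.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CancelHealthCheckOnNewRevision is the gate name from this PR.
const CancelHealthCheckOnNewRevision = "CancelHealthCheckOnNewRevision"

// defaultGates returns the known gates with their defaults (hypothetical helper).
// The new gate is opt-in, so it defaults to false.
func defaultGates() map[string]bool {
	return map[string]bool{CancelHealthCheckOnNewRevision: false}
}

// parseFeatureGates parses a comma-separated list of Name=bool pairs,
// starting from the defaults and rejecting unknown names or bad values.
func parseFeatureGates(value string) (map[string]bool, error) {
	gates := defaultGates()
	if strings.TrimSpace(value) == "" {
		return gates, nil
	}
	for _, pair := range strings.Split(value, ",") {
		name, raw, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return nil, fmt.Errorf("malformed feature gate %q, want Name=bool", pair)
		}
		if _, known := gates[name]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid value for feature gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}
```

With this shape, omitting the flag leaves the gate at `false`, which is what keeps the existing timeout behavior unchanged for users who do not opt in.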

Benefits

  • Faster recovery: 6x faster processing of fixes (5s vs 30s)
  • Improved user experience: Reduced delay between pushing fixes and seeing them applied
  • Production ready: Opt-in feature with comprehensive testing
  • Safe: Preserves existing behavior when disabled

🤖 Generated with Claude Code

@trashhalo force-pushed the feat/cancel-health-check-on-new-revision branch from c1db411 to d2bf2fc on September 24, 2025 16:48
Member

@stefanprodan left a comment

Thanks for this PR, @trashhalo. Can you please sign off your commit and address my comments?

@trashhalo
Author

@stefanprodan thanks for taking a pass! I'll get this all cleaned up based on your feedback over the next day or so.

@stefanprodan
Member

@trashhalo we are releasing kustomize-controller today, so if we want this to be part of Flux 2.7, this PR needs to be merged today. Otherwise this feature will have to wait until next year, when we'll release Flux 2.8.

…ck on failing commits

This feature allows health checks to be cancelled when a new source revision
becomes available, preventing the controller from getting stuck waiting for
full timeout durations when fixes are already available.

Features:
- New opt-in feature flag: CancelHealthCheckOnNewRevision (default: false)
- Health checks are cancelled early when new revisions are detected (~5s vs 5min timeout)
- Uses the new WaitForSetWithContext method for clean context-based cancellation
- Preserves existing behavior when feature is disabled

The implementation monitors source revisions during health checks and cancels
ongoing checks when new revisions are available, allowing immediate processing
of potential fixes instead of waiting for full timeout periods.

Signed-off-by: Stephen Solka <[email protected]>
@trashhalo force-pushed the feat/cancel-health-check-on-new-revision branch from d2bf2fc to ecfdfea on September 25, 2025 11:43
@stefanprodan
Member

OK, we've had enough of Claude Code. Reviewing this only for you to feed the review into an AI and push the result here takes too much of our time.

@matheuscscp is going to rewrite this PR, you can stop pushing changes.

@matheuscscp
Member

Superseded by #1520

@trashhalo
Author

> OK, we've had enough of Claude Code. Reviewing this only for you to feed the review into an AI and push the result here takes too much of our time.
>
> @matheuscscp is going to rewrite this PR, you can stop pushing changes.

I'm sorry.
