Fix race when creating secrets and Kubernetes jobs #4319
When creating secrets and a Kubernetes Job that uses those secrets in volumes, it is possible that the job is created before the secret has been synchronized. If the job required the content of the secret to be mounted, the file would be missing.
Also fixes usage of `controllerutil.CreateOrUpdate`.
Possibly related to kubernetes-sigs/secrets-store-csi-driver#1051.
Additional Information
I did not test the annotation mechanism. It is only used for updates, which only happen if the reconciler could not finish creating the job after the secret was created and therefore has to reconcile again. That case should be fairly rare, but since it can happen, I included the mechanism to handle it. In my tests I never hit a case where the secret could be retrieved but differed from the one that was created or updated.
On the other hand, I hit a case where the secret could not be fetched right after its creation in roughly every 10th attempt.
The tests that seem to be most affected by this change are the infra setup tests for git, helm and OCI. I noticed that because they all failed while I was still using `retry.RetryOnConflict`, which is not suitable for this case: it requires an error of type `Conflict` to trigger the back-off, and any other error ends the retry attempts. I have replaced it with `retry.OnError`.
Checklist
fleet-docs repository.