@okraport
Contributor

@okraport okraport commented Nov 12, 2025

This change allows the agent dialer to use cached results for hostCheckerFunc. In the case of a transient gRPC network failure, this allows the reverse tunnel to be re-established before the gRPC timeout elapses.
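
To illustrate the idea, here is a minimal sketch of a host checker whose CA lookup can be backed either by a direct call to auth or by a local cache. It is not the code in this PR: `caGetter`, `fakeSource`, `newHostChecker`, and `trusted` are hypothetical names, and the key comparison is simplified to string equality.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// caGetter is a hypothetical interface for looking up the host CA keys that
// a host checker trusts. It is an assumption, not Teleport's actual API.
type caGetter interface {
	HostCAKeys(ctx context.Context) ([]string, error)
}

// fakeSource simulates either the auth server (reached over gRPC) or the
// local cache. When down is true, every lookup fails.
type fakeSource struct {
	keys []string
	down bool
}

func (f fakeSource) HostCAKeys(ctx context.Context) ([]string, error) {
	if f.down {
		return nil, errors.New("source unavailable")
	}
	return f.keys, nil
}

// newHostChecker builds a checker that accepts a presented CA key only if
// the given source lists it as trusted.
func newHostChecker(src caGetter) func(ctx context.Context, presented string) error {
	return func(ctx context.Context, presented string) error {
		keys, err := src.HostCAKeys(ctx)
		if err != nil {
			return fmt.Errorf("fetching host CAs: %w", err)
		}
		for _, k := range keys {
			if k == presented {
				return nil
			}
		}
		return errors.New("host key not signed by a trusted CA")
	}
}

// trusted reports whether the checker built on src accepts the presented key.
func trusted(src caGetter, presented string) bool {
	return newHostChecker(src)(context.Background(), presented) == nil
}

func main() {
	auth := fakeSource{down: true}              // direct path during the outage
	cache := fakeSource{keys: []string{"ca-1"}} // cache populated before the outage
	coldCache := fakeSource{down: true}         // cache that never initialized

	fmt.Println(trusted(auth, "ca-1"))      // false
	fmt.Println(trusted(cache, "ca-1"))     // true
	fmt.Println(trusted(coldCache, "ca-1")) // false
}
```

With a populated cache the check succeeds even while the direct path is failing. A cache that never initialized gives no benefit, so this is an improvement for transient failures rather than a guarantee.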

Test plan:

  • Spawn two Linux VMs, one for auth/proxy and one for node. Simulate a network change by toggling one of the two interfaces on the node, then measure the time until the reverse tunnel is functional again.

In local testing across 5 runs, this brings the average recovery time down from 3m45s to 2m39s.

Changelog: Improve reverse tunnel dialing recovery from default route changes by 1min on average.

@okraport
Contributor Author

run_until_success.sh
If anyone would like to reproduce this, I have thrown together this ugly bash script that runs tsh ssh with a timeout and monitors the interfaces of the target VM to help with testing.
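
For readers without the script, the retry-and-measure loop it describes can be sketched roughly as follows. This is an assumption about the approach, not a port of run_until_success.sh: `runUntilSuccess` and `simulate` are hypothetical names, and the attempt function is a stand-in for a real `tsh ssh` invocation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runUntilSuccess calls attempt repeatedly, giving each call its own timeout,
// until one call succeeds or the parent context is done. It returns the
// number of attempts made and the total elapsed time.
func runUntilSuccess(
	ctx context.Context,
	perAttempt, pause time.Duration,
	attempt func(context.Context) error,
) (int, time.Duration, error) {
	start := time.Now()
	for n := 1; ; n++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err := attempt(attemptCtx)
		cancel()
		if err == nil {
			return n, time.Since(start), nil
		}
		// Wait before retrying, unless the caller gave up in the meantime.
		select {
		case <-ctx.Done():
			return n, time.Since(start), ctx.Err()
		case <-time.After(pause):
		}
	}
}

// simulate stands in for the real connection attempt: it fails the given
// number of times and then succeeds, and reports how many attempts ran.
func simulate(failures int) (int, error) {
	calls := 0
	attempt := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("tunnel not ready")
		}
		return nil
	}
	n, _, err := runUntilSuccess(context.Background(), time.Second, time.Millisecond, attempt)
	return n, err
}

func main() {
	n, err := simulate(2)
	fmt.Println("attempts:", n, "err:", err)
}
```

In a real run the attempt function would wrap the SSH command, and the elapsed time returned by `runUntilSuccess` would be the recovery time being measured.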

@okraport
Contributor Author

Note that automating this test is challenging, and I believe this should be tackled separately.

@okraport okraport marked this pull request as ready for review November 12, 2025 09:48
@github-actions github-actions bot requested a review from cthach November 12, 2025 09:49
@rosstimothy
Contributor

@okraport should this be backported to v17 as well?

Contributor

@espadolini espadolini left a comment

Keep in mind that this is not a guarantee that we can connect: if the cache is also unhealthy or has failed to initialize, and connectivity to auth is down, we'll still error out when getting CAs.

@public-teleport-github-review-bot public-teleport-github-review-bot bot removed the request for review from cthach November 12, 2025 18:02
@okraport
Contributor Author

> @okraport should this be backported to v17 as well?

Yes, thank you for pointing that out. Added the label.

@okraport okraport force-pushed the okraport/reversetunnel-dialer-use-cache branch from c3ffc83 to 49a1c52 Compare November 13, 2025 09:44
@okraport okraport enabled auto-merge November 13, 2025 10:02
@okraport okraport added this pull request to the merge queue Nov 13, 2025
Merged via the queue into master with commit 090e12c Nov 13, 2025
42 checks passed
@okraport okraport deleted the okraport/reversetunnel-dialer-use-cache branch November 13, 2025 10:27
@backport-bot-workflows
Contributor

@okraport See the table below for backport results.

Branch Result
branch/v17 Create PR
branch/v18 Create PR

5 participants