Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client auto updates integration for {tctl,tsh} #47815

Merged
merged 16 commits into from
Oct 31, 2024

Conversation

vapopov
Copy link
Contributor

@vapopov vapopov commented Oct 22, 2024

In this PR implemented integration of client auto updates proposed in RFD

changelog: Client tools {tctl,tsh} auto-updates controlled by cluster configuration

client_tools_autoupdate.mov

Copy link

This pull request is automatically being deployed by Amplify Hosting (learn more).

Access this pull request here: https://pr-47815.d3pp5qlev8mo18.amplifyapp.com

@vapopov vapopov force-pushed the vapopov/client-auto-updates-for-tctl-tsh branch from cc3662d to 4ae86ca Compare October 22, 2024 16:09
Fix recursive version check for darwin platform
Fix cleanup for multi-package support
@vapopov vapopov force-pushed the vapopov/client-auto-updates-for-tctl-tsh branch from 4ae86ca to a4f88ec Compare October 23, 2024 21:42
@vapopov vapopov requested review from sclevine and hugoShaka October 23, 2024 21:43
@vapopov vapopov marked this pull request as ready for review October 23, 2024 21:43
@github-actions github-actions bot added size/md tctl tctl - Teleport admin tool tsh tsh - Teleport's command line tool for logging into nodes running Teleport. labels Oct 23, 2024
@vapopov vapopov force-pushed the vapopov/client-auto-updates-for-tctl-tsh branch from 9f6c224 to 9fe5ed7 Compare October 25, 2024 19:42
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made
@vapopov vapopov force-pushed the vapopov/client-auto-updates-for-tctl-tsh branch from 5e27ecf to e113ac7 Compare October 28, 2024 14:37
Comment on lines +382 to +386
// Display a progress bar before initiating the update request to inform the user that
// an update is in progress, allowing them the option to cancel before actual response
// which might be delayed with slow internet connection or complete isolation to CDN.
pw, finish := newProgressWriter(10)
defer finish()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to silence the progress writer if this is a non-interactive terminal? If so, no need to do this in this PR, we can fix later.

Comment on lines +253 to +260
var pkgNames []string
for _, pkg := range packages {
if err := u.update(ctx, pkg); err != nil {
pkgName := fmt.Sprint(uuid.New().String(), updatePackageSuffix)
if err := u.update(ctx, pkg, pkgName); err != nil {
return trace.Wrap(err)
}
pkgNames = append(pkgNames, pkgName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uuid is here to have a temporary package location unique to the process? Then we copy the content and remove it?

Later, we do os.CreateTemp(u.toolsDir), putting the temporary package in the user's home rather than the system's default tmp? Is there any advantage/issue leading us to use the user's home as a temporary directory versus te default temp location?

Copy link
Contributor Author

@vapopov vapopov Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should store original binary in same directory, since this is not temporary files

~/.tsh/bin/tctl    symlink->    ~/.tsh/bin/UUID/tctl
~/.tsh/bin/tsh     symlink->    ~/.tsh/bin/UUID/tsh

misinterpreted you comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is with the archive, not the extracted location:
https://github.com/gravitational/teleport/pull/47815/files#diff-6ddd22bd400fc52c2062de3b1fc775fdaf05594f85c5ffa5a02262af40f5d62dR275

In general, it's unusual to use os.CreateTemp outside of /tmp.

Copy link
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, have you done manual tests on this one yet?

Before backporting, can you add (in another PR) a list of manual test cases we'll want to add to the major test plan, and to execute after backporting the feature? For example:

- [ ] Test tsh updater
  - [ ] no call are made when automatic update version is off
  - [ ] can install a custom version when `TELEPORT_TOOLS_VERSION` is set
  - [ ] does nothing when target version == local version
  - [ ] does nothing is AUs are disabled for the cluster
  - [ ] only looks up new version on `tsh login`
  - [ ] downloads new version just once, then uses the local one
- [ ] Test tctl updater
  - [ ] no call are made when automatic update version is off
  - [ ] can install a custom version when `TELEPORT_TOOLS_VERSION` is set
  - [ ] does nothing when target version == local version
  - [ ] does nothing is AUs are disabled for the cluster
  - [ ] only looks up new version on `tsh login`
  - [ ] downloads new version just once, then uses the local one

@hugoShaka
Copy link
Contributor

Also, i'm very interested in testing the "proxy is not reachable" failure mode. Can we make sure we have proper timeouts to avoid hanging and that we're surfacing a clear and actionable error telling the proxy/CDN is not reachable?

I'm worried that users will see a "failed to automatically update tools" error and think AUs are the issue while they just don't have access to the proxy. (This is a very common teleport issue, we usually throw a bunch of "certificate invalid" errors and users think a certificate is expired/not trusted while they just can't reach the proxy).

@vapopov
Copy link
Contributor Author

vapopov commented Oct 28, 2024

Also, i'm very interested in testing the "proxy is not reachable" failure mode. Can we make sure we have proper timeouts to avoid hanging and that we're surfacing a clear and actionable error telling the proxy/CDN is not reachable?

We have TestUpdateInterruptSignal which verifies that we can cancel update with CDN stopped/slow responding and continue execution.

With unreachable proxy we have timeout in 30 second, which also might be canceled by Ctrl-C as part of update process and continue proper execution of the tsh, tctl

lib/autoupdate/tools/updater.go Show resolved Hide resolved
lib/client/api.go Show resolved Hide resolved
lib/utils/packaging/unarchive.go Outdated Show resolved Hide resolved
@vapopov
Copy link
Contributor Author

vapopov commented Oct 30, 2024

@hugoShaka @ryanclark @doggydogworld I would appreciate your review

@ryanclark
Copy link
Contributor

@hugoShaka @ryanclark @doggydogworld I would appreciate your review

I'm on PTO until next week

@vapopov vapopov requested a review from bernardjkim October 30, 2024 18:14
@hugoShaka
Copy link
Contributor

hugoShaka commented Oct 30, 2024

With unreachable proxy we have timeout in 30 second, which also might be canceled by Ctrl-C as part of update process and continue proper execution of the tsh, tctl

Thanks, that's very useful. Does it make sense to lower this timeout to 10 seconds? 30 second without feedback is very long for a user-facing tool. cc @sclevine

What happens after we hit the timeout (e.g. a firewall rule is silently dropping the traffic)? Do we skip the update check and still try to login, or do we stop and return an error? If it's the latter, can we make sure it's actionable for the user. This will become the first thing people see when they try to login but don't have network connectivity.

Copy link
Contributor

@hugoShaka hugoShaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the remaining active threads are resolved

@vapopov vapopov enabled auto-merge October 31, 2024 16:37
@vapopov
Copy link
Contributor Author

vapopov commented Oct 31, 2024

What happens after we hit the timeout (e.g. a firewall rule is silently dropping the traffic)? Do we skip the update check and still try to login, or do we stop and return an error? If it's the latter, can we make sure it's actionable for the user. This will become the first thing people see when they try to login but don't have network connectivity.

@hugoShaka update with proxy call happens when we do login action, so if proxy not available we return error for update because login also will fail if we skip update. I also changed timeout for 10 seconds for client tools

@vapopov vapopov added this pull request to the merge queue Oct 31, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 31, 2024
@vapopov vapopov added this pull request to the merge queue Oct 31, 2024
Merged via the queue into master with commit 9d2b5c5 Oct 31, 2024
42 checks passed
@vapopov vapopov deleted the vapopov/client-auto-updates-for-tctl-tsh branch October 31, 2024 17:28
vapopov added a commit that referenced this pull request Nov 7, 2024
* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call
vapopov added a commit that referenced this pull request Nov 8, 2024
* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call
vapopov added a commit that referenced this pull request Nov 8, 2024
* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call
github-merge-queue bot pushed a commit that referenced this pull request Nov 11, 2024
* Client auto updates integration for {tctl,tsh} (#47815)

* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call

* Add audit logs for auto update resources (#48218)
// At process startup, check if a version has already been downloaded to
// $TELEPORT_HOME/bin or if the user has set the TELEPORT_TOOLS_VERSION
// environment variable. If so, re-exec that version of {tsh, tctl}.
toolsDir, err := tools.Dir()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This broke VNet in v17.0.1. VNet has a launch daemon on macOS which starts tsh vnet-daemon in an environment where $HOME doesn't exist.

Before 9d2b5c5, it was possible to execute tsh version with no env vars:

env -i "$(which tsh)" version

What do you reckon would be the best way forward? I don't want to set some dummy value as $HOME because I don't want to interfere with how launch daemons on macOS behave. Setting TELEPORT_TOOLS_VERSION=off is not enough as tools.Dir() is read before we check that env var.

Copy link
Member

@ravicious ravicious Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to explicitly provide TELEPORT_HOME with a dummy dir so that tools.Dir() doesn't need to read HOME, but that doesn't work either. Not because of autoupdate, but because I think when you explicitly provide TELEPORT_HOME, tsh later tries to read the config file which it cannot in this case.

~: env -i TELEPORT_HOME=/dev/null TELEPORT_TOOLS_VERSION=off "$(which tsh)" version
ERROR: failed to load tsh config from "/dev/null/config/config.yaml"
        failed to execute command /dev/null/config/config.yaml error:  not a directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravicious thanks for reporting this one, will prepare fix this one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, keep me updated. I'll update the plist for the launch daemon with an appropriate env vars once ready.

// At process startup, check if a version has already been downloaded to
// $TELEPORT_HOME/bin or if the user has set the TELEPORT_TOOLS_VERSION
// environment variable. If so, re-exec that version of {tsh, tctl}.
toolsDir, err := tools.Dir()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the RFD says that Connect is going to call tsh with TELEPORT_TOOLS_VERSION=off, but from what I see this hasn't been implemented – grepping for TELEPORT_TOOLS_VERSION doesn't bring up anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed by #49180

github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2024
* Expose client tools auto update for find endpoint (#46785)

* Expose client tools auto update for find endpoint

* Group auto update settings in find response
Log error instead returning error
Add tests auto update settings in find endpoint
Add check for not implemented error

* Add more test cases

* Client AutoUpdate proto structure changes (#47532)

* Update client autoupdate proto structure

* Replace with reserved

* Fix unit tests

* Add more info in proto

* Rename proto to be aligned RFD namings

* Replace enum type for ToolsMode to string

* Add packaging utility for client tools auto updates (#47060)

* Add packaging utility for client tools auto updates

* Add error handling for close functions

* Move archive to existing utils package

* Move archive helpers to integration/helper
CR changes

* CR changes

* CR changes

* CR changes
Replace creating directory with extract path as argument

* CR changes

* Validate full size before un-archive
Extract files to extractDir with ignore dir structure

* Change compressing with relative paths
Add test for cleanup and fix skip logic

* CR changes

* CR changes

* Fix linter

* Client tools auto update (#47466)

* Add client tools auto update

* Replace fork for posix platform for re-exec
Move integration tests to client tools specific dir
Use context cancellation with SIGTERM, SIGINT
Remove cancelable tee reader with context replacement
Renaming

* Fix syscall path execution
Fix archive cleanup if hash is not valid
Limit the archive write bytes

* Cover the case with single package for darwin platform after v17

* Move updater logic to tools package

* Move context out from the library
Base URL renaming

* Add more context in comments

* Changes in find endpoint

* Replace test http server with `httptest`
Replace hash for bytes matching
Proper temp file close for archive download

* Add more context to comments

* Move feature flag to main package to be reused

* Constant rename

* Replace build tag with lib/modules to identify enterprise build

* Replace fips tag with modules flag

* Client auto updates integration for {tctl,tsh} (#47815)

* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call

* Add audit logs for auto update resources (#48218)

* Connect: Make sure tsh auto-updates are turned off (#49180)

* Add dir for code shared between Node.js processes

* Connect: Make sure tsh auto-updates are turned off

* Pass TELEPORT_TOOLS_VERSION=off to tsh vnet-daemon

* Disable client tools auto update disabled if there are no home dir (#49159)

Move updater to general tools package

* Move client auto update helper to lib package (#49247)

---------

Co-authored-by: Rafał Cieślak <rafal.cieslak@goteleport.com>
github-merge-queue bot pushed a commit that referenced this pull request Dec 11, 2024
* Expose client tools auto update for find endpoint (#46785)

* Expose client tools auto update for find endpoint

* Group auto update settings in find response
Log error instead returning error
Add tests auto update settings in find endpoint
Add check for not implemented error

* Add more test cases

* Client AutoUpdate proto structure changes (#47532)

* Update client autoupdate proto structure

* Replace with reserved

* Fix unit tests

* Add more info in proto

* Rename proto to be aligned RFD namings

* Replace enum type for ToolsMode to string

* Add packaging utility for client tools auto updates (#47060)

* Add packaging utility for client tools auto updates

* Add error handling for close functions

* Move archive to existing utils package

* Move archive helpers to integration/helper
CR changes

* CR changes

* CR changes

* CR changes
Replace creating directory with extract path as argument

* CR changes

* Validate full size before un-archive
Extract files to extractDir with ignore dir structure

* Change compressing with relative paths
Add test for cleanup and fix skip logic

* CR changes

* CR changes

* Fix linter

* Client tools auto update (#47466)

* Add client tools auto update

* Replace fork for posix platform for re-exec
Move integration tests to client tools specific dir
Use context cancellation with SIGTERM, SIGINT
Remove cancelable tee reader with context replacement
Renaming

* Fix syscall path execution
Fix archive cleanup if hash is not valid
Limit the archive write bytes

* Cover the case with single package for darwin platform after v17

* Move updater logic to tools package

* Move context out from the library
Base URL renaming

* Add more context in comments

* Changes in find endpoint

* Replace test http server with `httptest`
Replace hash for bytes matching
Proper temp file close for archive download

* Add more context to comments

* Move feature flag to main package to be reused

* Constant rename

* Replace build tag with lib/modules to identify enterprise build

* Replace fips tag with modules flag

* Client auto updates integration for {tctl,tsh} (#47815)

* Client auto updates integration for tctl/tsh

* Add version validation
Fix recursive version check for darwin platform
Fix cleanup for multi-package support

* Fix identifying tools removal from home directory

* Replace ToolsMode with ToolsAutoUpdate

* Reuse insecure flag for tests

* Fix CheckRemote with login

* Fix windows administrative access requirement
Update must be able to be canceled, re-execute with latest version or last updated
Show progress bar before request is made

* Fix update cancellation for login action
Address review comments

* Add signal handler with stack context cancellation

* Use copy instead of hard link for windows
Fix progress bar if we can't receive size of package

* Replace with list in order to support manual cancel

* Download archive package to temp directory

* Decrease timeout for client tools proxy call

* Add audit logs for auto update resources (#48218)

* Connect: Make sure tsh auto-updates are turned off

* Add dir for code shared between Node.js processes

* Connect: Make sure tsh auto-updates are turned off

* Pass TELEPORT_TOOLS_VERSION=off to tsh vnet-daemon

* Disable client tools auto update disabled if there are no home dir (#49159)

Move updater to general tools package

* Move client auto update helper to lib package (#49247)

---------

Co-authored-by: Rafał Cieślak <rafal.cieslak@goteleport.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/md tctl tctl - Teleport admin tool tsh tsh - Teleport's command line tool for logging into nodes running Teleport.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants