Latest version of stembuild construct hangs forever #27

Closed
sneal opened this issue Feb 21, 2024 · 7 comments
@sneal
Contributor

sneal commented Feb 21, 2024

The recent dependency updates pulled in this winrm client library update which changes winrm command "timeout error" handling.

Old working stembuild logs:

Created directory for audit policy
Copied C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
to C:\Windows\system32\GroupPolicy\Machine\Microsoft\Windows NT\Audit\audit.csv
Clearing existing audit policy
Apply Audit policy from C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
Apply security template: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\SecEdit\GptTmpl.inf
Import Machine settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\registry.pol
Import User settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\User\registry.pol

winrm connection event: unknown error Post "http://10.220.41.242:5985/wsman": read tcp 10.220.41.9:34002->10.220.41.242:5985: read: connection timed out

Finished executing setup script 2 of 2.
2023-12-14T14:35:56.530062+00:00 Still preparing VM...
VM has now been shutdown. Run `stembuild package` to finish building the stemcell.
Waiting for stemcell disk packaging and shutdown
Pass 1:   Power state:  poweredOff
Waiting 30 seconds and will check then...
Pass 2:   Power state:  poweredOff
Waiting 30 seconds and will check then...
Waiting loop for stemcell shutdown complete.

Latest hanging stembuild logs:

Created directory for audit policy
Copied C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
to C:\Windows\system32\GroupPolicy\Machine\Microsoft\Windows NT\Audit\audit.csv
Clearing existing audit policy
Apply Audit policy from C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
Apply security template: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\SecEdit\GptTmpl.inf
Import Machine settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\registry.pol
Import User settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\User\registry.pol

What's happening is that "script 2 of 2" calls sysprep at the end, which shuts down the VM and causes a winrm connection timeout. The winrm client library now retries that timeout forever (new behavior). Previously it would get an unknown error Post "http://10.220.41.242:5985/wsman": dial tcp 10.220.41.242:5985: i/o timeout and "finish the command", allowing stembuild to proceed.

@ZhudongVm

ZhudongVm commented Mar 12, 2024

We found that if the Windows host and the stembuild utility are in the same subnet, construct passes with "no route to host" after waiting almost 5 minutes.

Created directory for audit policy
Copied C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
to C:\Windows\system32\GroupPolicy\Machine\Microsoft\Windows NT\Audit\audit.csv
Clearing existing audit policy
Apply Audit policy from C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\Audit\audit.csv
Apply security template: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\microsoft\windows nt\SecEdit\GptTmpl.inf
Import Machine settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\Machine\registry.pol
Import User settings from registry.pol: C:\Program Files\WindowsPowerShell\Modules\BOSH.Sysprep\cis-merge-2019\DomainSysvol\GPO\User\registry.pol

winrm connection event: unknown error Post "http://192.168.111.66:5985/wsman": dial tcp 192.168.111.66:5985: connect: no route to host

Finished executing setup script 2 of 2.
2024-03-07T01:28:48.140129-08:00 Still preparing VM...
VM has now been shutdown. Run stembuild package to finish building the stemcell.

@jpalermo
Member

Yeah, I don't see a change in our nightly builds. Both before the change (~November) and now, there's a 4-5 minute hang in there, but then it continues.

Our CI worker and the VM we're running stembuild against do both run in the same subnet though.

@sneal is the subnet thing possibly why you're experiencing this and we're not?

@ZhudongVm

ZhudongVm commented Mar 19, 2024

Hi @jpalermo, one customer encountered this issue because they always deploy the stembuild utility and the Windows VM in different subnets. They could construct stemcells successfully before this PR was merged.
Afterwards, comparing with our environment, we found this difference. The customer then changed the IP so both were in the same subnet, and stemcell construction succeeded.
So we believe this PR changes how the winrm command execution exits, which then causes the hang on the customer side.
Would it be possible to add a forced exit after 5-10 minutes (a suggested value) to avoid hanging forever at that step?

@sneal
Contributor Author

sneal commented Mar 19, 2024

@jpalermo You will see different results depending on whether you're on the same subnet or not.

The winrm library we're using now retries on a connection timeout forever and only returns once the command has executed to completion. In our case since we execute a command that shuts the VM down, it'll never complete the command (from the winrm library perspective). Golang will return a different error to the winrm library (timeout vs no route to host) depending upon if stembuild is run in the same or different subnets.

While the winrm library should have some sort of timeout and give clients more control over retry behavior, we really just need to execute the final command (sysprep & shutdown) while simultaneously polling the VM for shutdown via some mechanism. Once shutdown is detected we can continue.

@jpalermo
Member

Opened a PR to provide a config flag to disable the new behavior.

jpalermo added a commit that referenced this issue May 6, 2024
…g forever.

Issue: #27

This fork adds an option to roll back the "automatic retry on timeout" behavior
that was added to winrm
sneal added a commit to sneal/stembuild that referenced this issue Jun 25, 2024
Add rollback of "automatic retry on timeout" in winrm when executing the post reboot script. This should fix the hang reported during sysprep.
@jpalermo
Member

@sneal, is this all fixed now?

@jpalermo
Member

Sounds like we believe it's fixed for now, but anybody should feel free to reopen the issue if you still see the problem.

@github-project-automation github-project-automation bot moved this from Pending Review | Discussion to Done in Foundational Infrastructure Working Group Sep 11, 2024