Minimum Cluster Size = 0 will not spawn any agent #411
Comments
I wasn't able to reproduce this bug with v3.0.1 plugin and Jenkins version 2.401.1. Do you have a job scheduled with Note that the default value of If these are not applicable/you already tried them, can you include the logs emitted by Jenkins? |
It looks like 3.0.1 creates new nodes when jobs first get in the queue, but then the jobs don't actually execute on the new nodes and only queue on the one existing node. |
@amainwaring It is possible that there are some differences in our setups that create the conditions for this bug. Can you share your CasC config from |
I can confirm the bug reported by the topic starter.
Also, with the following config:
I have 3 fleets for 3 jobs, and a 4th job stays in the queue with state Plugin Version - 3.0.1 PS: Reverting the plugin to version 3.0.0 fixes this problem. |
We've noticed something similar. It seems that the NodeProvisioner thinks there is sufficient available planned capacity for the fleet and that it doesn't need to do anything (I don't really know where that planned capacity number comes from):
We don't have CasC set up on our Jenkins instance yet, so can't export the configuration as code but our setup is similar to other reports:
|
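To make the "sufficient planned capacity" observation above concrete, here is a deliberately simplified sketch of the kind of check a provisioner might perform. It is not the Jenkins NodeProvisioner implementation or the plugin's code; the function name `nodes_to_provision`, its parameters, and the formula are assumptions made purely for illustration. It only shows how a planned-capacity figure that covers the queue would result in no new nodes being requested, even though jobs are waiting.

```python
def nodes_to_provision(queue_length: int, idle_executors: int,
                       planned_capacity: int, max_size: int,
                       current_size: int, executors_per_node: int = 1) -> int:
    """Hypothetical helper: how many nodes a simple provisioner would request.

    This is an illustrative model only, not the real Jenkins algorithm.
    """
    # Demand that is not covered by idle executors or by capacity the
    # provisioner believes is already on its way ("planned").
    excess = queue_length - idle_executors - planned_capacity
    if excess <= 0:
        return 0  # the provisioner sees nothing to do
    # Round up to whole nodes, then cap at the fleet's remaining headroom.
    wanted = -(-excess // executors_per_node)
    headroom = max(max_size - current_size, 0)
    return min(wanted, headroom)


# Fleet with maximum size 1 and no running nodes, one queued job.
print(nodes_to_provision(queue_length=1, idle_executors=0,
                         planned_capacity=0, max_size=1, current_size=0))
# Same situation, but one executor is already counted as planned.
print(nodes_to_provision(queue_length=1, idle_executors=0,
                         planned_capacity=1, max_size=1, current_size=0))
```

In this model the first call requests one node and the second requests none, which matches the symptom described in this thread only under the assumption that the planned capacity figure is counted as covering the queue.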
@willthames you can probably just go to https://jenkins/configuration-as-code/ to export it. I just tried to export mine, though, and got the following error, which may or may not be related to this issue:
edit: ^^ might be because this is in the lab, I've reverted the plugin to 3.0.0 but haven't restarted yet. I tried from a production Jenkins that I've rolled back to 3.0.0 and restarted and got the following:
|
@amainwaring If this is the cause of the issue, I will see if there is a workaround, as neither of these is preferred behavior. If I still cannot recreate it by starting with plugin version 3.0.0 and upgrading to 3.0.1, is there a way for you to see if |
Hi @vineeth-bandi! Thanks for taking a look at this! I definitely restarted Jenkins after updating from 3.0.0 to 3.0.1, but when I downgraded our lab environment there were a lot of jobs running and I haven't had a chance to restart it yet. Prod has a bit of a different workload so I was able to properly restart those after downgrading the plugin. When I was using 3.0.1 I tried updating the fleet config, changing min/max values, and restarting Jenkins. I did notice one time I updated the fleet to have min 10 and max 24, and it did scale out properly to 24 nodes. But after it scaled back down to 1 it didn't scale back up again properly. We reverted to 3.0.0 pretty quickly after that though so didn't have a lot of time to investigate much. |
Hi, I'm also hit by this bug, which went away with a downgrade from 3.0.1 to 3.0.0, with very similar conditions:
|
@amainwaring Thanks for the details. Can you share logs from the scenario you described? If you can generate logs with the logger configuration in the screenshot, that will be very helpful: |
@pdk27 I'm really sorry, I've reverted everything back to 3.0.0 so we were able to scale again. Generally we are scaling up Jenkins at night, so when this has happened I've been engaged after-hours and was primarily more concerned with making sure we had enough workers rather than gathering logs. |
@amainwaring Makes sense. @bzoks @willthames @carpool-michael @kt315ua Can you please share your logs with the logger configuration above? It will help us troubleshoot the issue as we are unable to reproduce it. |
I upgraded again, activated logging,... but the issue did NOT happen again. Tried also restarts, multiple runs,... no way to reproduce anymore. |
Interesting details! Thanks for sharing @bzoks. |
I have the same issue with Jenkins 2.414.2 and plugin 3.0.1 (Linux controller, Windows agent). |
I had the same experience as bzoks. Downgrading to 3.0.0 fixed the problem. Then, when I upgraded back to 3.0.1, the issue was gone. |
I had the same issue with Jenkins version 2.414.2. Downgrading to 3.0.0 fixed the problem. Setup: AWS Linux master and Spot cloud agent on Linux. |
We are also experiencing the issue and, as recommended by others, downgrading to 3.0.0 fixed the problem. The logs don't say anything special except for:
These logs are from a period when all available agents were busy and the queue was still large. In situations where no nodes were online, jobs could be stuck in the queue for hours. This is mostly visible for us at night, when it is mostly scheduled jobs that are starting. |
We have decided to revert the changes that were part of this previous release as we were unable to reproduce and fully evaluate why these issues were appearing. Issue #417 will be tracking these changes if we decide to reintroduce them. Feel free to move discussion to that issue, or reopen this issue if reverting these changes by updating to the newest release (version 3.0.2) still causes issues. |
Issue Details
Describe the bug
Since v3.0.1, with the following configuration the plugin won't start an agent:
Max Idle Minutes Before Scaledown = 1
Minimum Cluster Size = 0
Maximum Cluster Size = 1
If I roll back to v3.0.0, it works again as expected.
Environment Details
Plugin Version?
3.0.1
Jenkins Version?
2.401.1
Spot Fleet or ASG?
ASG
Label based fleet?
Yes
Linux or Windows?
Linux Controller
Windows Agents
EC2Fleet Configuration as Code
```
clouds:
  addNodeOnlyIfRunning: false
  alwaysReconnect: true
  awsCredentialsId: ***
  cloudStatusIntervalSec: 10
  computerConnector:
    sSHConnector:
      credentialsId: ***
      javaPath: "D:\Jenkins\jdk-11\bin\java.exe"
      launchTimeoutSeconds: 60
      maxNumRetries: 10
      port: 22
      prefixStartSlaveCmd: "cd /d D:\ &&"
      retryWaitTime: 15
      sshHostKeyVerificationStrategy:
        manuallyTrustedKeyVerificationStrategy:
          requireInitialManualTrust: false
  disableTaskResubmit: false
  fleet: "JenkinsWindowsAgents"
  fsRoot: "D:\Jenkins"
  idleMinutes: 1
  initOnlineCheckIntervalSec: 15
  initOnlineTimeoutSec: 180
  labelString: "windows"
  maxSize: 1
  maxTotalUses: -1
  minSize: 0
  minSpareSize: 0
  name: "Windows Fleet"
  noDelayProvision: false
  numExecutors: 1
  privateIpUsed: true
  region: "eu-west-3"
  restrictUsage: true
  scaleExecutorsByWeight: false
```
Anything else unique about your setup?
No