
Minimum Cluster Size = 0 will not spawn any agent #411

Closed
carpool-michael opened this issue Oct 5, 2023 · 20 comments

@carpool-michael

Issue Details

Describe the bug
Since v3.0.1, with the following configuration the plugin won't start an agent:
Max Idle Minutes Before Scaledown = 1
Minimum Cluster Size = 0
Maximum Cluster Size = 1

If I roll back to v3.0.0, it works again as expected.

Environment Details

Plugin Version?
3.0.1

Jenkins Version?
2.401.1

Spot Fleet or ASG?
ASG

Label based fleet?
Yes

Linux or Windows?
Linux Controller
Windows Agents

EC2Fleet Configuration as Code
```yaml
clouds:
  - eC2Fleet:
      addNodeOnlyIfRunning: false
      alwaysReconnect: true
      awsCredentialsId: ***
      cloudStatusIntervalSec: 10
      computerConnector:
        sSHConnector:
          credentialsId: ***
          javaPath: "D:\Jenkins\jdk-11\bin\java.exe"
          launchTimeoutSeconds: 60
          maxNumRetries: 10
          port: 22
          prefixStartSlaveCmd: "cd /d D:\ &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy:
            manuallyTrustedKeyVerificationStrategy:
              requireInitialManualTrust: false
      disableTaskResubmit: false
      fleet: "JenkinsWindowsAgents"
      fsRoot: "D:\Jenkins"
      idleMinutes: 1
      initOnlineCheckIntervalSec: 15
      initOnlineTimeoutSec: 180
      labelString: "windows"
      maxSize: 1
      maxTotalUses: -1
      minSize: 0
      minSpareSize: 0
      name: "Windows Fleet"
      noDelayProvision: false
      numExecutors: 1
      privateIpUsed: true
      region: "eu-west-3"
      restrictUsage: true
      scaleExecutorsByWeight: false
```

Anything else unique about your setup?
No

vineeth-bandi self-assigned this Oct 5, 2023
@vineeth-bandi (Collaborator)

I wasn't able to reproduce this bug with the v3.0.1 plugin and Jenkins version 2.401.1. Do you have a job scheduled with windows set as the label?

Note that the default value of minSpareSize is 0, which means no nodes are provisioned unless a job with a matching label is actively running or in the queue. Alternatively, if restrictUsage is set to false, then any job can run on this cloud.
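For example, if you want one agent kept warm even when nothing is queued, the relevant keys in your cloud block would look something like this (illustrative values only):

```yaml
clouds:
  - eC2Fleet:
      # ...rest of your existing configuration...
      minSize: 0
      minSpareSize: 1      # provision one spare agent even with an empty queue
      restrictUsage: true  # or false, if unlabeled jobs should also run on this cloud
```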

If these don't apply or you've already tried them, can you include the logs emitted by Jenkins?

@amainwaring

We are also having scaling issues, but our min size is 1. We have restrictUsage set so only jobs that match the specific labels can run on each ASG. One ASG has minSpareSize set to 1 and the other has it set to 0, but neither can scale. I see messages like the following in the logs even though there are a large number of items waiting in the queue. If I hover over the queued jobs they are all waiting on the one specific worker, but if I revert the plugin to 3.0.0 and hover over a job I can see that it's waiting on other nodes in the fleet rather than just the single instance.
[screenshot]

Oct 04, 2023 11:30:09 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:09 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows||windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:19 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:19 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows||windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:24 PM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy postJobAction

@amainwaring

It looks like 3.0.1 creates new nodes when jobs first get in the queue, but then the jobs don't actually execute on the new nodes and only queue on the one existing node.

@vineeth-bandi (Collaborator)

@amainwaring
I still was not able to reproduce this bug. I created two clouds, cloud1 with minSize = 0 and minSpareSize = 0 and cloud2 with minSize = 0 and minSpareSize = 1, each with a different ASG and label. I then created a job that runs with label1 while only cloud2 has a node running (for label2). I observed a new instance being created and a new node being added after Jenkins successfully connected to the new instance, and the job ran successfully on the label1 node managed by cloud1. A rough sketch of that setup is below.
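For reference, that repro setup looks roughly like this in CasC form (the ASG names, region, and maxSize values here are placeholders, not an exact export):

```yaml
clouds:
  - eC2Fleet:
      name: "cloud1"
      fleet: "asg-for-label1"      # placeholder ASG name
      labelString: "label1"
      region: "us-east-1"          # placeholder region
      minSize: 0
      minSpareSize: 0
      maxSize: 1                   # placeholder; any value > 0
      numExecutors: 1
      restrictUsage: true
  - eC2Fleet:
      name: "cloud2"
      fleet: "asg-for-label2"      # placeholder ASG name
      labelString: "label2"
      region: "us-east-1"
      minSize: 0
      minSpareSize: 1              # keeps one spare node running for label2
      maxSize: 1
      numExecutors: 1
      restrictUsage: true
```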

It is possible that there are some differences in our setups that create the conditions for this bug. Can you share your CasC config from Manage Jenkins -> Configuration as Code -> View Configuration (redacting information as needed), as well as the logs from at least one offline node (Manage Jenkins -> Nodes and Clouds -> SpecificNode -> Log)? Thanks.

@kt315ua

kt315ua commented Oct 9, 2023

I can confirm the original poster's bug.
Fleet instances won't come up.

Oct 09 17:16:03 jenkins.host jenkins[3153713]: 2023-10-09 14:16:03.199+0000 [id=55]        INFO     
c.a.j.e.NoDelayProvisionStrategy#apply: label [fleet-slave]: No excess workload, provisioning not needed.

Also, with the following config:

Max Idle Minutes Before Scaledown = 5 min
Minimum Cluster Size = 3
Maximum Cluster Size = 9
Minimum Spare Size = 0
Maximum Total Uses = -1

I get 3 fleet agents for 3 jobs, and a 4th job stays in the queue with the state Waiting for next available executor on ‘[fleet-slave]'.
The 4th agent never comes up.

Plugin Version - 3.0.1
Jenkins Version - 2.414.2

PS: Reverting the plugin to version 3.0.0 fixes this problem.

@willthames

We've noticed something similar. It seems that the NodeProvisioner thinks there is sufficient available planned capacity for the fleet and doesn't need to do anything (I don't really know where that planned capacity number comes from):

Consulting com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy@5f18016f provisioning strategy with state StrategyState{label=REDACTED, snapshot=LoadStatisticsSnapshot{definedExecutors=0, onlineExecutors=0, connectingExecutors=0, busyExecutors=0, idleExecutors=0, availableExecutors=0, queueLength=2}, plannedCapacitySnapshot=2, additionalPlannedCapacity=0}
label [REDACTED]: queueLength 2 availableCapacity 2 (availableExecutors 0 plannedCapacitySnapshot 2 additionalPlannedCapacity 0)
label [REDACTED]: No excess workload, provisioning not needed.
Provisioning strategy com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy@5f18016f declared provisioning complete
ran update on REDACTED in 0ms
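Reading that second line, availableCapacity seems to be availableExecutors + plannedCapacitySnapshot + additionalPlannedCapacity = 0 + 2 + 0 = 2, and since queueLength (2) is not greater than that, the strategy decides there is no excess workload. So the puzzle is why plannedCapacitySnapshot is already 2 when definedExecutors and onlineExecutors are both 0.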

We don't have CasC set up on our Jenkins instance yet, so we can't export the configuration as code, but our setup is similar to the other reports:

  • min size 0
  • max size 2
  • min spare size 0
  • max idle 2 minutes
  • max total uses 10
  • max initial connection timeout 3 minutes
  • cloud status interval 10 seconds

@amainwaring

amainwaring commented Oct 10, 2023

@willthames you can probably just go to https://jenkins/configuration-as-code/ to export it. I just tried to export mine, though, and got the following error, which may or may not be related to this issue:

 clouds: |-
    FAILED TO EXPORT
    hudson.model.Hudson#clouds: java.lang.NullPointerException
      at com.amazon.jenkins.ec2fleet.EC2FleetCloud.getInitOnlineCheckIntervalSec(EC2FleetCloud.java:301)
    Caused: java.lang.reflect.InvocationTargetException
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      at io.jenkins.plugins.casc.Attribute._getValue(Attribute.java:469)
    Caused: io.jenkins.plugins.casc.ConfiguratorException: Can't read attribute 'initOnlineCheckIntervalSec' from com.amazon.jenkins.ec2fleet.EC2FleetCloud@124427bb
      at io.jenkins.plugins.casc.Attribute._getValue(Attribute.java:480)
      at io.jenkins.plugins.casc.Attribute.getValue(Attribute.java:233)
      at io.jenkins.plugins.casc.Attribute.equals(Attribute.java:339)

edit: ^^ might be because this is in the lab, where I've reverted the plugin to 3.0.0 but haven't restarted yet. I tried from a production Jenkins that I rolled back to 3.0.0 and restarted, and got the following:

```yaml
clouds:
  - eC2Fleet:
      cloudStatusIntervalSec: 60
      computerConnector:
        sSHConnector:
          credentialsId: "jenkins"
          launchTimeoutSeconds: 1200
          maxNumRetries: 30
          port: 22
          prefixStartSlaveCmd: "java -version && D: &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy: "nonVerifyingKeyVerificationStrategy"
      fleet: "jenkins-asg-small"
      fsRoot: "D:\\jenkins"
      idleMinutes: 5
      initOnlineTimeoutSec: 1200
      labelString: "QuickJobs asgcleanup"
      maxSize: 8
      minSize: 1
      name: "jenkins-asg-small"
      noDelayProvision: true
      numExecutors: 6
      privateIpUsed: true
      region: "us-east-1"
      restrictUsage: true
  - eC2Fleet:
      cloudStatusIntervalSec: 60
      computerConnector:
        sSHConnector:
          credentialsId: "jenkins"
          launchTimeoutSeconds: 1200
          maxNumRetries: 30
          port: 22
          prefixStartSlaveCmd: "java -version && D: &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy: "nonVerifyingKeyVerificationStrategy"
          tcpNoDelay: false
      fleet: "jenkins-asg"
      fsRoot: "D:\\jenkins"
      idleMinutes: 5
      initOnlineTimeoutSec: 1200
      labelString: "windows windows2019 SiteLord asgcleanup"
      maxSize: 24
      minSize: 1
      minSpareSize: 1
      name: "jenkins-asg"
      noDelayProvision: true
      numExecutors: 6
      privateIpUsed: true
      region: "us-east-1"
      restrictUsage: true
```

@vineeth-bandi (Collaborator)

@amainwaring
Just so I understand and can investigate properly: when you first encountered this bug, did you upgrade from 3.0.0 to 3.0.1 without restarting Jenkins? If so, I have a feeling that the most recent changes might require a restart or a manual save of the cloud to force it to be recreated.

If this is the cause of the issue, I will see if there is a workaround, since neither of those is preferred behavior. If I still cannot recreate it by starting with plugin version 3.0.0 and upgrading to 3.0.1, could you check whether Configure Clouds -> Save or a restart of Jenkins fixes your issue?

@amainwaring

Hi @vineeth-bandi! Thanks for taking a look at this!

I definitely restarted Jenkins after updating from 3.0.0 to 3.0.1, but when I downgraded our lab environment there were a lot of jobs running and I haven't had a chance to restart it yet. Prod has a bit of a different workload so I was able to properly restart those after downgrading the plugin.

When I was using 3.0.1 I tried updating the fleet config, changing min/max values, and restarting Jenkins. I did notice that one time I updated the fleet to have min 10 and max 24, and it did scale out properly to 24 nodes. But after it scaled back down to 1 it didn't scale back up again properly. We reverted to 3.0.0 pretty quickly after that, though, so I didn't have a lot of time to investigate.

@bzoks

bzoks commented Oct 10, 2023

Hi, I'm also hit by this bug, which went away with a downgrade from 3.0.1 to 3.0.0, with very similar conditions:

  • min cluster size 0
  • max cluster size 4
  • min spare size 0
  • number of executors 1
  • max idle minutes before scaledown > 0 (2)
  • restricted usage to jobs with label
but using an EC2 spot fleet, with a Linux controller and Linux agents.
We run a single pipeline with parallel stages, one stage per node (4 stages = 4 nodes).
When we run this pipeline and things work, it starts 4 nodes. With 3.0.1, it starts ZERO nodes; we left it for over 30 minutes with no change.
Interesting observation: I started one node manually on EC2 (set the spot fleet target capacity to 1); it was recognized by Jenkins and used, but only this one, with no scaling towards the max of 4.

@pdk27 (Collaborator)

pdk27 commented Oct 11, 2023

@amainwaring Thanks for the details. Can you share logs from the scenario you described? If you can generate logs with the logger configuration in the screenshot, that will be very helpful:
[screenshot: logger configuration]
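If you haven't set one of these up before: go to Manage Jenkins -> System Log -> Add new log recorder and add a logger for the com.amazon.jenkins.ec2fleet package at FINE (or ALL) level. If you manage Jenkins with CasC, a rough equivalent would be something like the following (the recorder name is arbitrary, and the exact schema may differ depending on your CasC plugin version):

```yaml
jenkins:
  log:
    recorders:
      - name: "ec2-fleet-plugin"              # arbitrary recorder name
        loggers:
          - name: "com.amazon.jenkins.ec2fleet"
            level: "FINE"                     # or "ALL" for maximum verbosity
```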

@amainwaring

@pdk27 I'm really sorry, I've reverted everything back to 3.0.0 so we can scale again. We generally scale up Jenkins at night, so when this has happened I've been engaged after-hours and was more concerned with making sure we had enough workers than with gathering logs.

@pdk27 (Collaborator)

pdk27 commented Oct 12, 2023

@amainwaring Makes sense.

@bzoks @willthames @carpool-michael @kt315ua Can you please share your logs with the logger configuration above? It will help us troubleshoot the issue as we are unable to reproduce it.

@bzoks

bzoks commented Oct 12, 2023

I upgraded again, activated logging... but the issue did NOT happen again. I also tried restarts and multiple runs; there is no way to reproduce it anymore.
I'm guessing that it MIGHT be something related to configuration changes across major versions: I upgraded to 3.0.1 from one of the 2.* versions (I don't know exactly which) and the issue was present, but after downgrading to 3.0.0 and upgrading back to 3.0.1, the issue is gone.
Perhaps this helps.
If the issue reappears after some time (for some strange reason), I will report back with all the logs.
Best regards,
Bostjan

@pdk27 (Collaborator)

pdk27 commented Oct 13, 2023

Interesting details! Thanks for sharing @bzoks.

@davorceman

davorceman commented Oct 17, 2023

I have the same issue with Jenkins 2.414.2 and plugin 3.0.1.

Linux Controller and Windows Agent

@wmcbroomd2d

I had the same experience as bzoks. Downgrading to 3.0.0 fixed the problem. Then when I upgraded back to 3.0.1 the issue was gone.

@mrtaxi

mrtaxi commented Oct 24, 2023

I had the same issue with Jenkins version 2.414.2. Downgrading to 3.0.0 fixed the problem. AWS Linux master and spot cloud agents on Linux.

@icep87

icep87 commented Oct 26, 2023

We are also experiencing the issue, and as recommended by others, downgrading to 3.0.0 fixed the problem. The logs don't say anything special except for:

label [powerful]: queueLength 10 availableCapacity 10 (availableExecutors 0 plannedCapacitySnapshot 10 additionalPlannedCapacity 0)
Oct 25, 2023 11:53:02 AM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [powerful]: No excess workload, provisioning not needed.

These are the logs from when all available agents were busy and the queue was still large.
The NoDelayProvisionStrategy didn't spin up any new nodes.

And in situations where no nodes were online, jobs could be stuck in the queue for hours. This is most visible for us at night, when mostly scheduled jobs are starting.

@vineeth-bandi (Collaborator)

vineeth-bandi commented Nov 1, 2023

We have decided to revert the changes that were part of the previous release, as we were unable to reproduce these issues or fully evaluate why they were appearing. Issue #417 will track those changes if we decide to reintroduce them. Feel free to move the discussion to that issue, or reopen this one if updating to the newest release (version 3.0.2), which reverts these changes, still causes issues.
