
Minimum Cluster Size = 0 will not spawn any agent #411

Closed
carpool-michael opened this issue Oct 5, 2023 · 20 comments

@carpool-michael

Issue Details

Describe the bug
Since v3.0.1, with the following configuration the plugin won't start an agent:
Max Idle Minutes Before Scaledown = 1
Minimum Cluster Size = 0
Maximum Cluster Size = 1

If I roll back to v3.0.0, it works again as expected.

Environment Details

Plugin Version?
3.0.1

Jenkins Version?
2.401.1

Spot Fleet or ASG?
ASG

Label based fleet?
Yes

Linux or Windows?
Linux Controller
Windows Agents

EC2Fleet Configuration as Code
```yaml
clouds:
  - eC2Fleet:
      addNodeOnlyIfRunning: false
      alwaysReconnect: true
      awsCredentialsId: ***
      cloudStatusIntervalSec: 10
      computerConnector:
        sSHConnector:
          credentialsId: ***
          javaPath: "D:\Jenkins\jdk-11\bin\java.exe"
          launchTimeoutSeconds: 60
          maxNumRetries: 10
          port: 22
          prefixStartSlaveCmd: "cd /d D:\ &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy:
            manuallyTrustedKeyVerificationStrategy:
              requireInitialManualTrust: false
      disableTaskResubmit: false
      fleet: "JenkinsWindowsAgents"
      fsRoot: "D:\Jenkins"
      idleMinutes: 1
      initOnlineCheckIntervalSec: 15
      initOnlineTimeoutSec: 180
      labelString: "windows"
      maxSize: 1
      maxTotalUses: -1
      minSize: 0
      minSpareSize: 0
      name: "Windows Fleet"
      noDelayProvision: false
      numExecutors: 1
      privateIpUsed: true
      region: "eu-west-3"
      restrictUsage: true
      scaleExecutorsByWeight: false
```

Anything else unique about your setup?
No

vineeth-bandi self-assigned this Oct 5, 2023
@vineeth-bandi (Collaborator)

I wasn't able to reproduce this bug with the v3.0.1 plugin and Jenkins version 2.401.1. Do you have a job scheduled with windows set as the label?

Note that the default value of minSpareSize is 0, which means no nodes are provisioned unless a job with a matching label is actively running or in the queue. Alternatively, if restrictUsage is set to false, then any job can run on this cloud.
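For example, if you want one agent kept warm even when nothing is queued, the relevant keys in your cloud block would look something like this (illustrative values only):

```yaml
clouds:
  - eC2Fleet:
      # ...rest of your existing configuration...
      minSize: 0
      minSpareSize: 1      # provision one spare agent even with an empty queue
      restrictUsage: true  # or false, if unlabeled jobs should also run on this cloud
```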

If these don't apply or you've already tried them, can you include the logs emitted by Jenkins?

@amainwaring

We are also having scaling issues, but our min size is 1. We have restrictUsage set so only jobs that match the specific labels can run on each ASG. One ASG has minSpareSize set to 1 and the other has it set to 0, but neither can scale. I see messages like the following in the logs even though there are a large number of items waiting in the queue. If I hover over the queued jobs they are all waiting on the one specific worker, but if I revert the plugin to 3.0.0 and hover over a job I can see that it's waiting on other nodes in the fleet rather than just the single instance.
[screenshot]

Oct 04, 2023 11:30:09 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:09 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows||windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:19 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:19 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [windows||windows2019]: No excess workload, provisioning not needed.
Oct 04, 2023 11:30:24 PM INFO com.amazon.jenkins.ec2fleet.EC2RetentionStrategy postJobAction

@amainwaring

It looks like 3.0.1 creates new nodes when jobs first get in the queue, but then the jobs don't actually execute on the new nodes and only queue on the one existing node.

@vineeth-bandi (Collaborator)

@amainwaring
I still was not able to reproduce this bug. I created two clouds, cloud1 with minSize = 0 and minSpareSize = 0 and cloud2 with minSize = 0 and minSpareSize = 1, each with a different ASG and label. I then created a job that runs with label1 while only cloud2 has a node running (for label2). I observed a new instance being created and a new node being added after Jenkins successfully connected to the new instance, and the job ran successfully on the label1 node managed by cloud1. A rough sketch of that setup is below.
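For reference, that repro setup looks roughly like this in CasC form (the ASG names, region, and maxSize values here are placeholders, not an exact export):

```yaml
clouds:
  - eC2Fleet:
      name: "cloud1"
      fleet: "asg-for-label1"      # placeholder ASG name
      labelString: "label1"
      region: "us-east-1"          # placeholder region
      minSize: 0
      minSpareSize: 0
      maxSize: 1                   # placeholder; any value > 0
      numExecutors: 1
      restrictUsage: true
  - eC2Fleet:
      name: "cloud2"
      fleet: "asg-for-label2"      # placeholder ASG name
      labelString: "label2"
      region: "us-east-1"
      minSize: 0
      minSpareSize: 1              # keeps one spare node running for label2
      maxSize: 1
      numExecutors: 1
      restrictUsage: true
```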

It is possible that there are some differences in our setups that create the conditions for this bug. Can you share your CasC config from Manage Jenkins -> Configuration as Code -> View Configuration (redacting information as needed), as well as the logs from at least one offline node (Manage Jenkins -> Nodes and Clouds -> SpecificNode -> Log)? Thanks.

@kt315ua

kt315ua commented Oct 9, 2023

I can confirm the original poster's bug.
Fleet instances won't come up.

Oct 09 17:16:03 jenkins.host jenkins[3153713]: 2023-10-09 14:16:03.199+0000 [id=55]        INFO     
c.a.j.e.NoDelayProvisionStrategy#apply: label [fleet-slave]: No excess workload, provisioning not needed.

Also, with the following config:

Max Idle Minutes Before Scaledown = 5 min
Minimum Cluster Size = 3
Maximum Cluster Size = 9
Minimum Spare Size = 0
Maximum Total Uses = -1

I get 3 fleet agents for 3 jobs, and a 4th job stays in the queue with the state Waiting for next available executor on ‘[fleet-slave]'.
The 4th agent never comes up.

Plugin Version - 3.0.1
Jenkins Version - 2.414.2

PS: Reverting the plugin to version 3.0.0 fixes this problem.

@willthames

We've noticed something similar. It seems that the NodeProvisioner thinks there is sufficient available planned capacity for the fleet and doesn't need to do anything (I don't really know where that planned capacity number comes from):

Consulting com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy@5f18016f provisioning strategy with state StrategyState{label=REDACTED, snapshot=LoadStatisticsSnapshot{definedExecutors=0, onlineExecutors=0, connectingExecutors=0, busyExecutors=0, idleExecutors=0, availableExecutors=0, queueLength=2}, plannedCapacitySnapshot=2, additionalPlannedCapacity=0}
label [REDACTED]: queueLength 2 availableCapacity 2 (availableExecutors 0 plannedCapacitySnapshot 2 additionalPlannedCapacity 0)
label [REDACTED]: No excess workload, provisioning not needed.
Provisioning strategy com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy@5f18016f declared provisioning complete
ran update on REDACTED in 0ms
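Reading that second line, availableCapacity seems to be availableExecutors + plannedCapacitySnapshot + additionalPlannedCapacity = 0 + 2 + 0 = 2, and since queueLength (2) is not greater than that, the strategy decides there is no excess workload. So the puzzle is why plannedCapacitySnapshot is already 2 when definedExecutors and onlineExecutors are both 0.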

We don't have CasC set up on our Jenkins instance yet, so we can't export the configuration as code, but our setup is similar to the other reports:

  • min size 0
  • max size 2
  • min spare size 0
  • max idle 2 minutes
  • max total uses 10
  • max initial connection timeout 3 minutes
  • cloud status interval 10 seconds

@amainwaring

amainwaring commented Oct 10, 2023

@willthames you can probably just go to https://jenkins/configuration-as-code/ to export it. I just tried to export mine, though, and got the following error, which may or may not be related to this issue:

 clouds: |-
    FAILED TO EXPORT
    hudson.model.Hudson#clouds: java.lang.NullPointerException
      at com.amazon.jenkins.ec2fleet.EC2FleetCloud.getInitOnlineCheckIntervalSec(EC2FleetCloud.java:301)
    Caused: java.lang.reflect.InvocationTargetException
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      at io.jenkins.plugins.casc.Attribute._getValue(Attribute.java:469)
    Caused: io.jenkins.plugins.casc.ConfiguratorException: Can't read attribute 'initOnlineCheckIntervalSec' from com.amazon.jenkins.ec2fleet.EC2FleetCloud@124427bb
      at io.jenkins.plugins.casc.Attribute._getValue(Attribute.java:480)
      at io.jenkins.plugins.casc.Attribute.getValue(Attribute.java:233)
      at io.jenkins.plugins.casc.Attribute.equals(Attribute.java:339)

edit: ^^ might be because this is in the lab, where I've reverted the plugin to 3.0.0 but haven't restarted yet. I tried from a production Jenkins that I rolled back to 3.0.0 and restarted, and got the following:

```yaml
clouds:
  - eC2Fleet:
      cloudStatusIntervalSec: 60
      computerConnector:
        sSHConnector:
          credentialsId: "jenkins"
          launchTimeoutSeconds: 1200
          maxNumRetries: 30
          port: 22
          prefixStartSlaveCmd: "java -version && D: &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy: "nonVerifyingKeyVerificationStrategy"
      fleet: "jenkins-asg-small"
      fsRoot: "D:\\jenkins"
      idleMinutes: 5
      initOnlineTimeoutSec: 1200
      labelString: "QuickJobs asgcleanup"
      maxSize: 8
      minSize: 1
      name: "jenkins-asg-small"
      noDelayProvision: true
      numExecutors: 6
      privateIpUsed: true
      region: "us-east-1"
      restrictUsage: true
  - eC2Fleet:
      cloudStatusIntervalSec: 60
      computerConnector:
        sSHConnector:
          credentialsId: "jenkins"
          launchTimeoutSeconds: 1200
          maxNumRetries: 30
          port: 22
          prefixStartSlaveCmd: "java -version && D: &&"
          retryWaitTime: 15
          sshHostKeyVerificationStrategy: "nonVerifyingKeyVerificationStrategy"
          tcpNoDelay: false
      fleet: "jenkins-asg"
      fsRoot: "D:\\jenkins"
      idleMinutes: 5
      initOnlineTimeoutSec: 1200
      labelString: "windows windows2019 SiteLord asgcleanup"
      maxSize: 24
      minSize: 1
      minSpareSize: 1
      name: "jenkins-asg"
      noDelayProvision: true
      numExecutors: 6
      privateIpUsed: true
      region: "us-east-1"
      restrictUsage: true
```

@vineeth-bandi (Collaborator)

@amainwaring
Just so I understand and can investigate properly: when you first encountered this bug, did you upgrade from 3.0.0 to 3.0.1 without restarting Jenkins? If so, I have a feeling that the most recent changes might require a restart or a manual save of the cloud to force it to be recreated.

If this is the cause of the issue, I will see if there is a workaround, since neither of those is preferred behavior. If I still cannot recreate it by starting with plugin version 3.0.0 and upgrading to 3.0.1, could you check whether Configure Clouds -> Save or a restart of Jenkins fixes your issue?

@amainwaring

Hi @vineeth-bandi! Thanks for taking a look at this!

I definitely restarted Jenkins after updating from 3.0.0 to 3.0.1, but when I downgraded our lab environment there were a lot of jobs running and I haven't had a chance to restart it yet. Prod has a bit of a different workload so I was able to properly restart those after downgrading the plugin.

When I was using 3.0.1 I tried updating the fleet config, changing min/max values, and restarting Jenkins. I did notice that one time I updated the fleet to have min 10 and max 24, and it did scale out properly to 24 nodes. But after it scaled back down to 1 it didn't scale back up again properly. We reverted to 3.0.0 pretty quickly after that, though, so I didn't have a lot of time to investigate.

@bzoks

bzoks commented Oct 10, 2023

Hi, I'm also hit by this bug, which went away with a downgrade from 3.0.1 to 3.0.0, with very similar conditions:

  • min cluster size 0
  • max cluster size 4
  • min spare size 0
  • number of executors 1
  • max idle minutes before scaledown > 0 (2)
  • restricted usage to jobs with label
but using an EC2 spot fleet, with a Linux controller and Linux agents.
We run a single pipeline with parallel stages, one stage per node (4 stages = 4 nodes).
When we run this pipeline and things work, it starts 4 nodes. With 3.0.1, it starts ZERO nodes; we left it for over 30 minutes with no change.
Interesting observation: I started one node manually on EC2 (set the spot fleet target capacity to 1); it was recognized by Jenkins and used, but only this one, with no scaling towards the max of 4.

@pdk27 (Collaborator)

pdk27 commented Oct 11, 2023

@amainwaring Thanks for the details. Can you share logs from the scenario you described? If you can generate logs with the logger configuration in the screenshot, that will be very helpful:
[screenshot: logger configuration]
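If you haven't set one of these up before: go to Manage Jenkins -> System Log -> Add new log recorder and add a logger for the com.amazon.jenkins.ec2fleet package at FINE (or ALL) level. If you manage Jenkins with CasC, a rough equivalent would be something like the following (the recorder name is arbitrary, and the exact schema may differ depending on your CasC plugin version):

```yaml
jenkins:
  log:
    recorders:
      - name: "ec2-fleet-plugin"              # arbitrary recorder name
        loggers:
          - name: "com.amazon.jenkins.ec2fleet"
            level: "FINE"                     # or "ALL" for maximum verbosity
```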

@amainwaring

@pdk27 I'm really sorry, I've reverted everything back to 3.0.0 so we can scale again. We generally scale up Jenkins at night, so when this has happened I've been engaged after-hours and was more concerned with making sure we had enough workers than with gathering logs.

@pdk27 (Collaborator)

pdk27 commented Oct 12, 2023

@amainwaring Makes sense.

@bzoks @willthames @carpool-michael @kt315ua Can you please share your logs with the logger configuration above? It will help us troubleshoot the issue as we are unable to reproduce it.

@bzoks

bzoks commented Oct 12, 2023

I upgraded again, activated logging... but the issue did NOT happen again. I also tried restarts and multiple runs; there is no way to reproduce it anymore.
I'm guessing that it MIGHT be something related to configuration changes across major versions: I upgraded to 3.0.1 from one of the 2.* versions (I don't know exactly which) and the issue was present, but after downgrading to 3.0.0 and upgrading back to 3.0.1, the issue is gone.
Perhaps this helps.
If the issue reappears after some time (for some strange reason), I will report back with all the logs.
Best regards,
Bostjan

@pdk27 (Collaborator)

pdk27 commented Oct 13, 2023

Interesting details! Thanks for sharing @bzoks.

@davorceman

davorceman commented Oct 17, 2023

I have the same issue with Jenkins 2.414.2 and plugin 3.0.1.

Linux Controller and Windows Agent

@wmcbroomd2d

I had the same experience as bzoks. Downgrading to 3.0.0 fixed the problem. Then when I upgraded back to 3.0.1 the issue was gone.

@mrtaxi

mrtaxi commented Oct 24, 2023

I had the same issue with Jenkins version 2.414.2. Downgrading to 3.0.0 fixed the problem. AWS Linux master and spot cloud agents on Linux.

@icep87

icep87 commented Oct 26, 2023

We are also experiencing the issue, and as recommended by others, downgrading to 3.0.0 fixed the problem. The logs don't say anything special except for:

label [powerful]: queueLength 10 availableCapacity 10 (availableExecutors 0 plannedCapacitySnapshot 10 additionalPlannedCapacity 0)
Oct 25, 2023 11:53:02 AM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [powerful]: No excess workload, provisioning not needed.

These are the logs from when all available agents were busy and the queue was still large.
The NoDelayProvisionStrategy didn't spin up any new nodes.

And in situations where no nodes were online, jobs could be stuck in the queue for hours. This is most visible for us at night, when mostly scheduled jobs are starting.

@vineeth-bandi (Collaborator)

vineeth-bandi commented Nov 1, 2023

We have decided to revert the changes that were part of the previous release, as we were unable to reproduce these issues or fully evaluate why they were appearing. Issue #417 will track those changes if we decide to reintroduce them. Feel free to move the discussion to that issue, or reopen this one if updating to the newest release (version 3.0.2), which reverts these changes, still causes issues.
