[JENKINS-49707] If a KubernetesComputer disconnects, remove the KubernetesSlave #461

Merged
merged 23 commits into from
Jul 10, 2019
Changes from 18 commits
Commits
99ad1f2
[JENKINS-49707] If a KubernetesComputer disconnects, remove the Kuber…
jglick Apr 30, 2019
cae2572
workflow-step-api-plugin.version
jglick Apr 30, 2019
eed014f
Updated to https://github.com/jenkinsci/workflow-durable-task-step-pl…
jglick Apr 30, 2019
c3c98a1
Incremental deployment.
jglick Apr 30, 2019
4510310
JDK 11 Javadoc failure.
jglick Apr 30, 2019
8569e83
Setting surefire.rerunFailingTestsCount to 0 not 1.
jglick May 1, 2019
f11a182
Copy-pasta.
jglick May 1, 2019
ce032c6
Redesigned fix to remove agents when a pod is deleted, rather than me…
jglick May 1, 2019
61307bf
We will not in general be permitted to watch pods at cluster scope, s…
jglick May 1, 2019
6da8cf6
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jun 5, 2019
2c3b39e
Bump.
jglick Jun 5, 2019
5b30251
RequireUpperBoundDeps
jglick Jun 5, 2019
3f00eb1
Need to update workflow-cps to interpret DynamicContext.
jglick Jun 5, 2019
6f60c9e
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jun 11, 2019
b9b5f9c
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jun 11, 2019
210e0f7
Test flake pending #496.
jglick Jun 11, 2019
a25d24f
Bump.
jglick Jun 11, 2019
e71a5c5
Senseless to even try to build against _older_ LTS lines than our min…
jglick Jun 11, 2019
3c49b62
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jun 13, 2019
a5269d3
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jul 2, 2019
70153a9
workflow-durable-task-step 2.32
jglick Jul 5, 2019
a4cccb7
Merge branch 'master' into removingAgentIsFatal-JENKINS-49707
jglick Jul 5, 2019
bb4e297
Using improved assertion after #496.
jglick Jul 5, 2019
5 changes: 4 additions & 1 deletion Jenkinsfile
@@ -1 +1,4 @@
-buildPlugin(configurations: buildPlugin.recommendedConfigurations().findAll { it.platform == 'linux' })
+buildPlugin(configurations: [
+    [platform: 'linux', jdk: '8', jenkins: null],
+    [platform: 'linux', jdk: '11', jenkins: null],
+])
10 changes: 6 additions & 4 deletions pom.xml
@@ -4,6 +4,7 @@
     <groupId>org.jenkins-ci.plugins</groupId>
     <artifactId>plugin</artifactId>
     <version>3.43</version>
+    <relativePath />
   </parent>

   <groupId>org.csanchez.jenkins.plugins</groupId>
@@ -45,11 +46,12 @@
     <connectorHost />
     <jenkins.host.address />
     <java.level>8</java.level>
-    <jenkins.version>2.138.4</jenkins.version>
+    <jenkins.version>2.176.1</jenkins.version>
     <no-test-jar>false</no-test-jar>
     <useBeta>true</useBeta>
     <surefire.rerunFailingTestsCount>0</surefire.rerunFailingTestsCount>
     <pipeline-model-definition.version>1.3.7</pipeline-model-definition.version>
+    <workflow-support-plugin.version>3.3</workflow-support-plugin.version>
     <workflow-step-api-plugin.version>2.20</workflow-step-api-plugin.version>
     <slf4jVersion>1.7.26</slf4jVersion>
   </properties>
@@ -144,19 +146,19 @@
     <dependency>
       <groupId>org.jenkins-ci.plugins.workflow</groupId>
       <artifactId>workflow-support</artifactId>
-      <version>3.3</version>
+      <version>${workflow-support-plugin.version}</version>
       <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.jenkins-ci.plugins.workflow</groupId>
       <artifactId>workflow-durable-task-step</artifactId>
-      <version>2.28</version>
+      <version>2.32-rc934.23119201f0b5</version> <!-- TODO https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/104 -->
       <scope>test</scope>
     </dependency>
     <dependency> <!-- SemaphoreStep -->
       <groupId>org.jenkins-ci.plugins.workflow</groupId>
       <artifactId>workflow-support</artifactId>
-      <version>3.0</version>
+      <version>${workflow-support-plugin.version}</version>
       <classifier>tests</classifier>
       <scope>test</scope>
     </dependency>
@@ -0,0 +1,142 @@
/*
 * Copyright 2019 CloudBees, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.csanchez.jenkins.plugins.kubernetes.pod.retention;

import hudson.Extension;
import hudson.model.Computer;
import hudson.model.Node;
import hudson.model.TaskListener;
import hudson.slaves.Cloud;
import hudson.slaves.ComputerListener;
import hudson.slaves.EphemeralNode;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;
import java.io.IOException;
import java.util.ArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;
import jenkins.model.Jenkins;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave;

/**
 * Checks for deleted pods corresponding to {@link KubernetesSlave} and ensures the node is removed from Jenkins too.
 * <p>If the pod has been deleted, all of the associated state (running user processes, workspace, etc.) must also be gone;
 * so there is no point in retaining this agent definition any further.
 * ({@link KubernetesSlave} is not an {@link EphemeralNode}: it <em>does</em> support running across Jenkins restarts.)
 * <p>Note that pod retention policies other than the default {@link Never} may disable this system,
 * unless some external process or garbage collection policy results in pod deletion.
 */
@Extension
public class Reaper extends ComputerListener implements Watcher<Pod> {

    private static final Logger LOGGER = Logger.getLogger(Reaper.class.getName());

    /**
     * Activate this feature only if and when some Kubernetes agent is actually used.
     * Avoids touching the API server when this plugin is not even in use.
     */
    private final AtomicBoolean activated = new AtomicBoolean();

    @Override
    public void onOnline(Computer c, TaskListener listener) throws IOException, InterruptedException {
        if (c instanceof KubernetesComputer && activated.compareAndSet(false, true)) {
            activate();
        }
    }

    private void activate() {
        LOGGER.fine("Activating reaper");
        // First check all existing nodes to see if they still have active pods.
        // (We may have missed deletion events while Jenkins was shut off,
        // or pods may have been deleted before any Kubernetes agent was brought online.)
        for (Node n : new ArrayList<>(Jenkins.get().getNodes())) {
            if (!(n instanceof KubernetesSlave)) {
                continue;
            }
            KubernetesSlave ks = (KubernetesSlave) n;
            String ns = ks.getNamespace();
            String name = ks.getPodName();
            try {
                // TODO more efficient to do a single (or paged) list request, but tricky since there may be multiple clouds,
                // and even within a single cloud an agent pod is permitted to use a nondefault namespace,
                // yet we do not want to do an unnamespaced pod list for RBAC reasons.
                // Could use a hybrid approach: first list all pods in the configured namespace for all clouds;
                // then go back and individually check any unmatched agents with their configured namespace.
                if (ks.getKubernetesCloud().connect().pods().inNamespace(ns).withName(name).get() == null) {
                    LOGGER.info(() -> ns + "/" + name + " seems to have been deleted, so removing corresponding Jenkins agent");
                    Jenkins.get().removeNode(ks);
                } else {
                    LOGGER.fine(() -> ns + "/" + name + " still seems to exist, OK");
                }
            } catch (Exception x) {
                LOGGER.log(Level.WARNING, "failed to do initial reap check for " + ns + "/" + name, x);
            }
        }
        // Now set up a watch for any subsequent pod deletions.
        for (Cloud c : Jenkins.get().clouds) {
            if (!(c instanceof KubernetesCloud)) {
                continue;
            }
            KubernetesCloud kc = (KubernetesCloud) c;
            try {
                KubernetesClient client = kc.connect();
                client.pods().inNamespace(client.getNamespace()).watch(this);
            } catch (Exception x) {
                LOGGER.log(Level.WARNING, "failed to set up watcher on " + kc.getDisplayName(), x);
            }
        }
    }

    @Override
    public void eventReceived(Watcher.Action action, Pod pod) {
        if (action == Watcher.Action.DELETED) {
            String ns = pod.getMetadata().getNamespace();
            String name = pod.getMetadata().getName();
            for (Node n : new ArrayList<>(Jenkins.get().getNodes())) {
                if (!(n instanceof KubernetesSlave)) {
                    continue;
                }
                KubernetesSlave ks = (KubernetesSlave) n;
                if (ks.getNamespace().equals(ns) && ks.getPodName().equals(name)) {
                    LOGGER.info(() -> ns + "/" + name + " was just deleted, so removing corresponding Jenkins agent");
                    try {
                        Jenkins.get().removeNode(ks);
                        return;
                    } catch (Exception x) {
                        LOGGER.log(Level.WARNING, "failed to reap " + ns + "/" + name, x);
                    }
                }
            }
            LOGGER.fine(() -> "received deletion notice for " + ns + "/" + name + " which does not seem to correspond to any Jenkins agent");
        }
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        // TODO ignore, or do we need to manually reattach the watcher?
        // AllContainersRunningPodWatcher is not reattached, but this is expected to be short-lived,
        // useful only until the containers of a single pod start running.
        // (At least when using kubernetes-client/java, the connection gets closed after 2m on GKE
        // and you need to rerun the watch. Does the fabric8io client wrap this?)
    }

}
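The TODO in `activate()` outlines a hybrid check: one list request for the configured namespace, then individual lookups only for agents that the listing did not match. A minimal sketch of that idea follows, in plain Java with no Kubernetes client involved. `AgentRef`, `HybridReapCheck`, `listPods`, and `podExists` are hypothetical names standing in for `KubernetesSlave` and the client calls, and the sketch assumes a single cloud with one configured namespace; it is not code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Hypothetical stand-in for the namespace and pod name recorded on a KubernetesSlave. */
final class AgentRef {
    final String namespace;
    final String podName;

    AgentRef(String namespace, String podName) {
        this.namespace = namespace;
        this.podName = podName;
    }

    @Override
    public String toString() {
        return namespace + "/" + podName;
    }
}

final class HybridReapCheck {

    /**
     * Returns the agents whose pods no longer exist.
     * listPods (hypothetical): namespace -> names of pods in that namespace, one bulk request.
     * podExists (hypothetical): (namespace, pod name) -> whether that single pod exists.
     */
    static List<AgentRef> findOrphans(
            List<AgentRef> agents,
            String configuredNamespace,
            Function<String, Set<String>> listPods,
            BiPredicate<String, String> podExists) {
        // Step 1: a single namespaced list request instead of one request per agent.
        Set<String> listed = listPods.apply(configuredNamespace);
        List<AgentRef> orphans = new ArrayList<>();
        for (AgentRef agent : agents) {
            boolean matched = agent.namespace.equals(configuredNamespace) && listed.contains(agent.podName);
            // Step 2: only unmatched agents cost an individual lookup in their own namespace.
            if (!matched && !podExists.test(agent.namespace, agent.podName)) {
                orphans.add(agent);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        List<AgentRef> agents = Arrays.asList(
                new AgentRef("jenkins", "a"), new AgentRef("jenkins", "b"),
                new AgentRef("other", "c"), new AgentRef("other", "d"));
        Set<String> inConfiguredNamespace = new HashSet<>(Arrays.asList("a"));
        List<AgentRef> orphans = findOrphans(
                agents, "jenkins",
                ns -> inConfiguredNamespace,
                (ns, name) -> ns.equals("other") && name.equals("c"));
        System.out.println("agents to remove: " + orphans);
    }
}
```

Neither step issues an unnamespaced list, which is the RBAC constraint the TODO mentions. Handling several clouds would mean repeating step 1 per cloud before falling back to step 2.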
@@ -30,30 +30,30 @@
import static org.junit.Assert.*;

import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.stream.Collectors;

import hudson.model.Result;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import org.csanchez.jenkins.plugins.kubernetes.PodAnnotation;
import org.csanchez.jenkins.plugins.kubernetes.PodTemplate;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.jenkinsci.plugins.workflow.test.steps.SemaphoreStep;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;
import org.jvnet.hudson.test.Issue;
import org.jvnet.hudson.test.JenkinsRule;
import org.jvnet.hudson.test.JenkinsRuleNonLocalhost;

import hudson.model.Result;
import java.util.Locale;
import org.jvnet.hudson.test.Issue;

/**
* @author Carlos Sanchez
*/
@@ -356,4 +356,17 @@ public void runInPodWithRetention() throws Exception {
        r.assertBuildStatusSuccess(r.waitForCompletion(b));
        assertTrue(deletePods(cloud.connect(), getLabels(this, name), true));
    }

    @Issue("JENKINS-49707")
    @Test
    public void terminatedPod() throws Exception {
        r.waitForMessage("+ sleep", b);
        deletePods(cloud.connect(), getLabels(this, name), false);
        r.assertBuildStatus(Result.ABORTED, r.waitForCompletion(b));
        // TODO could use waitForMessage after #496
        while (!JenkinsRule.getLog(b).contains(new ExecutorStepExecution.RemovedNodeCause().getShortDescription())) {
            Thread.sleep(100);
        }
    }

}
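The `while (...) Thread.sleep(100)` loop in `terminatedPod` has no bound of its own, so it presumably relies on the surrounding test timeout if the message never appears. As a hedged alternative, a polling helper with an explicit deadline could look like the sketch below. `Poll.until` is a hypothetical helper written for illustration; it is not part of `JenkinsRule` or of this PR.

```java
import java.util.function.BooleanSupplier;

final class Poll {

    /**
     * Polls the condition until it holds or timeoutMillis elapses.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean until(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up; the caller decides how to report this
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException x) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StringBuffer log = new StringBuffer(); // stands in for the build log
        new Thread(() -> log.append("agent was removed")).start();
        boolean seen = until(() -> log.toString().contains("was removed"), 5_000, 100);
        System.out.println("message seen: " + seen);
    }
}
```

In a test, the boolean result would typically be wrapped in `assertTrue` so that a missing message fails with a clear assertion instead of a generic timeout.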
@@ -56,11 +56,15 @@
import org.jvnet.hudson.test.RestartableJenkinsNonLocalhostRule;

import hudson.model.Node;
import hudson.model.Result;
import hudson.slaves.DumbSlave;
import hudson.slaves.JNLPLauncher;
import hudson.slaves.NodeProperty;
import hudson.slaves.RetentionStrategy;
import jenkins.model.JenkinsLocationConfiguration;
import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution;
import org.jvnet.hudson.test.Issue;
import org.jvnet.hudson.test.JenkinsRule;

public class RestartPipelineTest {
    protected static final String CONTAINER_ENV_VAR_VALUE = "container-env-var-value";
@@ -187,6 +191,37 @@ public void runInPodWithRestart() throws Exception {
        });
    }

    @Issue("JENKINS-49707")
    @Test
    public void terminatedPodAfterRestart() throws Exception {
        story.then(r -> {
            configureCloud();
            WorkflowJob p = r.jenkins.createProject(WorkflowJob.class, "p");
            p.setDefinition(new CpsFlowDefinition(loadPipelineScript("terminatedPodAfterRestart.groovy"), true));
            WorkflowRun b = p.scheduleBuild2(0).waitForStart();
            r.waitForMessage("+ sleep", b);
        });
        story.then(r -> {
            WorkflowRun b = r.jenkins.getItemByFullName("p", WorkflowJob.class).getBuildByNumber(1);
            r.waitForMessage("Ready to run", b);
            // Note that the test is cheating here slightly.
            // The watch in Reaper is still running across the in-JVM restarts,
            // whereas in production it would have been cancelled during the shutdown.
            // But it does not matter since we are waiting for the agent to come back online after the restart,
            // which is sufficient trigger to reactivate the reaper.
            // Indeed we get two Reaper instances running, which independently remove the node.
            deletePods(cloud.connect(), getLabels(this, name), false);
            r.assertBuildStatus(Result.ABORTED, r.waitForCompletion(b));
            while (!JenkinsRule.getLog(b).contains(new ExecutorStepExecution.RemovedNodeCause().getShortDescription())) {
                // TODO JenkinsRule.waitForMessage has a race condition w.r.t. the termination cause printed by WorkflowRun.finish
                Thread.sleep(100);
            }
            // Currently the logic in ExecutorStepExecution cannot handle a Jenkins restart so it prints the following.
            // It does not matter since DurableTaskStep redundantly implements the same check.
            r.assertLogContains(" was deleted, but do not have a node body to cancel", b);
        });
    }

    @Test
    public void getContainerLogWithRestart() throws Exception {
        story.then(r -> {
Expand Down
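The comment in `terminatedPodAfterRestart` notes that a restart yields a second `Reaper` instance, which activates again as soon as an agent comes back online. That follows from the `AtomicBoolean.compareAndSet(false, true)` guard in `Reaper.onOnline`: the flag is per instance, so each instance activates at most once. A minimal sketch of that guard in isolation follows; `OnceActivator` and its `Runnable` parameter are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class OnceActivator {
    private final AtomicBoolean activated = new AtomicBoolean();
    private final Runnable activation;

    OnceActivator(Runnable activation) {
        this.activation = activation;
    }

    /** Called for every "computer came online" event; runs the activation at most once per instance. */
    void onOnline(boolean kubernetesAgent) {
        // compareAndSet flips false -> true for exactly one caller, even if events arrive concurrently.
        if (kubernetesAgent && activated.compareAndSet(false, true)) {
            activation.run();
        }
    }

    public static void main(String[] args) {
        AtomicInteger activations = new AtomicInteger();
        OnceActivator beforeRestart = new OnceActivator(activations::incrementAndGet);
        beforeRestart.onOnline(false); // a non-Kubernetes agent does not activate anything
        beforeRestart.onOnline(true);  // first Kubernetes agent: activates
        beforeRestart.onOnline(true);  // later agents: no-op
        OnceActivator afterRestart = new OnceActivator(activations::incrementAndGet);
        afterRestart.onOnline(true);   // a fresh instance starts with its own flag
        System.out.println("activations: " + activations.get());
    }
}
```

Because the flag is never reset, an instance that has activated stays activated; re-arming it after a closed watch would need separate logic, which the sketch does not attempt.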
@@ -0,0 +1,9 @@
podTemplate(label: 'terminatedPod', containers: [
        containerTemplate(name: 'busybox', image: 'busybox', ttyEnabled: true, command: '/bin/cat'),
    ]) {
    node ('terminatedPod') {
        container('busybox') {
            sh 'sleep 9999999'
        }
    }
}
@@ -0,0 +1,11 @@
package org.csanchez.jenkins.plugins.kubernetes.pipeline

podTemplate(label: 'terminatedPodAfterRestart', containers: [
        containerTemplate(name: 'busybox', image: 'busybox', ttyEnabled: true, command: '/bin/cat'),
    ]) {
    node ('terminatedPodAfterRestart') {
        container('busybox') {
            sh 'sleep 9999999'
        }
    }
}