Backup/Restore issue with Multibranch plugin #104

Closed
blucas opened this issue Sep 16, 2019 · 27 comments
Labels: bug, enhancement

Comments

blucas commented Sep 16, 2019

I'm using Jenkins Operator v0.2.0 and backup-pvc v0.0.6

When restoring a Bitbucket Team job (Organization Folder) using the cloudbees-bitbucket-branch-source:2.4.6 plugin, the operator fails to "fully" restore the folder and jobs. After the restore completes and when you navigate to the Bitbucket Team job in the Jenkins UI, it displays as if it has yet to scan the Bitbucket Team and all of that team's repositories / branches.

I checked the backup file. It does contain the builds for each repository/branch and did successfully restore them on disk. The only way to get the builds to display after restore is to issue a new "Scan Organization Folder Now" operation on the Organization Folder. But this creates further issues as that scan assumes it's the first and so resets nextBuildNumber to 1.

The issue is that the plugin or framework expects that the job's config.xml file is also restored. If I remove --exclude jobs/*/config.xml from backup.sh and trigger a restore, then everything displays as expected in the UI. Jenkins knows what the nextBuildNumber should be for all jobs under the Organization Folder.

I would like to suggest that excluding the config.xml file become a configurable setting. For example, add an environment variable EXCLUDE_CONFIG_XML=true|false that defaults to true. Setting it to false in the backup container's environment variables would disable the exclusion.
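
A minimal sketch of how such a toggle could look, assuming backup.sh archives the jobs directory with tar. EXCLUDE_CONFIG_XML is the variable name proposed above, not an existing setting, and the backup_jobs function and its arguments are hypothetical; the operator's real backup.sh may be structured differently.

```shell
#!/bin/sh
# Hypothetical sketch: EXCLUDE_CONFIG_XML is the proposed variable, not an existing option.

# Archive the jobs directory of a Jenkins home into a tarball.
#   $1: Jenkins home directory (assumed to contain a "jobs" directory)
#   $2: path of the archive to create
backup_jobs() {
    jenkins_home="$1"
    archive="$2"
    if [ "${EXCLUDE_CONFIG_XML:-true}" = "true" ]; then
        # Default: keep the current behaviour and leave job config.xml files out.
        tar -C "$jenkins_home" --exclude='jobs/*/config.xml' -czf "$archive" jobs
    else
        # Opt-out: back up config.xml together with the build history.
        tar -C "$jenkins_home" -czf "$archive" jobs
    fi
}
```

With this shape, running `EXCLUDE_CONFIG_XML=false backup_jobs /var/jenkins/home /backup/1.tar.gz` (both paths are placeholders) would produce an archive that still contains each job's config.xml.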

@tomaszsek

Hi @blucas

How do you create your Bitbucket jobs? Do you use Groovy scripts or Configuration as Code?
You can force Jenkins to reload its configuration from disk by running the Groovy script Jenkins.instance.reload().

Cheers

blucas commented Sep 16, 2019

Hi @tomaszsek

We create a single job using JobDSL via the seed-job agent. The DSL is below with some sections omitted. I am aware of the reload functionality. I did issue that command, but it still didn't display the folders/jobs under my team/organization folder.

organizationFolder('my-bitbucket-team') {
    description('This contains branch source jobs for Bitbucket Team: My Bitbucket Team')
    displayName('My Bitbucket Team')
    triggers {
        periodic(30) // minutes
    }

    organizations {
        bitbucket {
            // Credentials MUST be of type USERNAME-PASSWORD
            credentialsId('bitbucket-credentials')
            repoOwner('my-bitbucket-team')
            traits {
            }
        }
    }

    // We need to configure this stuff by hand until JobDSL gets some support (https://github.com/jenkinsci/git-plugin/pull/595)
    configure { node ->
        def traits = node / navigators / 'com.cloudbees.jenkins.plugins.bitbucket.BitbucketSCMNavigator' / traits
        traits << 'com.cloudbees.jenkins.plugins.bitbucket.BranchDiscoveryTrait' {
            strategyId(1)
        }
        traits << 'com.cloudbees.jenkins.plugins.bitbucket.OriginPullRequestDiscoveryTrait' {
            strategyId(1)
        }
        // Clone repositories using SSH
        traits << 'com.cloudbees.jenkins.plugins.bitbucket.SSHCheckoutTrait' {
            credentialsId('my-key')
        }
    }
}

@tomaszsek

I suggest:

  • create the my-bitbucket-team job through the seed job
  • run the my-bitbucket-team job from a Groovy script
  • wait for my-bitbucket-team to complete
  • run the Jenkins.instance.reload() Groovy script

blucas commented Sep 16, 2019

> create the my-bitbucket-team job through the seed job

That's what we do (see the DSL above).

> run the my-bitbucket-team job from a Groovy script

I'm not sure what you mean by this. The seed job automatically runs it the first time it gets created.

> wait for my-bitbucket-team to complete

This is what causes nextBuildNumber to reset.

> run the Jenkins.instance.reload() Groovy script

This does nothing other than pick up the wrong build number, because the previous step has reset it.

For example, let's say My Bitbucket Team has the following repositories in it:

  • repo-1
  • repo-2
  • repo-3

Each repo has a master branch with a Jenkinsfile in it.

  1. Install Operator and Jenkins CR with a seed job which picks up the DSL I mentioned previously.
  2. Seed-agent will create the Organization Folder job 'my-bitbucket-team' and trigger a "Scan Organization Folder" operation
  3. Scan will pick up the repos above and create my-bitbucket-team/repo-N folders in Jenkins
  4. Each repo-N folder will trigger a "Multibranch Pipeline Scan" to detect all branches with a Jenkinsfile
  5. The scan will generate a 'master' pipeline job for each repo-N
  6. The scan will trigger a build of each of the jobs in step 5.
  7. Suppose time passes, and repo-1/master job has 5 builds, repo-2/master has 6 builds and repo-3/master has 10 builds. (confirm builds are in backup file)
  8. Trigger a restore (for example install a plugin on the Jenkins CR yaml file).
  9. Jenkins Operator forces a new Jenkins Master to be created and restores backup file.
  10. Login to Jenkins.

You will see that the seed-job has been triggered again and created the 'my-bitbucket-team' Organization Folder. Click on that folder and you'll be presented with a page similar to the one below. If the restore had worked properly, it would list repo-1, repo-2 and repo-3 instead and if you click on them, they would list a master job for each of those repos. Each master job would list each of its builds 5, 6, and 10 respectively.
[Screenshot: Screen Shot 2019-09-16 at 11 22 05 AM]

Check the jenkins-master disk and you will see the jobs and their respective builds. You can go to http://<jenkins-url>:<port>/script and run Jenkins.instance.getItemByFullName("my-bitbucket-team/repo-1/master").getNextBuildNumber(); it will fail because, in memory, Jenkins doesn't know about that job.

If you trigger a reload from disk (Jenkins.instance.reload()), the UI still displays the "This folder is empty" page and the script above still fails. Now, if you trigger a "Scan Organization Folder Now" (see screenshot), Jenkins will re-scan Bitbucket, find all those repos/branches, and re-trigger builds on those branch jobs. BUT the problem with this is that Jenkins assumes it's the first build of these jobs and resets nextBuildNumber on disk. Jenkins will then silently (unless you look at the Jenkins log) fail to build these jobs, as there are already builds for them. At this point your configuration on disk is broken, because the "Scan Organization Folder Now" functionality has reset the build number on disk.

I hope this helps clear up any confusion.

tumevoiz commented Sep 26, 2019

Hi, @blucas

I apologize for the late response, but I have a fix for your problem.

Please add this to your Groovy scripts, and everything should work.

def jobName = "my-bitbucket-team"
def job = Jenkins.instance.getItem(jobName)

job.scheduleBuild()
sleep 10000
job.doReload()

If this solves your problem, please close the issue.

Cheers

blucas commented Oct 1, 2019

I haven't had time to test this out, but I don't see how this will solve the problem. The nextBuildNumber, in theory, will still be reset to 1.

@tomaszsek

Hi @blucas

job.scheduleBuild() will trigger the scan, and job.doReload() will read the nextBuildNumber from the disk (restored by the operator).

Cheers

blucas commented Oct 1, 2019

I still don't see how this will help, not to mention it feels more like a hack than a solution (the sleep NNNN).

scheduleBuild() will trigger a scan, and that scan will reset the nextBuildNumber on disk to 2.
sleep N makes me assume we have to wait for the scan to finish. This could take seconds, or hours for larger organizations.
doReload() will pick up the wrong build number on disk because of scheduleBuild().

pawelprazak commented Oct 23, 2019

I've hit a different issue with restore and multibranch jobs: the "Build with parameters" button no longer appears, and scanning/reindexing doesn't help like it used to.

In general, I've hit multiple incompatibilities between the multibranch plugin and the configuration as code and job dsl plugins.

If at all possible, I would advise reconsidering whether to use the multibranch plugin at all.

@pawelprazak pawelprazak changed the title Backup/Restore issue Backup/Restore issue with Miltibranch plugin Oct 23, 2019
@pawelprazak pawelprazak changed the title Backup/Restore issue with Miltibranch plugin Backup/Restore issue with Multibranch plugin Oct 23, 2019
pawelprazak commented Oct 23, 2019

After I excluded jobs/*/branches/*/config.xml I could see the "Build with parameters" button (after reindexing).

And now I can confirm @blucas 's issue, I was able to reproduce it.

#104 (comment) won't work because the jobs won't be loaded; in the best-case scenario it will cause a race condition.

@pawelprazak

On the other hand, if we start to include jobs/*/config.xml, I'm afraid it would cause even more problems with the "rotten state" that the whole "immutable configuration as code" thing we have going on in the operator tries to solve.

I wonder whether there are any documented issues regarding the job dsl plugin and job state backups...

@pawelprazak pawelprazak added bug Something isn't working enhancement New feature or request labels Oct 24, 2019
@pawelprazak

I've been experimenting with a custom backup/restore provider:
https://jenkinsci.github.io/kubernetes-operator/docs/getting-started/latest/custom-backup-and-restore/

Maybe you could just overwrite the backup script in a similar fashion to what I did in the above example?

Also, I've talked about this issue with @tomaszsek and he came up with a nice workaround: detect whether the job has a branches directory and, if it does, do not exclude its jobs/*/config.xml.

This would limit any problems with stale state to the Multibranch Jobs (that are already problematic) and not cause any additional damage to other types of jobs.
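
The branches-directory check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was eventually merged: the function names are hypothetical, the backup is assumed to be a tar archive of the jobs directory, only top-level jobs are inspected (repositories nested inside an Organization Folder are not handled), and job names are assumed to contain no glob characters.

```shell
#!/bin/sh
# Hypothetical sketch of "keep config.xml only for jobs that have a branches directory".

# Write one exclude pattern per top-level job that is NOT multibranch.
#   $1: Jenkins home directory
#   $2: file that receives the exclude patterns
write_exclude_file() {
    jenkins_home="$1"
    exclude_file="$2"
    : > "$exclude_file"
    for job_dir in "$jenkins_home"/jobs/*/; do
        [ -d "$job_dir" ] || continue          # glob matched nothing
        job_name=$(basename "$job_dir")
        if [ ! -d "${job_dir}branches" ]; then
            # No branches directory: treat as a regular job and drop its config.xml.
            printf 'jobs/%s/config.xml\n' "$job_name" >> "$exclude_file"
        fi
    done
}

# Archive the jobs directory, excluding config.xml of non-multibranch jobs only.
#   $1: Jenkins home directory
#   $2: path of the archive to create
backup_jobs_selective() {
    jenkins_home="$1"
    archive="$2"
    exclude_file=$(mktemp)
    write_exclude_file "$jenkins_home" "$exclude_file"
    tar -C "$jenkins_home" --exclude-from="$exclude_file" -czf "$archive" jobs
    rm -f "$exclude_file"
}
```

Generating the exclude list per job, rather than using one wildcard pattern, keeps the immutable-configuration behaviour for ordinary jobs while letting multibranch jobs keep the config.xml they need on restore.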

pawelprazak commented Oct 29, 2019

I've got it to work:

I also have a job like this, just in case I need to re-index all:

import jenkins.branch.*

pipeline {
    agent none // master
    stages {
        stage('Reindex') {
            steps {
                script {
                    for (project in Jenkins.instance.getAllItems(jenkins.branch.MultiBranchProject.class)) {
                        stage(project.getName()) {
                            project.getComputation().run() // force reindexing
                        }
                    }
                    stage("Reload") {
                        Jenkins.instance.reload()
                    }
                }
            }
        }
    }
}

dee-kryvenko commented Nov 7, 2019

To wait for the scan to complete:

    def job = Jenkins.instance.getItem(jobName)
    def scan = job.scheduleBuild2(0) // 0 = quiet period in seconds, i.e. queue the scan immediately
    scan.getFuture().get()           // block until the scan has finished
    job.doReload()                   // reload the item from disk

https://javadoc.jenkins-ci.org/hudson/model/AbstractProject.html#scheduleBuild2-int-

@pawelprazak

Thank you @llibicpep for the hint.

The re-indexing pipeline after those changes:

import jenkins.branch.*

pipeline {
    agent none // master
    stages {
        stage('Reindex') {
            steps {
                script {
                    for (project in Jenkins.instance.getAllItems(jenkins.branch.MultiBranchProject.class)) {
                        stage(project.getName()) {
                            def scan = project.scheduleBuild2(0 /* quiet period */) // force reindexing
                            scan.getFuture().get()
                            project.doReload()
                        }
                    }
                }
            }
        }
    }
}

You also need script approvals for:

staticMethod jenkins.model.Jenkins getInstance
method hudson.model.ItemGroup getAllItems java.lang.Class
method hudson.model.Item getName
method com.cloudbees.hudson.plugins.folder.computed.ComputedFolder scheduleBuild2 int hudson.model.Action[]
method hudson.model.Queue$Item getFuture
method java.util.concurrent.Future get
method hudson.model.AbstractItem doReload

@pawelprazak

As reported in #210, when restoring a multibranch backup, the build info is missing and Jenkins job "run" silently fails with an exception in logs:

Nov 25, 2019 9:00:08 AM SEVERE hudson.model.Executor run
Executor #-1 for master: Unexpected executor death
java.lang.IllegalStateException: JENKINS-23152: /var/jenkins/home/jobs/xxxx/jobs/yyyy/branches/master/builds/3 already existed; will not overwrite with xxxx/yyyy/master #3
	at hudson.model.RunMap.put(RunMap.java:189)
	at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:182)
Caused: java.lang.Error
	at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:190)
	at jenkins.model.ParameterizedJobMixIn$ParameterizedJob.createExecutable(ParameterizedJobMixIn.java:511)
	at jenkins.model.ParameterizedJobMixIn$ParameterizedJob.createExecutable(ParameterizedJobMixIn.java:321)
	at hudson.model.Executor$1.call(Executor.java:365)
	at hudson.model.Executor$1.call(Executor.java:347)
	at hudson.model.Queue._withLock(Queue.java:1438)
	at hudson.model.Queue.withLock(Queue.java:1299)
	at hudson.model.Executor.run(Executor.java:347)

backup.sh must be patched as in #211 for the workaround to work.

Further investigation on how we can work around multi-branch issues with the operator itself is necessary.
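
To spot affected jobs before Jenkins hits the JENKINS-23152 exception, the restored disk state can be checked directly. The sketch below is a hypothetical helper, not part of the operator: it assumes the usual job layout seen in the stack trace above (a nextBuildNumber file next to a builds directory with numerically named entries) and reports every job whose nextBuildNumber is not higher than a build that already exists.

```shell
#!/bin/sh
# Hypothetical diagnostic: list jobs whose nextBuildNumber collides with existing builds.

# Print "<job dir> next=<n> highest=<m>" for every job under $1/jobs where
# nextBuildNumber is less than or equal to the highest existing build number.
#   $1: Jenkins home directory
find_stale_build_numbers() {
    jenkins_home="$1"
    find "$jenkins_home/jobs" -type f -name nextBuildNumber | while IFS= read -r file; do
        job_dir=$(dirname "$file")
        next=$(tr -cd '0-9' < "$file")         # keep digits only
        highest=0
        for build in "$job_dir"/builds/*; do
            n=$(basename "$build")
            case "$n" in
                ''|*[!0-9]*) continue ;;       # skip non-numeric entries
            esac
            if [ "$n" -gt "$highest" ]; then
                highest="$n"
            fi
        done
        if [ "${next:-0}" -le "$highest" ]; then
            printf '%s next=%s highest=%s\n' "$job_dir" "${next:-0}" "$highest"
        fi
    done
}
```

For a job with builds 1 to 5 on disk and a nextBuildNumber of 1, the helper prints that job's directory with `next=1 highest=5`; jobs whose counter is already ahead of their builds produce no output.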

pawelprazak commented Nov 26, 2019

#211 as it stands has implications for every job, not only multi-branch ones, and it breaks immutability in ways we cannot easily predict.

TBH we probably won't have enough bandwidth to make a proper PR for this multi-branch backup issue before Jenkins World, so if there is some bash haxor out there, a hint on how to approach it would be nice :)

@tumevoiz

Hi @blucas, did you solve your problem?

@pbecotte

Have been trying to give this a shot. Builds not getting scheduled after every deploy (until I push "build now" enough times to get the current build number up past the last build number in the history) is kind of a deal breaker. I can't imagine there is anyone actually NOT using multibranch builds with Jenkins? Is there a workaround that I am just missing?

@tomaszsek

@pbecotte There is a WIP pull request, #211, which should fix the issue. The main goal is to include jobs/*/config.xml files in the backup if the job is of type multibranch.

@tomaszsek

Fixed in v0.0.8 of the PVC backup provider.

agnewp commented Dec 28, 2020

This issue should be re-opened. This 'fix' does not work for me for the most basic example of a multi-branch pipeline job; specifically, the job doesn't show up after restoring. Furthermore, the fix assumes that multi-branch pipeline jobs are the only thing that has a top-level config.xml. This is not the case at all: folders and org folders fall into this category as well, and this approach will exclude them from backup and restore too.

agnewp commented Dec 28, 2020

I have a work-around:

  1. Disable the operator's backup/restore mechanism
  2. Map the backup volume directly into your Jenkins pod on /backup
  3. Install the thinbackup plugin
  4. Use this Groovy script in your CASC config map:
1-thin-backup-restore-latest.groovy: |
    import org.jvnet.hudson.plugins.thinbackup.ThinBackupPluginImpl
    import org.jvnet.hudson.plugins.thinbackup.utils.Utils
    import org.jvnet.hudson.plugins.thinbackup.ThinBackupPeriodicWork.BackupType
    import org.jvnet.hudson.plugins.thinbackup.backup.BackupSet
    import org.jvnet.hudson.plugins.thinbackup.restore.HudsonRestore
    import java.util.logging.Logger
    import java.text.ParseException
    import java.text.SimpleDateFormat
    import java.io.File

    final Logger LOGGER = Logger.getLogger("hudson.plugins.thinbackup");

    Date getLatestFullBackupDate(String rootDirectory) {
        final List<File> fullBackups = Utils.getBackupTypeDirectories(new File(rootDirectory), BackupType.FULL);
        if ((fullBackups == null) || (fullBackups.isEmpty())) {
          return null;
        }

        Date result = new Date(0);
        for (final File fullBackup : fullBackups) {
          final Date tmp = Utils.getDateFromBackupDirectory(fullBackup);
          if (tmp != null) {
            if (tmp.after(result)) {
              result = tmp;
            }
          } //else {
          // LOGGER.log(Level.INFO, "Cannot parse directory name ' {0} ', thus ignoring it when getting latest backup date.",
          //     fullBackup.getName());
        // }
        }

        return result;
    }

    void doRestore(Date restoreFromDate, Logger LOGGER) {
        LOGGER.info("Starting restore operation (${restoreFromDate}).");

        final Jenkins jenkins = Jenkins.getInstance();
        if (jenkins == null) {
          return;
        }

        jenkins.doQuietDown();
        LOGGER.fine("Waiting until executors are idle to perform restore...");
        Utils.waitUntilIdle();

        try {
          final File hudsonHome = jenkins.getRootDir();

          final HudsonRestore hudsonRestore = new HudsonRestore(hudsonHome, ThinBackupPluginImpl.getInstance()
              .getExpandedBackupPath(), restoreFromDate, false, false);
          hudsonRestore.restore();

          LOGGER.info("Restore finished.");
        } catch (ParseException e) {
          LOGGER.severe("Cannot parse restore option. Aborting.");
        } catch (final Exception ise) {
          LOGGER.severe("Could not restore. Aborting.");
        } finally {
          jenkins.doCancelQuietDown();
        }
    }

    thinBackup = ThinBackupPluginImpl.getInstance()
    thinBackup.setBackupPath("/backup")

    latestDate = getLatestFullBackupDate("/backup")
    if (latestDate != null) {
      //run restore job
      doRestore(latestDate, LOGGER) 
    } else {
      LOGGER.info("No full backups found in backup directory");
    }

    thinBackup.setFullBackupSchedule("0 */1 * * *")
    thinBackup.setDiffBackupSchedule("*/5 * * * *")
    thinBackup.setNrMaxStoredFull(1000)
    //thinBackup.setNrMaxStoredFullAsString()
    thinBackup.setExcludedFilesRegex("")
    thinBackup.setWaitForIdle(true)
    thinBackup.setForceQuietModeTimeout(120)
    thinBackup.setBackupBuildResults(true)
    thinBackup.setBackupBuildArchive(false)
    thinBackup.setBackupBuildsToKeepOnly(false)
    thinBackup.setBackupUserContents(false)
    thinBackup.setBackupNextBuildNumber(false)
    thinBackup.setBackupPluginArchives(false)
    thinBackup.setBackupAdditionalFiles(false)
    thinBackup.setBackupAdditionalFilesRegex("")
    thinBackup.setCleanupDiff(true)
    thinBackup.setMoveOldBackupsToZipFile(false)

This approach has problems around credentials, and it needs a (user-initiated) reload after the Jenkins pod is restored. I would much rather use a more reliable mechanism external to Jenkins for backup and restore. My thought here is to look at the source of thinbackup and copy its backup method directly into the backup script you guys have started.

@pbecotte

I am not sure what the difference is, but this did fix it for us using GitHub organization folders. We have had quite a bit of pain from the persistence options in this system, but our jobs do come back after restore now.

agnewp commented Jan 6, 2021

I came to a realization last night that Jenkins operator on purpose does NOT restore the job configs; rather, it depends on having Jenkins seed jobs in place to fully restore all the job configurations back to their original state. I believe this is what I'm missing from my setup. Is this a correct thought? Is everyone else using seed jobs on startup to fully restore job configs?

pbecotte commented Jan 6, 2021

Yes, this system is designed to provide basically immutable configuration: the whole thing is reverted to the state from the CRD on every restart.

I create my jobs with config like this (tons of detail left out!)

configurations:
  10-job-compute-config.yaml: |
      jobs:
          - script: >-
                organizationFolder('Compute-Config') {
                    description('Jenkins jobs from the Compute-Config Github organization')

                    organizations {
                       github {repoOwner('Compute-Config')}
                    }
...............

Everything is in my CRD. Secrets, org-folder jobs, RBAC, ldap, global libraries... everything.

Whether that is a good thing I am still open on- getting the manifests working right is frequently painful, and it means your instance will just not start up after a reboot some percentage of the time when some plugin that you didn't have pinned to a specific version breaks. But having the manifest get out of sync isn't the best either...

agnewp commented Jan 6, 2021

Okay, then your backup/restore pod is really only meant to restore things like the job history, which is why you exclude all job config.xml files. The jobs are meant to be re-seeded from scripts called by the operator after a new pod is created. I think I missed this point in the architecture description of how this is supposed to work. I am pretty new to the operational side of Jenkins, but it might help to spell this point out a little better on the documentation website. Thanks for the clarification!
