infra: spring software updates #1222

Closed
refack opened this issue Apr 9, 2018 · 15 comments

@refack
Contributor

refack commented Apr 9, 2018

I'm not sure how to coordinate this, but IMHO we should do some systematic updates to the software on our infra. I'm referring to peripheral software such as Java, slave.jar and git (not the OS or compilers).
Besides minimizing potential bit-rot, and making us feel better in general, I have an intuition that it is already causing failures and blocking process improvements. For example:

  1. Jenkins java.io.IOException: remote file operation failed #173 (comment): Jenkins failing to communicate with workers, which might be related to a stale slave.jar:
    on failing machines there was an old agent running
    image
    after restart it bumps to:
    image
    "Log" show this warning:
    image

  2. Old git (I mean 1.8 when the latest is 2.17) doesn't handle sparse checkouts, which degrades the overall performance of the cluster:
    image

  3. My estimate is that some platforms also run an outdated sshd with potential security issues (we should also disable plain-text password login where possible, RE: aix / drive overflowing #866)
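
To find which workers carry an outdated git, one option is to compare the version each machine reports against a minimum. This is a minimal sketch in Python, assuming the usual `git version X.Y.Z` output of `git --version`. The helper names and the 2.17 threshold are illustrative assumptions, not an agreed policy.

```python
import re
import subprocess


def parse_git_version(output):
    """Extract a (major, minor, patch) tuple from `git --version` output."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("unrecognized git version output: %r" % output)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_outdated(output, minimum=(2, 17, 0)):
    """Return True when the reported version is older than `minimum`.

    The default minimum is only an example threshold (assumption).
    """
    return parse_git_version(output) < minimum


def local_git_is_outdated(minimum=(2, 17, 0)):
    """Run `git --version` on this machine and compare it to `minimum`."""
    out = subprocess.check_output(["git", "--version"]).decode()
    return is_outdated(out, minimum)
```

Running `local_git_is_outdated()` on each worker (for example over ssh) would give a list of machines that need attention. The same pattern could be adapted for Java or sshd, each with its own version string format.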


Since I now have time for such tasks, I'm seeking feedback / pitfalls / warnings. And also ideas on how to coordinate such efforts (RE @gibfahn and the Java8 project).

@refack
Contributor Author

refack commented Apr 10, 2018

A computer similar to my above example just popped (test-digitalocean-ubuntu1604-x86-1 went into a remoting perma-fail):
image
Solved by updating slave.jar

@gibfahn
Member

gibfahn commented Apr 10, 2018

> I'm not sure how to coordinate this, but IMHO we should do some systematic updates to the software on our infra.

Sounds like a great idea to me.

> Besides minimizing potential bit-rot, and making us feel better in general, I have an intuition that it is already causing failures and blocking process improvements. For example:
> Since I now have time for such tasks, I'm seeking feedback / pitfalls / warnings. And also ideas on how to coordinate such efforts (RE @gibfahn and the Java8 project).

In my opinion the biggest source of bitrot is that we don't run our Ansible scripts on the machines regularly, so we can't trust that they'll work on the machines (and so we just update things manually because we don't have time, etc.)

My ideal update scenario is a weekly job that runs the scripts against all the machines.

If that is how we want to progress, the first step is to document the list of machines we can't use Ansible on (@rvagg has more info here), either because the scripts haven't been implemented/ported yet, or because the machines can't be updated as it will break custom things we've done to them.
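
A documented exclusion list could feed the weekly job directly. This is a rough sketch in Python, assuming the job is driven by `ansible-playbook` with its `--limit` option. The function names, the shape of the exclusion list, and the reasons in it are hypothetical.

```python
def plan_weekly_run(inventory, excluded):
    """Split hosts into those the scheduled run may touch and those it must skip.

    `inventory` is an iterable of host names; `excluded` maps a host name to
    the documented reason it cannot be managed by Ansible yet.
    """
    runnable = sorted(h for h in inventory if h not in excluded)
    skipped = {h: excluded[h] for h in sorted(inventory) if h in excluded}
    return runnable, skipped


def limit_argument(runnable):
    """Build the value for `ansible-playbook --limit` (comma-separated hosts)."""
    return ",".join(runnable)


if __name__ == "__main__":
    # Host names are taken from this thread; the exclusion reason is made up.
    inventory = [
        "test-digitalocean-ubuntu1604-x86-1",
        "test-joyent-smartos16-x64-2",
    ]
    excluded = {"test-joyent-smartos16-x64-2": "scripts not ported yet (example)"}
    runnable, skipped = plan_weekly_run(inventory, excluded)
    print("--limit", limit_argument(runnable))
    for host, reason in skipped.items():
        print("skipping", host, "because", reason)
```

Keeping the reason next to each excluded host means the weekly log doubles as the documentation of what still needs porting.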

@rvagg
Member

rvagg commented Apr 11, 2018

As per today's meeting, "Error fetching remote repo" errors seem to be fixed by upgrading git on the machines. I did a bunch of that yesterday in #1224. CentOS5 was done about a month ago by manually compiling git (the doc for that is in this repo) and CentOS6 was done yesterday in #1223.

The other error relates to git, but the stacktrace suggests it has more to do with the remote call mechanism of Jenkins. We've had these errors for a long time and they seem to have been solved variously by: restarting jenkins, restarting machines, clearing workspaces, upgrading slave.jar and upgrading java.

As I've already mentioned, I can't solve the error on one of the two smartos16 machines, so I took it offline this week: https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/. The only thing I haven't tried is changing the Java version that is used, but I'm not sure I can even do that on SmartOS.

@rvagg
Member

rvagg commented Apr 11, 2018

Oh and re updating slave.jar, I'd be happy to see that done as part of the init/upstart/systemd scripting. It used to be built into start.sh on the Raspberry Pis and a bunch of other machines, but we've stripped that out of most builds. That requires a bit of work of course, but it wouldn't be hard to deploy.

btw there is also ansible/playbooks/jenkins/worker/upgrade-jar.yml that you could try using. I haven't used it myself, but it's worth playing with because it could be run across most of our infra.
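
For the start-script approach, the refresh step could look roughly like the Python sketch below. It is not what start.sh or upgrade-jar.yml actually do. It assumes the Jenkins master serves the agent jar at `/jnlpJars/slave.jar`, and the function names and the injectable `fetch` parameter are hypothetical. The jar is written to a temporary file first, so a failed download leaves the existing slave.jar untouched.

```python
import os
import tempfile
import urllib.request


def agent_jar_url(jenkins_url):
    """Build the download URL (assumes the master exposes /jnlpJars/slave.jar)."""
    return jenkins_url.rstrip("/") + "/jnlpJars/slave.jar"


def refresh_agent_jar(jenkins_url, destination, fetch=None):
    """Download the agent jar and atomically replace the local copy.

    `fetch` is a hypothetical hook taking a URL and returning bytes; by
    default it performs a plain HTTP(S) GET.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()

    data = fetch(agent_jar_url(jenkins_url))
    if not data:
        raise ValueError("downloaded agent jar is empty")

    # Write next to the destination so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(destination))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, destination)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return destination
```

A service script would call `refresh_agent_jar("https://ci.nodejs.org", "/path/to/slave.jar")` before launching Java, and fall back to the existing jar if the call raises. The destination path here is a placeholder.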

@refack
Contributor Author

refack commented Apr 11, 2018

https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/ fixed by restarting slave.jar (smartos incantation is svcadm restart jenkins). Before doing that I checked https://ci.nodejs.org/computer/test-joyent-smartos16-x64-2/systemInfo and it still showed Unix slave, version 2.67 🤷‍♂️

Another assumption I had about the cause of the failures was related to the owner of slave.jar, and whether it should be root.root or iojs.iojs. For now on test-joyent-smartos16-x64-2 I didn't chown it, so it is still owned by root. I want to see if it makes any difference.

@joaocgreis
Member

> Oh and re updating slave.jar, I'd be happy to see that done as part of the init/upstart/systemd scripting.

+1, this has been part of the Windows script for a few years now and works great. The only drawback is that this is not straightforward for ci-release because it is locked, but this shouldn't stop us from doing it for test CI.

@gdams
Member

gdams commented May 7, 2018

@joaocgreis I would recommend using https://adoptopenjdk.net/ java binaries if we plan to upgrade all of our machines. There is a nice API (https://api.adoptopenjdk.net/README) detailing how it can be used. I'd be happy to work through the playbooks and switch out the java sections to use this if everyone is happy with that?

@joaocgreis
Member

@gdams we started using Oracle Java at some point because it seemed to have better performance than the Open JDK that was installed on the machines. This was noticeable on the Jenkins server, which is frequently under heavy load, and on the Raspberry Pis. However, this was only one of the things we did at the time and I'm not completely sure it was the cause of the improvement. If you feel sure about Open JDK performance, I wouldn't object to trying it again (provided @rvagg is ok with that as well).

To be clear, when I mentioned updating slave.jar above, I did not mean updating Java, only the jar file that we run in the workers.

@sxa
Member

sxa commented May 16, 2018

I wouldn't expect it to have different performance characteristics since it's fundamentally the same code. If there are scenarios in which the performance isn't the same, that would be useful for adoptopenjdk to be aware of, so I would be in favour of giving it another shot.

@keithc-ca

The Java code is mostly the same, but there are differences in VM performance. Check out [1] for some more information about OpenJDK with OpenJ9, including some performance advantages that come with the OpenJ9 VM.

[1] https://www.eclipse.org/openj9/oj9_resources.html

@gdams
Member

gdams commented May 16, 2018

Yes thanks @keithc-ca! It's worth pointing out that you can also fetch OpenJ9 binaries from AdoptOpenJDK! https://adoptopenjdk.net/releases.html?variant=openjdk8-openj9

@sxa
Member

sxa commented May 16, 2018

Yes, openjdk+openj9 will have different performance characteristics as @keithc-ca says, but the openjdk+hotspot builds from adoptopenjdk should be pretty much the same as oracle's current ones.

@rvagg
Member

rvagg commented May 17, 2018

I have no objections to switching to openjdk. I don't know if it buys us anything here, but being able to get on to Java 9 might be helpful I suppose?

@BridgeAR
Member

What's the status here? Should this stay open or is this resolved?

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
