Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Travis: Rewrite config, build and test also on Alpine Linux (musl libc) #1255

Merged
merged 4 commits into from
Aug 2, 2017
Merged

Conversation

jirutka
Copy link
Contributor

@jirutka jirutka commented Jul 28, 2017

  1. Refactores .travis.yml to use Travis Build Stages. This allows to define jobs with very different configuration in more clean way.
  2. Replaces sudo apt-get with APT addon. Since sudo is not needed anymore, it runs on Travis containers-based infrastructure that is much faster than their VMs infrastructure (used when sudo is needed).
  3. Adds jobs that build and test OpenBLAS on Alpine Linux v3.6 (inside chroot), i.e. verify that it works with musl libc. I hope that this will help to avoid errors like Building v0.2.20 fails on non-glibc systems #1252 in future.

You’ve been still running on Ubuntu Presty builders, but new default is Trusty (read more). Thus I‘ve explicitly set dist: presty to let it stay on Presty, to not change build environment by this commit.

Alpine jobs needs sudo (for chroot), so they run on VMs infrastructure. That’s why they are much slower than other jobs.

@jirutka
Copy link
Contributor Author

jirutka commented Jul 28, 2017

Travis killed one job due to too long output, because gcc generates tons of warnings on your code… (EDIT: I was wrong, it actually segfaulted)

@brada4
Copy link
Contributor

brada4 commented Jul 28, 2017

How much warnings are too much? Will halving their volume help?

@martin-frbg
Copy link
Collaborator

Jobs being killed because of overly long output has not been a problem with the existing travis setup as far as I know - is this some consequence of your build environment (different compiler version etc) or did you use additional CFLAGS ? BTW the oversight from #1252 would easily have been caught with a judiciuosly placed #undef __GLIBC_PREREQ, in my opinion the main problem was that a misunderstanding between xianyi and me led to it being merged right before the release when I wanted to have it on hold until later.

@jirutka
Copy link
Contributor Author

jirutka commented Jul 28, 2017

How much warnings are too much? Will halving their volume help?

Eh, any warnings? gcc doesn’t print them just for fun… But about Travis, it has limit 4 MiB for output on stdout+stderr per job.

The right solution is to fix code, possible workaround is to turn off some of these warnings.

Jobs being killed because of overly long output has not been a problem with the existing travis setup as far as I know - is this some consequence of your build environment (different compiler version etc) or did you use additional CFLAGS ?

When you look at some of your past jobs, for example this one, you’re already on the edge of the limit:

This log is too long to be displayed. Please reduce the verbosity of your build or download the raw log.

Preparing Alpine environment adds some more text into the log and it just flipped over the limit (EDIT: no, it segfaulted) (on Alpine job with USE_OPENMP=1). You will run into this issue sooner or later even with your existing jobs.

BTW the oversight from #1252 would easily have been caught with a judiciuosly placed #undef __GLIBC_PREREQ…

Of course, as all possible errors… It happened and got into the release without notice because you don’t regularly test against musl libc. That’s why I opened this PR to prevent such mistakes in future.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 28, 2017

From a quick look at the failed job, a big number of them appear to be the recently introduced gcc indentation warnings, complaints about potentially uninitialized variables that would need to be addressed in netlib lapack and a bunch of harmless pointer type issues that I believe may even be platform-specifc. Suggest to turn off at least the indentation warnings for now.

@7heo
Copy link

7heo commented Jul 28, 2017

"[...] harmless pointer [...]" — @martin-frbg

Best joke ever. 🤣 Thanks for making my Friday mate.

@martin-frbg
Copy link
Collaborator

@7heo care to prove otherwise ? These superficially mismatching casts have been there forever but there are real problems elsewhere in the code that need fixing.

@7heo
Copy link

7heo commented Jul 28, 2017

@martin-frbg memory management is tricky and shouldn't be dealth with lightly. the least you can do is cast your stuff explicitely so it reads what it actually does.

@jirutka
Copy link
Contributor Author

jirutka commented Jul 28, 2017

I was wrong, the failing job (musl, USE_OPENMP=1) hasn’t been killed for too long output, it segfaulted in tests.*

https://travis-ci.org/xianyi/OpenBLAS/jobs/258570695#L6743:

...
 cblas_ctrmm  PASSED THE TESTS OF ERROR-EXITS
 cblas_ctrmm  PASSED THE COLUMN-MAJOR COMPUTATIONAL TESTS (  2592 CALLS)

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f53526ef3ef in ???
#1  0x7f53526ee9e5 in ???
#2  0x7f5352a00f85 in ???
Segmentation fault (core dumped)
make[1]: *** [Makefile:79: all3] Error 139
make: *** [Makefile:119: tests] Error 2

* Travis marks failed job with a cross when some command in script exited with non-zero status or with exclamation mark when some error occurred. Since the log output exceeded size for being displayed on web page (“This log is too long to be displayed.”), I just quickly assumed that it exceeded 4MiB limit.

@jirutka
Copy link
Contributor Author

jirutka commented Jul 28, 2017

We set USE_OPENMP=0 in Alpine package due to gcc bug 60035. I don’t know if it’s still relevant though. I’m gonna mark this particular job as allowed to fail, I’m not really keen to investigate what exactly is wrong.

@7heo
Copy link

7heo commented Jul 28, 2017

I'm gonna quote Natanael Copa's FOSDEM presentation (first slide):

The good things with musl

[...]

  • musl tries to be (annoyingly) correct
  • when patch is needed, code gets better

[...]

@jirutka
Copy link
Contributor Author

jirutka commented Jul 28, 2017

Your existing job with USE_OPENMP=1 sometimes fails even on Ubuntu Precise, but with different error, see https://travis-ci.org/xianyi/OpenBLAS/jobs/258662730. It may be related to the change from VM to container environment (read #1255 (comment)); you can switch it back to VM (i.e. exactly the same environment as you currently have) with sudo: true on global or job level. However, it IMHO just confirms that there’s surely something wrong in the code…

I’ve also added job with NO_AFFINITY=1 which reveals bug #1252 and confirms that it’s not fixed in develop branch (#1249).

The job building/testing OpenBLAS with USE_OPENMP=1 in Alpine environment (i.e. musl libc) is marked as “allowed to fail”, i.e. it’s failure does not affect result of whole build (you can see it here).

I consider this PR finished and ready to merge. CI failure on this PR is not a problem of this PR, but existing problems.

@martin-frbg
Copy link
Collaborator

ctrmm issue may be the same issue as #1089 (worked around in #1254), at least that test triggers the same debug printf (in gbmv_thread.c) in my working copy. Will retest when I am more awake.

@brada4
Copy link
Contributor

brada4 commented Jul 28, 2017

@jirutka - your finger-pointing does not help. It is your code and you are a developer by modifying source files.
trmm test happens to be last in row, corruption can happen anywhere before. blank backtrace is of no value, even if coming from semantically picky library

@jirutka
Copy link
Contributor Author

jirutka commented Jul 29, 2017

@brada4 What finger-pointing? What are you talking about? I'm not OpenBLAS developer, I've only contributed this PR, in a hope to prevent future breakages of compatibility with non-glibc systems.

@brada4
Copy link
Contributor

brada4 commented Jul 29, 2017

musl does not support complex numbers. Plese be so kind and generate backtrace with function labels or work around musl shortage yourself.

@martin-frbg
Copy link
Collaborator

@brada4 by all means let's look into that, but as far as I know OpenBLAS will work around lack of support for complex numbers by creating structs etc (which is where I believe the pointer warnings are from). However the ctrmm issue appears to be real and not limited to musl - will be interesting to learn if #1254 fixes it.

@brada4
Copy link
Contributor

brada4 commented Jul 29, 2017

Slightly worse - http://wiki.musl-libc.org/wiki/Mathematical_Library#Complex - probably needs some ifdef wrapper to force struct complex.

@xianyi
Copy link
Collaborator

xianyi commented Aug 1, 2017

Hi all,

Because I access travis-ci very slow, I cannot see the full log in travis-ci. I see some error case in travis-ci. Is it expected?

@martin-frbg
Copy link
Collaborator

One is "ld" getting killed with an actual SIGKILL - this is probably what jirutka attributed to travis aborting when the log gets too big (tons of warnings about implicit float/double conversions in LAPACK etc). The next is "my" cgroups patch breakage (musl does not define __GLIBC_PREREQ) so should hopefully go away with #1257. Guess I should take the courage to merge that now, and re-basing this PR afterwards would probably be sufficient to trigger a re-run of the travis checks. (With the xBMV overrun fix already merged, there should be no test failures remaining)

@martin-frbg
Copy link
Collaborator

Hmm. Still getting segmentation faults in the run you had marked as "allowed to fail" despite #1254, wonder if it would be possible to generate a backtrace for that.
All activity on the gcc/libgomp bug ticket you mentioned seems to have ceased in late 2014 - I wonder if
the pitfalls of the combination of fork and OMP in a gcc-built program should be documented in the wiki somewhere ?

@martin-frbg
Copy link
Collaborator

@jirutka re-enable the "allow_failures" if you like - I'm not sure if the commit would end up getting credited to me if I was the last developer to meddle with it. It seems I botched at least parts of the xbmv fix anyway, introducing accuracy errors.Perhaps I need to stop coding for a while until things settle down for me. At least the GLIBC_PREREQ problem seems to be gone now.

@jirutka
Copy link
Contributor Author

jirutka commented Aug 1, 2017

Because I access travis-ci very slow, I cannot see the full log in travis-ci.

Travis is very slow in rendering very long build log. However, you can view it also in raw (as plain text), that’s fast. Just click on “Raw log” above the rendered log (on job page).

If you can’t get there due to slow JS, you can also access it directly. On build page, look at link of particular job, e.g. https://travis-ci.org/xianyi/OpenBLAS/jobs/259694421. Then download https://api.travis-ci.org/jobs/<ID>/log.txt?deansi=true with curl or web browser, e.g. https://api.travis-ci.org/jobs/259694421/log.txt?deansi=true.

jirutka added 4 commits August 1, 2017 21:52
Using APT addon has nice side-effect - you don't need sudo anymore, so
it can run on Travis containers-based infrastructure that is much faster
than their VMs infrastructure (used when sudo is needed).

You've been still running on Ubuntu Presty builders, but new default is
Trusty. Thus I've explicitly set `dist: presty` to let it stay on
Presty, to not change build environment by this commit.
Alpine jobs needs sudo (for chroot), so they run on VMs infrastructure.
That's why they are much slower than other jobs.
@jirutka
Copy link
Contributor Author

jirutka commented Aug 1, 2017

wonder if it would be possible to generate a backtrace for that.

I’ve prepared you a VM with Alpine and prepared setup for building OpenBLAS, but GitHub doesn’t return any SSH keys for your account https://github.com/martin-frbg.keys. I don’t understand why, this should work for all accounts. o.O

…re-enable the "allow_failures" if you like…

I’ve removed your commit disabling allow_failures and rebased this PR against the current develop branch. All jobs except LINUX64_MUSL USE_OPENMP=1 (the one “allowed to fail”) now pass!

BTW It didn’t make sense to disable it, the point of allow_failures is that Travis runs the listed jobs, correctly report the job status, just doesn’t mark whole build as failed when some of allow_failures job fail.

@martin-frbg
Copy link
Collaborator

  1. probably because I do not have a local repository linked to my github account, strange as it may sound I still upload changed files in the browser or use the primitive editor on github for changes.
  2. thought as much but just for triggering a rerun of travis this change seemed as good as any (and whitespace changes or deleting and restoring a character did not register for good reason)
    (3 seems my panic revert of 1254 was overkill, as it looks now it is a single mistake in gbmv_thread.c
    that caused the sudden "loss of precision" errors. My plan for now is to merge your PR tomorrow (if nobody complains), then push the fixed 1254PR and see if it makes the "allowed failure" case pass.

@martin-frbg martin-frbg merged commit 53aee36 into OpenMathLib:develop Aug 2, 2017
@martin-frbg
Copy link
Collaborator

martin-frbg commented Aug 2, 2017

Key added now, thanks for setting up the VM.
The "allowed failure" in cblas_ctrmm appears to be unrelated to the gbmv errors, helgrind shows lots of thread races apparently originating from missing locks in blas_server_omp. Unfortunately valgrind stumbles over the inline assembly when building for HASWELL, a NEHALEM build raises no issues (on a
glibc system) that would explain an eventual segfault. Edit: helgrind is known to be confused by libgomp, so that one seems to be a red herring as well.

@jirutka
Copy link
Contributor Author

jirutka commented Aug 6, 2017

Hm, I can’t reproduce the error in LINUX64_MUSL USE_OPENMP=1 ctest on the VM or my building machine. It seems that it’s somehow related to Travis environment, maybe kernel version or hardware?

@jirutka jirutka deleted the travis branch August 6, 2017 09:07
@brada4
Copy link
Contributor

brada4 commented Aug 6, 2017

Was it chroot inside Ubuntu LTS or ???

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants