Segfault bisected to fix for #11715 #11945

Closed
mweastwood opened this issue Jun 30, 2015 · 18 comments

Comments

@mweastwood
Contributor

As of 63e5735, running the test script at HEALPix.jl (currently unregistered) generates a segfault 100% of the time on my machine (but not on Travis, apparently).

signal (11): Segmentation fault
__pool_alloc at /scr2/mweastwood/julia/src/gc.c:1073
jl_box_int64 at /scr2/mweastwood/julia/src/alloc.c:754
jl_ptr_to_array at /scr2/mweastwood/julia/src/array.c:268
readhealpix at /home/mweastwood/.julia/HEALPix/src/map.jl:91
jl_apply_generic at /scr2/mweastwood/julia/src/gf.c:1650
anonymous at ./no file:14
jl_apply at /scr2/mweastwood/julia/src/julia.h:1299
jl_toplevel_eval_flex at /scr2/mweastwood/julia/src/toplevel.c:569
jl_load at /scr2/mweastwood/julia/src/toplevel.c:616
include at ./boot.jl:254
jl_apply_generic at /scr2/mweastwood/julia/src/gf.c:1652
include_from_node1 at ./loading.jl:133
jl_apply_generic at /scr2/mweastwood/julia/src/gf.c:1650
process_options at ./client.jl:304
_start at ./client.jl:404
unknown function (ip: 1867504297)
jl_apply_generic at /scr2/mweastwood/julia/src/gf.c:1652
unknown function (ip: 4201385)
unknown function (ip: 4200287)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 4200365)
Segmentation fault (core dumped)

A reduced test case is eluding me right now because removing seemingly unrelated pieces of code makes the segfault go away. Apparently one condition for the segfault to appear is that there needs to be enough cruft in the surrounding code. The offending line of code appears to be a call to pointer_to_array here.

Hopefully this is enough information, but let me know if I can do anything else.

$ julia -e "versioninfo()"
Julia Version 0.4.0-dev+5385
Commit 63e5735* (2015-06-15 19:25 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.5.0
@kmsquire
Member

Hi Michael, thanks for the report. It might be the case that someone steps in to help, but it definitely would be good if you can come up with and post a reduced test case.

In particular, segfaults when interfacing with external libraries can be hard to debug (I'm working through one right now in VideoIO.jl). I've sometimes found that, while a particular commit of Julia might trigger a segfault in my own code, it is often that the bug was mine and there all along, and was only uncovered by a change in Julia. (Not saying that's the case here, but it's worth considering.)

Since you bisected, I'm assuming this happens on later versions than you have listed above (such as a commit from the last few days)?

@mweastwood
Contributor Author

That's correct, this has only shown up recently. Earlier versions don't produce the segfault, later versions do.

I'll give the reduced test case another shot tomorrow (I promise I tried really hard to get one already!).

I'll also take another look at my own code.

@yuyichao
Contributor

@mweastwood I couldn't get the build script to build all the libraries. I get an error that /home/yuyichao/.julia/v0.4/HEALPix/src/../deps/downloads/Healpix_3.20/lib/libchealpix.so does not exist, and it really does not exist. Any idea how to solve this?

@yuyichao
Contributor

@carnaval

Any idea what this GC verifier error message suggests?

JL_GC_ALLOC_PRINT=0:100 JL_GC_ALLOC_POOL=917800 JL_GC_ALLOC_OTHER=0 JULIA_LOAD_PATH=${PWD}/.. ~/projects/julia/gc-debug/julia -f test/runtests.jl  
GC error (probable corruption) :
TypeName(name=Type, module=SimpleVector, names=SimpleVector, primary=Any, cache=Int32, linearcache=Bool, uid=140728650090000)

@yuyichao
Contributor

OK somehow the type is overwritten by a cfitsio function

Old value = (void *) 0x7ffdf132c2b3
New value = (void *) 0x7ffdf132c2b1
restore () at gc.c:2126
2126            for(int i = 0; i < bits_save[b].len; i++) {
(gdb) 
Continuing.
Hardware watchpoint 1: *(void**)0x7ffdf150c0e8

Old value = (void *) 0x7ffdf132c2b1
New value = (void *) 0x7ffdf132c200
0x00007ffdda13141b in ffc2s () from /usr/lib/libcfitsio.so.2
(gdb) 
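
The last change in the watchpoint output above (0x7ffdf132c2b1 becoming 0x7ffdf132c200) looks like a single zero byte landing on the low byte of the watched word. A minimal sketch of that effect, assuming a little-endian machine and a hypothetical layout in which an 8-byte slot passed to C is followed directly by an 8-byte tag word; the function name and the layout are illustrative, not taken from Julia or cfitsio:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: an 8-byte slot handed to C, immediately followed by
 * an 8-byte word standing in for the neighbouring object's type tag. */
enum { SLOT_BYTES = 8, TAG_BYTES = 8 };

/* Copy `value` plus its terminating NUL to the start of a region whose first
 * SLOT_BYTES belong to the caller and whose remaining bytes hold `tag`.
 * Returns the tag as it reads after the write. */
static uint64_t tag_after_string_write(uint64_t tag, const char *value)
{
    unsigned char region[SLOT_BYTES + TAG_BYTES];
    size_t n = strlen(value) + 1;            /* bytes written, NUL included */

    assert(n <= sizeof region);              /* keep the sketch itself in bounds */
    memset(region, 0, SLOT_BYTES);
    memcpy(region + SLOT_BYTES, &tag, TAG_BYTES);
    memcpy(region, value, n);                /* the string write done by the callee */
    memcpy(&tag, region + SLOT_BYTES, TAG_BYTES);
    return tag;
}
```

With a 7-character value the terminator still fits in the slot and the tag comes back unchanged. With an 8-character value the terminator is the ninth byte, so on a little-endian machine 0x7ffdf132c2b1 comes back as 0x7ffdf132c200.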

@yuyichao
Contributor

Found the issue.

The backtrace of the corruption is attached.

From the source code, it seems that the coordsys and ordering arguments to read_healpix_map are two output parameters that receive strings, but you are passing in two single-byte buffers (Ref), so the library writes past the valid region and corrupts Julia's memory.

I've never used any of the libraries so I don't know how you should change it.

Closing, since this is not a Julia issue.

Hardware watchpoint 2: *(void**)0x7ffdf150c0e8

Old value = (void *) 0x7ffdf132c2b1
New value = (void *) 0x7ffdf132c200
ffc2s (instr=instr@entry=0x7fffffffc650 "'C       '", 
    outstr=outstr@entry=0x7ffdf150c0e0 "C       ", 
    status=status@entry=0x7fffffffc71c) at fitscore.c:9198
9198        if (ii == len)
(gdb) bt
#0  ffc2s (instr=instr@entry=0x7fffffffc650 "'C       '", 
    outstr=outstr@entry=0x7ffdf150c0e0 "C       ", 
    status=status@entry=0x7fffffffc71c) at fitscore.c:9198
#1  0x00007ffdda1664c0 in ffgkys (fptr=<optimized out>, keyname=<optimized out>, 
    value=0x7ffdf150c0e0 "C       ", comm=<optimized out>, status=0x7fffffffc71c)
    at getkey.c:793
#2  0x00007ffdda166d53 in ffgky (fptr=<optimized out>, 
    datatype=datatype@entry=16, keyname=keyname@entry=0x7ffdda4b63f0 "COORDSYS", 
    value=value@entry=0x7ffdf150c0e0, comm=comm@entry=0x0, 
    status=status@entry=0x7fffffffc71c) at getkey.c:292
#3  0x00007ffdda4b5dbb in read_healpix_map (
    infile=0x7ffdf356b460 "/tmp/juliaBeMm1k.fits", nside=<optimized out>, 
    coordsys=0x7ffdf150c0e0 "C       ", 
    ordering=0x7ffdf150c100 "P\v\357\367\377\177") at chealpix.c:985
#4  0x00007ffff7e4d10e in julia_readhealpix_21571 (filename=<optimized out>)
    at /home/yuyichao/projects/mirrors/HEALPix.jl/src/map.jl:88
#5  0x00007ffff6aef895 in jl_apply (nargs=1, args=0x7fffffffc8b8, 
    f=<optimized out>) at julia.h:1260
#6  jl_apply_generic (F=0x7ffdf34f1c90, args=0x7fffffffc8b8, 
    nargs=<optimized out>) at gf.c:1656
#7  0x00007ffff6b41053 in jl_apply (nargs=1, args=0x7fffffffc8b8, 
    f=0x7ffdf34f1c90) at julia.h:1260
#8  do_call (f=f@entry=0x7ffdf34f1c90, args=args@entry=0x7ffdf3516938, 
    nargs=nargs@entry=1, eval0=eval0@entry=0x0, 
    locals=locals@entry=0x7fffffffd0e0, nl=nl@entry=0, ngensym=1)
    at interpreter.c:65
#9  0x00007ffff6b403e9 in eval (e=0x7ffdf34aadd0, locals=0x7fffffffd0e0, 
    nl=nl@entry=0, ngensym=ngensym@entry=1) at interpreter.c:212
#10 0x00007ffff6b4014c in eval (e=e@entry=0x7ffdf34aadb0, 
    locals=locals@entry=0x7fffffffd0e0, nl=nl@entry=0, ngensym=ngensym@entry=1)
    at interpreter.c:218
#11 0x00007ffff6b41671 in eval_body (stmts=stmts@entry=0x7ffdf354ee90, 
    locals=locals@entry=0x7fffffffd0e0, nl=nl@entry=0, ngensym=ngensym@entry=1, 
    toplevel=1, start=0) at interpreter.c:592
#12 0x00007ffff6b41a8d in jl_toplevel_eval_body (stmts=0x7ffdf354ee90)
    at interpreter.c:525
#13 0x00007ffff6b53b57 in jl_toplevel_eval_flex (e=<optimized out>, 
    fast=fast@entry=1) at toplevel.c:511
#14 0x00007ffff6b542ec in jl_toplevel_eval_flex (fast=1, e=<optimized out>)
    at toplevel.c:563
#15 jl_parse_eval_all (
    fname=fname@entry=0x7ffdf3371b40 "/home/yuyichao/projects/mirrors/HEALPix.jl/test/runtests.jl", len=<optimized out>) at toplevel.c:567
#16 0x00007ffff6b544d8 in jl_load (
    fname=0x7ffdf3371b40 "/home/yuyichao/projects/mirrors/HEALPix.jl/test/runtests.jl") at toplevel.c:607
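
The diagnosis in this comment comes down to a C routine with string output parameters and no length argument, so the caller has to reserve enough room. A hedged sketch of the caller side in C: `get_coordsys` is a hypothetical stand-in for such a routine (it is not the chealpix or cfitsio API), and `VALUE_BUF_LEN` is an assumed capacity modelled on CFITSIO's `FLEN_VALUE`, which is believed to be 71:

```c
#include <stdio.h>
#include <string.h>

/* Assumed capacity for a keyword value, modelled on CFITSIO's FLEN_VALUE. */
#define VALUE_BUF_LEN 71

/* Hypothetical stand-in for a C routine with a string output parameter.
 * It takes no length argument and trusts the caller to provide room.
 * Here it writes "C" padded with spaces to 8 characters, plus a NUL. */
static void get_coordsys(char *out)
{
    sprintf(out, "%-8s", "C");
}

/* Caller side: reserve a full-size buffer for the callee, then copy the
 * value without its trailing padding into dst (at most dst_len - 1 chars).
 * Returns the number of characters stored, excluding the terminator. */
static size_t read_coordsys(char *dst, size_t dst_len)
{
    char value[VALUE_BUF_LEN];
    size_t n;

    if (dst_len == 0)
        return 0;
    get_coordsys(value);                       /* 9 bytes written, 71 available */
    n = strlen(value);
    while (n > 0 && value[n - 1] == ' ')       /* drop the space padding */
        n--;
    if (n >= dst_len)
        n = dst_len - 1;
    memcpy(dst, value, n);
    dst[n] = '\0';
    return n;
}
```

The same idea applies on the Julia side of a ccall: the argument backing an output string needs to be a buffer at least as large as the longest value the library may write, not a one-element reference.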

@carnaval
Contributor

@yuyichao just for future reference, as you probably guessed by now, this error message appears when someone has trashed a tag. It often happens in pools with OOB stores. We may even want to enable this check in release builds, I'm not sure.

@yuyichao
Contributor

@carnaval Yeah, I knew that the error message means the tag is corrupted. I just wanted to see if you could tell anything from the value of the corrupted tag, but it's apparently unrelated.

@yuyichao
Contributor

unrelated -> unrelated to julia

@carnaval
Contributor

The weird thing that this tag was another valid julia object often happens when someone only touches the LSB of an existing thing. Unfortunately it's the only "downside" of having an almost too easy C FFI. We could have a debug mode which allocates every object with 8 or 16 bytes of canary space at the end that we would fill with randomness before every ccall and check that the ccall was honest afterward.
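
A rough sketch of the canary idea in plain C, as an illustration only: the `guarded_buf` type and its helper functions are hypothetical and do not correspond to anything in Julia's runtime. The buffer is allocated with 16 extra bytes, the extra bytes are refilled with pseudo-random values before each foreign call, and they are compared afterward:

```c
#include <stdlib.h>
#include <string.h>

enum { CANARY_BYTES = 16 };

/* Hypothetical guarded allocation: `size` payload bytes followed by a canary. */
typedef struct {
    unsigned char *data;                    /* payload handed to foreign code */
    size_t size;                            /* payload size in bytes */
    unsigned char expected[CANARY_BYTES];   /* private copy of the canary */
} guarded_buf;

/* Allocate the payload plus canary space. Returns 0 on success, -1 on failure. */
static int guarded_alloc(guarded_buf *g, size_t size)
{
    g->data = malloc(size + CANARY_BYTES);
    if (g->data == NULL)
        return -1;
    g->size = size;
    return 0;
}

/* Refill the canary with fresh pseudo-random bytes before a foreign call. */
static void guarded_arm(guarded_buf *g)
{
    for (size_t i = 0; i < CANARY_BYTES; i++)
        g->expected[i] = (unsigned char)(rand() & 0xff);
    memcpy(g->data + g->size, g->expected, CANARY_BYTES);
}

/* After the call: returns 1 if the canary is intact, 0 if it was overwritten. */
static int guarded_check(const guarded_buf *g)
{
    return memcmp(g->data + g->size, g->expected, CANARY_BYTES) == 0;
}

static void guarded_free(guarded_buf *g)
{
    free(g->data);
    g->data = NULL;
    g->size = 0;
}
```

A call site would run `guarded_arm`, make the foreign call with `g.data`, and then test `guarded_check`. One limitation of this sketch: an overrun that happens to write the same byte values as the canary goes unnoticed, and writes that land beyond the canary are not seen at all.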

@yuyichao
Contributor

> The weird thing that this tag was another valid julia object often happens when someone only touches the LSB of an existing thing.

That's exactly what happened. See the bit pattern changes above.

> We could have a debug mode which allocates every object with 8 or 16 bytes of canary space at the end that we would fill with randomness before every ccall and check that the ccall was honest afterward.

Sounds good.

@ScottPJones
Contributor

👍 for @carnaval's idea of the debug mode

@mweastwood
Contributor Author

@yuyichao That should be built by this script which is run from here. Did the build script run without errors?

I'm going to try to replace the external dependency with some dummy functions to see if I can still generate the segfault.

@mweastwood
Contributor Author

Oops, didn't hit refresh! Apologies for the noise and thanks for all your efforts!

@yuyichao
Contributor

@mweastwood And just FYI, the build error is because that stupid healpix build system doesn't want to link to a shared cfitsio library. I edited the configure script to make it work.

@mweastwood
Contributor Author

@yuyichao The healpix build system is indeed pretty awful. I'm sure you already noticed I needed to patch it in two places just to get things working on Travis.

mweastwood added a commit to mweastwood/LibHealpix.jl that referenced this issue Jun 30, 2015
@tkelman
Contributor

tkelman commented Jul 1, 2015

@carnaval sounds a little like a "ccall sanitizer"

@yuyichao
Contributor

@carnaval I've thought about the ccall sanitizer a little bit, and the implementation that requires the least (i.e. no) support from the Julia side (when writing unsafe_convert or pointer) requires the runtime to tell whether a pointer points inside a Julia object.

The issue I see is that the conversion from a Julia object to a C pointer is done in Julia code (except Any?), and it seems hard to figure out which object (or memory) the pointer belongs to. Even if we special-case a few common types (Array, Ref and the special & operator), it will still miss a lot of cases (for example, it won't work when the pointer passed to ccall comes from manually calling pointer on an array).

Another way to do this is just to let codegen hand all pointer arguments to the GC and let the GC figure out whether each one points to GC-managed memory and sanitize it when necessary. This has the additional advantage that it can be used to check whether the object holding the memory is properly rooted.

I remember you said that some code for the conservative scan is already there (to figure out whether a pointer is pointing inside a pool-allocated GC object). Is that code on master or on a branch?

I might open a separate issue for this later...
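
For readers unfamiliar with the "is this pointer inside a pool-allocated object" check discussed above, here is a simplified sketch in C. The `pool_page` descriptor and both functions are hypothetical and far simpler than a real GC pool; they only illustrate how fixed-size cells make it possible to map an interior pointer back to its cell and to bound how much may be written through it:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pool descriptor: `ncells` fixed-size cells laid out
 * contiguously starting at `base`. */
typedef struct {
    unsigned char *base;
    size_t cell_size;
    size_t ncells;
} pool_page;

/* Returns the start of the cell containing p, or NULL if p lies outside
 * the page (or the descriptor is degenerate). */
static void *pool_cell_of(const pool_page *pg, const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)pg->base;
    uintptr_t hi = lo + pg->cell_size * pg->ncells;

    if (pg->cell_size == 0 || addr < lo || addr >= hi)
        return NULL;
    return pg->base + ((addr - lo) / pg->cell_size) * pg->cell_size;
}

/* Bytes from p to the end of its cell: the most a callee could write
 * through p without touching the neighbouring cell. Returns 0 if p is
 * not inside the page. */
static size_t pool_bytes_available(const pool_page *pg, const void *p)
{
    unsigned char *cell = pool_cell_of(pg, p);

    if (cell == NULL)
        return 0;
    return (size_t)(cell + pg->cell_size - (const unsigned char *)p);
}
```

With a page of four 16-byte cells, a pointer 17 bytes into the page maps to the second cell and has 15 bytes available. A sanitizer along the lines proposed here could place its canary at that boundary before the call.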
