Skip to content
This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

cascading failure causing segfault in GtkReactive tests? #562

Closed
vtjnash opened this issue Feb 12, 2021 · 36 comments
Closed

cascading failure causing segfault in GtkReactive tests? #562

vtjnash opened this issue Feb 12, 2021 · 36 comments

Comments

@vtjnash
Copy link
Contributor

vtjnash commented Feb 12, 2021

https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2021-02/11/GtkReactive.1.7.0-DEV-50742c7e42.log

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.833: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.843: g_object_setv: assertion 'G_IS_OBJECT (object)' failed

(julia:170): GLib-GObject-CRITICAL **: 06:21:42.872: g_closure_new_object: assertion 'G_IS_OBJECT (object)' failed

signal (11): Segmentation fault
in expression starting at /home/pkgeval/.julia/packages/GtkReactive/gQSXq/test/runtests.jl:22
unsafe_store! at ./pointer.jl:118 [inlined]
_signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:40
signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:27 [inlined]
signal_connect at /home/pkgeval/.julia/packages/Gtk/TgEIW/src/GLib/signals.jl:27
@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

Is this a) reproducible and b) also happens in 1.6? I also have seen segfaults on 1.6-beta on an internal software of mine. A minimal reproducible example would be good to track this down.

@giordano
Copy link
Contributor

Those are the tests run by PkgEval. Note that tests for nightly have been been failing forever since we switched to Github Actions: https://github.com/JuliaGraphics/Gtk.jl/actions?query=workflow%3ACI

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

Hm, yes. But nightly != 1.6 right?

@giordano
Copy link
Contributor

Yes, but I was referring to point 1, tests have been consistently failing for months. Maybe looking at them might help. Julia v1.6 isn't tested at the moment in CI (but you just need to add one line to the matrix)

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

Hm, "Error: Could not find a Julia version that matches 1.6". Do you know what to enter there? 1.6-?

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

tests passing locally on Julia 1.6rc1

@giordano
Copy link
Contributor

At the moment it's ^1.6.0-0, it'll be 1.6.0, 1.6, or 1 when it'll be released

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

radio: Error During Test at /home/runner/work/Gtk.jl/Gtk.jl/test/gui.jl:340
170
  Test threw exception
171
  Expression: get_gtk_property(get_gtk_property(r, :active), :label, AbstractString) == choices[2]
172
  no active elements in GtkRadioGroup

seems unrelated to the segfault though.

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

ok, its the same test failure as on nightly.

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

I do not get why this is failing on Julia 1.6 (CI) but not not on 1.3 (CI) and also not locally on 1.6.

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

Locally I get:

julia> get_gtk_property(get_gtk_property(r,:active),:label,AbstractString)
"choice two"

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

@giordano: Can you somehow identify if different Gtk binaries are used for different Julia versions?

@giordano
Copy link
Contributor

also not locally on 1.6.

That's the fun part of running CI, isn't it? 😃

Note that you can SSH into the github actions runners for debugging with the step

    - name: Setup tmate session
      uses: mxschmitt/action-tmate@v3
      with:
        limit-access-to-actor: true

right before the run tests step (you can remove the last two lines if you don't want to use your ssh key to login). See https://github.com/marketplace/actions/debugging-with-tmate

@timholy
Copy link
Member

timholy commented Feb 13, 2021

TIL, thanks @giordano!

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

Ok, I have some findings and can even reproduce the issue locally. Here is my test program:

# file is called gui2.jl here locally
using Test, Gtk
using Gtk.ShortNames, Gtk.GConstants, Gtk.Graphics

numFailures = 0
for l=1:1000
  choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
  r = RadioButtonGroup(choices,2)
  active = [get_gtk_property(b,:active,Bool) for b in r]
  if sum(active) != 1
    @show active
    global numFailures += 1
  end
end

@show numFailures

So the issue is with the calls where the vector of active elements is obtained. Sometimes (not always) I get the following run:

julia> include("gui2.jl")

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.590: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:6981): GLib-GObject-CRITICAL **: 20:43:05.596: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed
active = Bool[0, 0, 0, 0, 0]
numFailures = 1
1

So it seems that something went wrong Gtk internally and then simply false is returned even for the active element.

@tknopp
Copy link
Collaborator

tknopp commented Feb 13, 2021

I do not get this error with Julia 1.5.

@tknopp
Copy link
Collaborator

tknopp commented Feb 14, 2021

So these sort of spontaneous errors are not the one I am capable of tracking down since this certainly touches the core of the type conversion in done in the glib module. The interesting question is what has changed from Julia 1.5 to 1.6 that can lead to these non-deterministic sort of errors. Ping @vtjnash who is much better prepared to solve this and ping @KristofferC to make you aware of this regression.

@jonathanBieler
Copy link
Collaborator

I suspect it's an issue with custom widgets since RadioButtonGroup is one (see also #557). Maybe there's some issues here ?

function gobject_move_ref(new::GObject, old::GObject)

@timholy
Copy link
Member

timholy commented Feb 23, 2021

@jonathanBieler, the RadioButtonGroup code doesn't use gobject_move_ref:

GtkRadioButtonGroup(layout::GtkContainer) = new(layout)

If I change the rhs to gobject_move_ref(new(layout), layout), I get this:

(julia:85406): GLib-GObject-CRITICAL **: 07:54:00.060: g_object_set_qdata_full: assertion 'G_IS_OBJECT (object)' failed

(julia:85406): GLib-GObject-CRITICAL **: 07:54:00.060: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.075: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.075: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.076: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.103: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

(julia:85406): Gtk-CRITICAL **: 07:54:00.118: gtk_container_add: assertion 'GTK_IS_CONTAINER (container)' failed

all immediately upon calling RadioButtonGroup(choices,2).

EDIT: the assertion 'G_IS_OBJECT (object)' failed are triggered at

gc_ref(new)

and the assertion 'GTK_IS_CONTAINER (container)' failed by each push!:

push!(grp, e, i == active)

@timholy
Copy link
Member

timholy commented Feb 23, 2021

OK, this diff gets us closer to the design suggested in https://juliagraphics.github.io/Gtk.jl/latest/manual/customWidgets/:

diff --git a/src/buttons.jl b/src/buttons.jl
index 4574ee9..e19b533 100644
--- a/src/buttons.jl
+++ b/src/buttons.jl
@@ -59,9 +59,9 @@ mutable struct GtkRadioButtonGroup <: GtkContainer # NOT a native @gtktype
     # the behavior is specified as undefined if the first
     # element is moved to a new group
     # do not rely on the current behavior, since it may change
-    handle::GtkContainer
+    handle::Ptr{GObject}
     anchor::GtkRadioButton
-    GtkRadioButtonGroup(layout::GtkContainer) = new(layout)
+    GtkRadioButtonGroup(layout::GtkContainer) = gobject_move_ref(new(unsafe_convert(Ptr{GObject}, layout)), layout)
 end
 const GtkRadioButtonGroupLeaf = GtkRadioButtonGroup
 macro GtkRadioButtonGroup(args...)
@@ -88,7 +88,7 @@ function push!(grp::GtkRadioButtonGroup, e::GtkRadioButton)
     else
         grp.anchor = e
     end
-    push!(grp.handle, e)
+    container_add!(grp.handle, e)
     grp
 end
 function push!(grp::GtkRadioButtonGroup, label, active::Union{Bool, Nothing} = nothing)
@@ -100,7 +100,7 @@ function push!(grp::GtkRadioButtonGroup, label, active::Union{Bool, Nothing} = n
     if isa(active, Bool)
         gtk_toggle_button_set_active(e, active::Bool)
     end
-    push!(grp.handle, e)
+    container_add!(grp.handle, e)
     grp
 end
 function start_(grp::GtkRadioButtonGroup)
diff --git a/src/container.jl b/src/container.jl
index a07ee41..b6609ee 100644
--- a/src/container.jl
+++ b/src/container.jl
@@ -1,9 +1,11 @@
+container_add!(w, child)    = ccall((:gtk_container_add, libgtk),    Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+container_remove!(w, child) = ccall((:gtk_container_remove, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
 function push!(w::GtkContainer, child)
-    ccall((:gtk_container_add, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+    container_add!(w, child)
     w
 end
 function delete!(w::GtkContainer, child::GtkWidget)
-    ccall((:gtk_container_remove, libgtk), Nothing, (Ptr{GObject}, Ptr{GObject},), w, child)
+    container_remove!(w, child)
     w
 end
 function empty!(w::GtkContainer)

Sometimes it succeeds, but I still get intermittent failures like

(julia:92982): GLib-GObject-CRITICAL **: 08:57:46.379: g_object_ref_sink: assertion 'G_IS_OBJECT (object)' failed

(julia:92982): GLib-GObject-CRITICAL **: 08:57:46.384: g_object_get_property: assertion 'G_IS_OBJECT (object)' failed

that seem to be reduced by adding various sleeps. What's the right way to do this?

@vtjnash
Copy link
Contributor Author

vtjnash commented Feb 23, 2021

I think you might need a GC.@preserve w in places like

iterate(w::GtkRadioButtonGroup, s=start_(w)) = iterate(s, s)

@timholy
Copy link
Member

timholy commented Feb 23, 2021

Ho ho! I hadn't noticed that, but that's a good catch.

I added it everywhere start_ gets used in buttons.jl (there are other uses for other types but I didn't try it there). I still get intermittent errors like the above.

@timholy
Copy link
Member

timholy commented Feb 23, 2021

I think you're seriously on to something, though: if I wrap

Gtk.jl/test/gui.jl

Lines 314 to 342 in d743a8b

choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
w = Window("Radio")
f = Gtk.GtkBox(:v); push!(w,f)
r = Vector{RadioButton}(undef, 3)
r[1] = RadioButton(choices[1]); push!(f,r[1])
r[2] = RadioButton(r[1],choices[2]); push!(f,r[2])
r[3] = RadioButton(r[2],choices[3],active=true); push!(f,r[3])
@test [get_gtk_property(b,:active,Bool) for b in r] == [false, false, true]
set_gtk_property!(r[1],:active,true)
@test [get_gtk_property(b,:active,Bool) for b in r] == [true, false, false]
showall(w)
destroy(w)
r = RadioButtonGroup(choices,2)
@test length(r) == 5
@test sum([get_gtk_property(b,:active,Bool) for b in r]) == 1
itms = Vector{Any}(undef,length(r))
for (i,e) in enumerate(r)
itms[i] = try
get_gtk_property(e,:label,AbstractString)
catch
e[1]
end
end
@test setdiff(choices, itms) == [choices[4],]
@test setdiff(itms, choices) == ["choice four",]
@test get_gtk_property(get_gtk_property(r,:active),:label,AbstractString) == choices[2]
w = Window(r,"RadioGroup")|>showall
destroy(w)

in GC.enable(false); ... ; GC.enable(true) it seems to pass every time.

@timholy
Copy link
Member

timholy commented Feb 23, 2021

There really does seem to be something different between 1.5 and 1.6. If you put

GC.gc(true); GC.gc(true); GC.gc(true)

right after the r = RadioButtonGroup(choices,2), then on 1.6 I get failures on alternating runs. (weird) It never seems to fail on 1.5.

As disclosure, I've also commented out most steps that run prior to the "radio" tests, and I've also removed the @testset.

diff --git a/test/gui.jl b/test/gui.jl
index 024af93..e5a9b81 100755
--- a/test/gui.jl
+++ b/test/gui.jl
@@ -2,7 +2,7 @@
 
 using Gtk.ShortNames, Gtk.GConstants, Gtk.Graphics
 import Gtk.deleteat!, Gtk.libgtk_version, Gtk.GtkToolbarStyle, Gtk.GtkFileChooserAction, Gtk.GtkResponseType
-
+#=
 ## for FileFilter
 # This is just for testing, and be careful of garbage collection while using this
 struct GtkFileFilterInfo
@@ -309,8 +309,8 @@ set_gtk_property!(check,:label,"new label")
 showall(w)
 destroy(w)
 end
-
-@testset "radio" begin
+=#
+# @testset "radio" begin
 choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
 w = Window("Radio")
 f = Gtk.GtkBox(:v); push!(w,f)
@@ -325,7 +325,9 @@ showall(w)
 destroy(w)
 
 r = RadioButtonGroup(choices,2)
+GC.gc(true); GC.gc(true); GC.gc(true)
 @test length(r) == 5
+GC.enable(false)
 @test sum([get_gtk_property(b,:active,Bool) for b in r]) == 1
 itms = Vector{Any}(undef,length(r))
 for (i,e) in enumerate(r)
@@ -335,12 +337,17 @@ for (i,e) in enumerate(r)
             e[1]
         end
 end
+GC.enable(true)
 @test setdiff(choices, itms) == [choices[4],]
 @test setdiff(itms, choices) == ["choice four",]
+GC.enable(false)
 @test get_gtk_property(get_gtk_property(r,:active),:label,AbstractString) == choices[2]
+GC.enable(true)
 w = Window(r,"RadioGroup")|>showall
 destroy(w)
-end
+# end
+# GC.enable(true)
+error("stop")
 
 @testset "ToggleButton" begin
 tb = ToggleButton("Off")

@timholy
Copy link
Member

timholy commented Feb 23, 2021

Another observation: if I combine the above with disabling GC while the RadioButtonGroup constructor runs, then I get the assertion 'G_IS_OBJECT (object)' failed on every run. So it seems that GC needs to run during construction?

@timholy
Copy link
Member

timholy commented Feb 23, 2021

I am now onto other deadlines, but to leave this in a good state, here's a MWE:

using Revise   # curiously, this seems to be important to making this fully reproducible, though CI fails without
using Gtk, Gtk.ShortNames, Test

choices = ["choice one", "choice two", "choice three", RadioButton("choice four"), Label("choice five")]
r = RadioButtonGroup(choices,2)
GC.gc(true); GC.gc(true); GC.gc(true)
@test length(r) == 5
@test sum([get_gtk_property(b,:active,Bool) for b in r]) == 1

This reliably passes on 1.5 and reliably fails on 1.6. If you comment out the GC.gc(true); GC.gc(true); GC.gc(true) line then it seems to pass most or all the time on 1.6. The outcome is repeatable even in the same session, you don't need a restart.

Both are running Gtk 1.1.6.

@timholy
Copy link
Member

timholy commented Feb 24, 2021

I've opened an issue (see link above) that identifies JuliaLang/julia#38180 as the probable source of the change. What's really odd is that Gtk doesn't use WeakKeyDict (lots of WeakRefs though), but there were also changes to gc.c.

@vtjnash
Copy link
Contributor Author

vtjnash commented Feb 24, 2021

That changed WeakRef so they must be re-set by a finalizer if it wants to keep them active. E.g.

w = WeakRef(x)
finalizer(x) do x; keepalive(x) && w.value = x; end

@timholy
Copy link
Member

timholy commented Feb 24, 2021

Sorry, I don't understand. I don't find keepalive in Base, GC, Gtk, or Gtk.GLib. Are there any docs on this?

@vtjnash
Copy link
Contributor Author

vtjnash commented Feb 25, 2021

I suppose I should have just said the_rest_of_the_finalizer(x)

@timholy
Copy link
Member

timholy commented Feb 26, 2021

OK. It would be ideal if someone who understands memory management in Gtk (@jonathanBieler ?) fixes this here, and @vtjnash adds the requisite NEWS entry (JuliaLang/julia#39811 (comment)). If @jonathanBieler can't fix it I can tackle it, but I'd like to wait for @vtjnash's NEWS entry first since the enhanced clarity will save me time (which is in short supply).

@jonathanBieler
Copy link
Collaborator

@timholy Sadly I don't understand much about memory management in Gtk.jl or Julia for that matter.

@vtjnash
Copy link
Contributor Author

vtjnash commented Feb 26, 2021

I don't see an obvious reason it should behave different. Someone might want to run this under rr, so you watch where the g_object_unref calls happen and see where it should be fixed.

@timholy
Copy link
Member

timholy commented Feb 26, 2021

Sorry @jonathanBieler, I had you mixed up with with @bfredl who added much of the glib stuff but who hasn't been around a lot lately.

@timholy
Copy link
Member

timholy commented Feb 27, 2021

Unfortunately, every time I've tried to use rr in the last two years I get segfaults that don't happen without it:

$ rr record ./julia-debug --args /tmp/gtk.jl
rr: Saving execution to trace directory `/home/tim/.local/share/rr/julia-debug-0'.
Segmentation fault

@tknopp
Copy link
Collaborator

tknopp commented Feb 27, 2021

( I know this does not change anything, but a big "thank you" for investigating Tim )

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants