Fixed compat to enable tests on CI #60
stemann commented Aug 3, 2023 (edited):
- Using CUDA_Runtime_jll to have CUDA 10.2 in LD_LIBRARY_PATH
- Fixed compat for NNlib to avoid the breaking change in NNlib v0.7.25.
- Limited compat to enable tests on Julia 1.6. Dropped testing on Julia > 1.6 as this would require a more recent Flux.jl and CUDA.jl.
- "Adjusted" numerical accuracy of NNlib tests (see below)
- Re-organized tests
It seems like the CI couldn't find the head of this PR. Could we push to the branch and try to force it with a valid commit?
Right - strange... Where should Buildkite CI issues be queried/reported? On Slack in #ci-failures?
@ToucheSir I got a little persistent with getting the tests running (I had a false memory of there being more tests)... Anyway, the tests are being run now (with CUDA 10.2) - they are running using Flux v0.12 on Julia 1.9 (and using Flux v0.11 on Julia 1.6). I have almost zero experience with Flux: are there any obvious quick fixes to just get the tests passing? I changed

```julia
@testset "Flux" begin
resnet = ResNet(18)
tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)
ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
tip = tensor(ip, dev = 0) # 0 => GPU:0 in Torch
top = tresnet(tip)
op = resnet.layers(ip)
gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))
@test top isa Tensor
@test size(top) == size(op)
@test gs isa Flux.Zygote.Grads
end
```
I'm not sure, but I suspect the very outdated compat for NNlib in Project.toml is holding Metalhead back and giving an older version which doesn't behave as expected.
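One way to check that suspicion is to look at which versions the resolver actually picked in the test environment; a minimal sketch:

```julia
using Pkg

# An old NNlib upper bound in Project.toml can also hold Metalhead and Flux back;
# Pkg.status shows the versions that were actually resolved.
Pkg.status(["NNlib", "Metalhead", "Flux"])
```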
LocalPreferences.toml (outdated)
If I understand Preferences.jl correctly, we should not have this checked in. If we need to set the preference at a package level, it should be set in Project.toml.
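As an illustration of this point, a sketch of writing a preference into Project.toml instead of LocalPreferences.toml; the target package (CUDA_Runtime_jll) and the "version" key are assumptions based on the PR description, not necessarily what was checked in here:

```julia
using Preferences
using CUDA_Runtime_jll  # assumed target package, per the PR description

# export_prefs = true stores the preference under [preferences.CUDA_Runtime_jll]
# in Project.toml rather than creating a LocalPreferences.toml that would have
# to be checked in.
set_preferences!(CUDA_Runtime_jll, "version" => "10.2"; export_prefs = true, force = true)
```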
Fixed
Err... that thing was easily fixed ... eventually - by just not doing it completely wrong... (using …). With the current code, it now stumbles on a more challenging error: in

```julia
@testset "Flux" begin
resnet = ResNet18()
tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)
ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
tip = tensor(ip, dev = 0) # 0 => GPU:0 in Torch
top = tresnet(tip)
op = resnet.layers(ip)
gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))
@test top isa Tensor
@test size(top) == size(op)
@test gs isa Flux.Zygote.Grads
end
```

the call

```julia
function NNlib.conv(x::Tensor{xT, N}, w::Tensor, b::Tensor{T},
cdims::DenseConvDims{M,K,C_in,C_out,S,P,D,F}) where {T,N,xT,M,K,C_in,C_out,S,P,D,F}
op = conv2d(x, w, b, stride = collect(S), padding = [P[1];P[3]], dilation = collect(D))
op
end
```

fails with a bounds error:

```julia
ERROR: BoundsError: attempt to access Tuple{Int64, Int64} at index [3]
Stacktrace:
[1] getindex(t::Tuple, i::Int64)
@ Base ./tuple.jl:29
[2] macro expansion
@ ./show.jl:1128 [inlined]
[3] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, b::Tensor{Float32, 1}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
@ Torch ~/jsa/Torch.jl/src/nnlib.jl:9
[4] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
@ Torch ~/jsa/Torch.jl/src/nnlib.jl:15
[5] (::Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}})(x::Tensor{Float32, 4})
@ Flux ~/.julia/packages/Flux/BPPNj/src/layers/conv.jl:166
```

where:

```julia
typeof(x), size(x) # = (Tensor{Float32, 4}, (224, 224, 3, 1))
typeof(w), size(w) # = (Tensor{Float32, 4}, (7, 7, 3, 64))
typeof(b), size(b) # = (Tensor{Float32, 1}, (64,))
b # = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cdims # = DenseConvDims: (224, 224, 3) * (7, 7) -> (112, 112, 64), stride: (2, 2), pad: (3, 3, 3, 3), dil: (1, 1), flip: false, groups: 1
S # = 1
P # = (2, 2)
D # = (3, 3, 3, 3)
```

@ToucheSir Any suggestion for what is failing here?
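As a sketch of one possible workaround (not the fix applied in this PR), the stride/padding/dilation could be read via NNlib's ConvDims accessor functions instead of destructuring the DenseConvDims type parameters, whose layout changed (breaking) in NNlib v0.7.25:

```julia
using NNlib
using Torch: Tensor, conv2d  # conv2d is the wrapper called in src/nnlib.jl above

# Read the geometry via accessors so the code does not depend on the order of
# DenseConvDims' type parameters.
function conv_via_accessors(x::Tensor, w::Tensor, b::Tensor, cdims::NNlib.DenseConvDims)
    s = NNlib.stride(cdims)    # e.g. (2, 2)
    p = NNlib.padding(cdims)   # e.g. (3, 3, 3, 3)
    d = NNlib.dilation(cdims)  # e.g. (1, 1)
    conv2d(x, w, b, stride = collect(s), padding = [p[1]; p[3]], dilation = collect(d))
end
```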
Just changing …
@DhairyaLGandhi Can you remember why

```julia
function Base.getindex(t::Tensor{T,N}, I::Vararg{Int,N}) where {T,N}
# @show reverse!(collect(I)) .- 1, size(t)
# at_double_value_at_indexes(t.ptr, reverse!(collect(I)) .- 1, N)
zero(T)
end
```

It is needed for display of …
It was because the indexing had a bug that I was trying to figure out. And …
We don't need to fall back to NNlib's …
It is hitting the Torch conv, but likely with the wrong input.
It seems like rolling back to Julia v1.6-compatible versions of Flux and Metalhead can avoid the …

Edit/err: Fixed by limiting compatibility for NNlib.
We can probably drop support for intermediate versions of Julia, and lower-bound it to 1.6.
Yes - I was just trying out whether the old version would still pass on Julia v1.5 - as v1.6 is still causing trouble wrt. the Flux integration.
Alright: the older Julia 1.5 Manifest was using NNlib 0.7.10, and indeed, limiting NNlib to <= 0.7.24 gets the tests on Julia 1.6 as far as the tests on Julia 1.5 currently get: Success! ✅

Edit: Fixed by running on a pre-Ampere / CUDNN 7-compatible GPU.
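For completeness, a sketch of trying the pre-breaking-change NNlib release locally; the PR itself expresses this as a [compat] upper bound rather than an explicit add:

```julia
using Pkg

# Resolve NNlib to the last release before the breaking DenseConvDims change in v0.7.25.
Pkg.add(name = "NNlib", version = "0.7.24")
```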
```
@@ -58,7 +58,7 @@ end
test_output = NNlib.conv(x, w, cdims)

test_output = Array(test_output)
@test maximum(abs.(test_output - expected_output)) < 10 * eps(Float32)
```
@bjosv79, @DhairyaLGandhi: Did you ever experience problems with the numerical accuracy of these tests? (in relation to #38)
Nope
It seems they were never included in runtests.jl, so I suggest leaving them either skipped/marked broken or with the current relaxed constraint on their numerical accuracy.
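For reference, a self-contained sketch of both options; the arrays and tolerance values are illustrative only, not the ones used in this PR:

```julia
using Test

# Illustrative stand-ins for the arrays compared in the NNlib tests.
expected_output = Float32[1.0 2.0; 3.0 4.0]
test_output = expected_output .+ 1f-5   # pretend result with a small numerical error

# Option 1: a relaxed relative/absolute tolerance instead of 10 * eps(Float32).
@test isapprox(test_output, expected_output; rtol = 1f-2, atol = 1f-4)

# Option 2: keep the strict check but mark it as known-broken until investigated.
@test_broken maximum(abs.(test_output - expected_output)) < 10 * eps(Float32)
```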
Notably,

- Limited NNlib compat to <= 0.7.24: DenseConvDims was changed (breaking) in v0.7.25 (in FluxML/NNlib.jl@5ffabbc).

Also:

- Limited test-compat for Flux to v0.11.
- Limited test-compat for Zygote to v0.5.
- Removed Manifest.toml.
- Buildkite: Updated cuda definition.
- Buildkite: Set cap to sm_75 to limit to pre-Ampere GPUs (compatible with Torch_jll v1.4 CUDNN 7).
- Buildkite: Dropped testing on Julia > v1.6. Julia v1.7+ needs a newer version of Flux.jl (> v0.11) to support a newer version of CUDA.jl (> v2).
On CI,

- max abs difference was up to 6.1035156f-5.
- max abs difference for L61 as high as 0.017906189f0.

Also:

- Included test_nnlib.jl in runtests.jl (see the sketch below).
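A minimal sketch of that last item, assuming test_nnlib.jl sits next to runtests.jl; the @testset name is just illustrative:

```julia
using Test

# Pull the NNlib-related tests into the main test run.
@testset "NNlib" begin
    include("test_nnlib.jl")
end
```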
LGTM :-)
Do you mind reminding me what the order of PRs and dependencies going forward is?
I would suggest the following: