Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed compat to enable tests on CI #60

Merged
merged 3 commits into from
Nov 10, 2023
Merged

Fixed compat to enable tests on CI #60

merged 3 commits into from
Nov 10, 2023

Conversation

stemann
Copy link
Collaborator

@stemann stemann commented Aug 3, 2023

  • Using CUDA_Runtime_jll to have CUDA 10.2 in LD_LIBRARY_PATH
  • Fixed compat for NNlib - to avoid breaking change in NNlib v0.7.25.
  • Limited compat to enable tests on Julia 1.6. Dropped testing on Julia > 1.6 as this would require a more recent Flux.jl and CUDA.jl.
  • "Adjusted" numerical accuracy of NNlib tests (see below)
  • Re-organized tests

@DhairyaLGandhi
Copy link
Member

the ci seems like it couldn't find the head of this PR, could we push into the branch and try to force it with a valid commit.

@stemann
Copy link
Collaborator Author

stemann commented Aug 3, 2023

Right - strange...

Where to query/report Buildkite CI issues? On Slack in #ci-failures ?

@stemann
Copy link
Collaborator Author

stemann commented Aug 9, 2023

@ToucheSir I got a little persistent with getting the tests running (I had a false memory of there being more tests)...

Anyway, the tests are being run now (with CUDA 10.2) - they are running using Flux v0.12 on Julia 1.9 (and using Flux v0.11 on Julia 1.6). I have almost zero experience with Flux: Are there any obvious quick fixes to just get the tests passing?

I changed ResNet() to ResNet(18). Now the tests are stalling at top = tresnet(tip) with tresnet being an Int64 in https://github.com/FluxML/Torch.jl/blob/master/test/runtests.jl#L22:

@testset "Flux" begin
  resnet = ResNet(18)
  tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

  ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
  tip = tensor(ip, dev = 0) # 0 => GPU:0 in Torch

  top = tresnet(tip)
  op = resnet.layers(ip)

  gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))
  @test top isa Tensor
  @test size(top) == size(op)
  @test gs isa Flux.Zygote.Grads
end

@stemann stemann changed the title BuildKite: Set CUDA version to 10.2 Buildkite: Set CUDA version to 10.2 Aug 9, 2023
@ToucheSir
Copy link
Member

I'm not sure, but I suspect the very outdated compat for NNlib in Project.toml is holding Metalhead back and giving an older version which doesn't behave as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand Preferences.jl correctly, we should not have this checked in. If we need to set the preference at a package level, it should be set in Project.toml.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@stemann stemann mentioned this pull request Oct 22, 2023
3 tasks
@stemann
Copy link
Collaborator Author

stemann commented Oct 24, 2023

@ToucheSir I got a little persistent with getting the tests running (I had a false memory of there being more tests)...

Anyway, the tests are being run now (with CUDA 10.2) - they are running using Flux v0.12 on Julia 1.9 (and using Flux v0.11 on Julia 1.6). I have almost zero experience with Flux: Are there any obvious quick fixes to just get the tests passing?

Err... that thing was easily fixed ... eventually - by just not doing it completely wrong... (using ResNet(18) instead of ResNet18() ... 🤦 )

With the current code, it now stumbles on a more challenging error in NNLib.conv (in src/nnlib.jl#L9) - i.e. with:

@testset "Flux" begin
  resnet = ResNet18()
  tresnet = Flux.fmap(Torch.to_tensor, resnet.layers)

  ip = rand(Float32, 224, 224, 3, 1) # An RGB Image
  tip = tensor(ip, dev = 0) # 0 => GPU:0 in Torch

  top = tresnet(tip)
  op = resnet.layers(ip)

  gs = gradient(() -> sum(tresnet(tip)), Flux.params(tresnet))
  @test top isa Tensor
  @test size(top) == size(op)
  @test gs isa Flux.Zygote.Grads
end

the call top = tresnet(tip) fails while executing:

function NNlib.conv(x::Tensor{xT, N}, w::Tensor, b::Tensor{T},
                    cdims::DenseConvDims{M,K,C_in,C_out,S,P,D,F}) where {T,N,xT,M,K,C_in,C_out,S,P,D,F}
  op = conv2d(x, w, b, stride = collect(S), padding = [P[1];P[3]], dilation = collect(D))
  op
end

on a bounds error for P[3]:

ERROR: BoundsError: attempt to access Tuple{Int64, Int64} at index [3]
Stacktrace:
  [1] getindex(t::Tuple, i::Int64)
    @ Base ./tuple.jl:29
  [2] macro expansion
    @ ./show.jl:1128 [inlined]
  [3] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, b::Tensor{Float32, 1}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
    @ Torch ~/jsa/Torch.jl/src/nnlib.jl:9
  [4] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, cdims::DenseConvDims{2, (7, 7), 3, 64, 1, (2, 2), (3, 3, 3, 3), (1, 1), false})
    @ Torch ~/jsa/Torch.jl/src/nnlib.jl:15
  [5] (::Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}})(x::Tensor{Float32, 4})
    @ Flux ~/.julia/packages/Flux/BPPNj/src/layers/conv.jl:166

where:

typeof(x), size(x) # = (Tensor{Float32, 4}, (224, 224, 3, 1))
typeof(w), size(w) # = (Tensor{Float32, 4}, (7, 7, 3, 64))
typeof(b), size(b) # = (Tensor{Float32, 1}, (64,))

b # = Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cdims # = DenseConvDims: (224, 224, 3) * (7, 7) -> (112, 112, 64), stride: (2, 2), pad: (3, 3, 3, 3), dil: (1, 1), flip: false, groups: 1
S # = 1
P # = (2, 2)
D # = (3, 3, 3, 3)

@ToucheSir Any suggestion for what is failing here?

@stemann
Copy link
Collaborator Author

stemann commented Oct 24, 2023

Just changing P[3] to P[2] just results in an error in Torch.conv2d (in src/ops.jl#L93) - in reverse(stride) where typeof(stride), size(stride), stride = (Array{Int64, 0}, (), fill(1)) which is hard to reverse...

@stemann
Copy link
Collaborator Author

stemann commented Oct 24, 2023

@DhairyaLGandhi Can you remember why Base.getindex for Torch.Tensor was left in the non-functional state in src/tensor.jl#L84-L88 ?

function Base.getindex(t::Tensor{T,N}, I::Vararg{Int,N}) where {T,N}
  # @show reverse!(collect(I)) .- 1, size(t)
  # at_double_value_at_indexes(t.ptr, reverse!(collect(I)) .- 1, N)
  zero(T)
end

It is needed for display of Tensor's ... which would be neat for debugging... :-)

@DhairyaLGandhi
Copy link
Member

It was because the indexing had a bug that I was trying to figure out. And collecting returned correct results still if we needed to represent the tensor in a Julia-"native" datatype.

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Oct 24, 2023

We don't need to fall back to NNlib's conv, the one in Torch may have to be updated to the latest API in NNlib if it doesn't get hit

@stemann
Copy link
Collaborator Author

stemann commented Oct 24, 2023

We don't need to fall back to NNlib's conv, the one in Torch may have to be updated to the latest API in NNlib if it doesn't get hit

It is hitting the Torch conv, but likely with the wrong input.

@stemann
Copy link
Collaborator Author

stemann commented Oct 25, 2023

Any idea why CI is still waiting for an agent?

Edit, Oct. 25: Cf. build 68 and build 69

Edit, Oct. 31: Build 68 and build 69 are still pending...

@stemann
Copy link
Collaborator Author

stemann commented Oct 25, 2023

It seems like rolling back to Julia v1.6-compatible versions of Flux and Metalhead can avoid the NNLib.conv(::Torch.Tensor, ::Torch.Tensor; ...)-issue. I will look into setting up a test env. (using Julia 1.1 test-specific dependency management)...

Edit/err: Fixed by limiting compatibility for NNlib.

@DhairyaLGandhi
Copy link
Member

We can probably drop support for intermediate versions of Julia, and low bound it to 1.6

@stemann
Copy link
Collaborator Author

stemann commented Nov 1, 2023

We can probably drop support for intermediate versions of Julia, and low bound it to 1.6

Yes - I was just trying out if the old version would still pass on Julia v1.5 - as v1.6 is still causing trouble wrt. the Flux integration.

@stemann
Copy link
Collaborator Author

stemann commented Nov 2, 2023

Alright: The older Julia 1.5 Manifest was using NNLib 0.7.10, and indeed if limiting NNlib to <= 0.7.24, the tests on Julia 1.6 gets as far as the tests on Julia 1.5 currently gets: Success! ✅

Edit: Fixed by running on pre-Ampere / CUDNN 7 compatible GPU:
Failing deeper into the NNlib.conv(::Tensor, ::Tensor; ...)-call - with a CUDNN error reported from Torch - might be due to the CUDNN version - or running on CUDA 10.2 (instead of 10.1...):

  Got exception outside of a @test

  "cuDNN error: CUDNN_STATUS_EXECUTION_FAILED (cudnn_convolution_add_bias_ at ../../aten/src/ATen/native/cudnn/Conv.cpp:812)

...



Stacktrace:
--
  | [1] macro expansion
  | @ /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/src/error.jl:16 [inlined]
  | [2] atg_conv2d(arg1::Base.RefValue{Ptr{Nothing}}, input::Ptr{Nothing}, weight::Ptr{Nothing}, bias::Ptr{Nothing}, stride_data::Vector{Int64}, stride_len::Int64, padding_data::Vector{Int64}, padding_len::Int64, dilation_data::Vector{Int64}, dilation_len::Int64, groups::Int64)
  | @ Torch /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/src/wrap/libdoeye_caml_generated.jl:904
  | [3] conv2d(input::Tensor{Float32, 4}, filter::Tensor{Float32, 4}, bias::Tensor{Float32, 1}; stride::Vector{Int64}, padding::Vector{Int64}, dilation::Vector{Int64}, groups::Int64)
  | @ Torch /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/src/ops.jl:92
  | [4] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, b::Tensor{Float32, 1}, cdims::DenseConvDims{2, (7, 7), 3, 64, (2, 2), (3, 3, 3, 3), (1, 1), false})
  | @ Torch /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/src/nnlib.jl:9
  | [5] conv(x::Tensor{Float32, 4}, w::Tensor{Float32, 4}, cdims::DenseConvDims{2, (7, 7), 3, 64, (2, 2), (3, 3, 3, 3), (1, 1), false})
  | @ Torch /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/src/nnlib.jl:15
  | [6] (::Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}})(x::Tensor{Float32, 4})
  | @ Flux ~/.cache/julia-buildkite-plugin/depots/6e7ea706-f768-4492-9c6f-30c3c87ddb4d/packages/Flux/goUGu/src/layers/conv.jl:147
  | [7] applychain(fs::Tuple{Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}}, MaxPool{2, 2}, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, MeanPool{2, 4}, Metalhead.var"#103#104", Dense{typeof(identity), Tensor{Float32, 2}, Tensor{Float32, 1}}, typeof(softmax)}, x::Tensor{Float32, 4})
  | @ Flux ~/.cache/julia-buildkite-plugin/depots/6e7ea706-f768-4492-9c6f-30c3c87ddb4d/packages/Flux/goUGu/src/layers/basic.jl:36
  | [8] (::Chain{Tuple{Conv{2, 2, typeof(identity), Tensor{Float32, 4}, Tensor{Float32, 1}}, MaxPool{2, 2}, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, Metalhead.ResidualBlock, MeanPool{2, 4}, Metalhead.var"#103#104", Dense{typeof(identity), Tensor{Float32, 2}, Tensor{Float32, 1}}, typeof(softmax)}})(x::Tensor{Float32, 4})
  | @ Flux ~/.cache/julia-buildkite-plugin/depots/6e7ea706-f768-4492-9c6f-30c3c87ddb4d/packages/Flux/goUGu/src/layers/basic.jl:38
  | [9] macro expansion
  | @ /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/test/runtests.jl:22 [inlined]
  | [10] macro expansion
  | @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
  | [11] top-level scope
  | @ /var/lib/buildkite-agent/builds/gpuci-14/julialang/torch-dot-jl/test/runtests.jl:16
```</strike>

@@ -58,7 +58,7 @@ end
test_output = NNlib.conv(x, w, cdims)

test_output = Array(test_output)
@test maximum(abs.(test_output - expected_output)) < 10 * eps(Float32)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjosv79, @DhairyaLGandhi: Did you ever experience problems with the numerical accuracy of these tests? (in relation to #38)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems they were never included in runtests.jl, so I suggest to leave them either skipped/marked broken or with the current relaxed constraint on their numerical accuracy.

@stemann stemann changed the title Buildkite: Set CUDA version to 10.2 Fixed compat to enable tests on CI Nov 3, 2023
@stemann stemann marked this pull request as ready for review November 3, 2023 03:26
@stemann stemann marked this pull request as draft November 3, 2023 03:48
Notably,
* Limited NNlib compat to <= 0.7.24: DenseConvDims was changed (breaking) in v0.7.25 (in FluxML/NNlib.jl@5ffabbc).

Also:
* Limited test-compat for Flux to v0.11.
* Limited test-compat for Zygote to v0.5.
* Removed Manifest.toml.
* Buildkite: Updated cuda definition.
* Buildkite: Set cap to sm_75 to limit to pre-Ampere GPUs (compatible with Torch_jll v1.4 CUDNN 7).
* Buildkite: Dropped testing on Julia > v1.6. Julia v1.7+ needs newer version of Flux.jl (> v0.11) to support a newer version of CUDA.jl (> v2).
On CI,
* max abs difference was up to 6.1035156f-5.
* max abs difference for L61 as high as 0.017906189f0

Also:
* Included test_nnlib.jl in runtests.jl.
@stemann stemann marked this pull request as ready for review November 3, 2023 04:08
@stemann
Copy link
Collaborator Author

stemann commented Nov 8, 2023

LGTM :-)

Copy link
Member

@ToucheSir ToucheSir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind reminding me what the order of PRs and dependencies going forwards is?

@ToucheSir ToucheSir merged commit a09c4f6 into FluxML:master Nov 10, 2023
@stemann
Copy link
Collaborator Author

stemann commented Nov 10, 2023

I would suggest the following:

  1. Updated Julia wrapper generator #59 to have that part cleaned-up
  2. Update the C wrapper, Updated C wrapper wrt. Torch v1.10 #61
  3. Re-do the Torch_jll in Yggdrasil - it is missing CUDA platform augmentation, I believe.
  4. Build the C wrapper (Updated C wrapper wrt. Torch v1.10 #61) in Yggdrasil, e.g. as TorchCAPI_jll - having Torch_jll as a dependency
  5. Update Torch.jl to use the updated wrapper

@stemann stemann deleted the feature/buildkite_cuda_10.2 branch November 10, 2023 09:48
@stemann stemann mentioned this pull request Nov 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants