Bug: Issue with TD3 for multi-dimensional action spaces #624

Closed
tyleringebrand opened this issue Apr 25, 2022 · 8 comments
@tyleringebrand

I recently tried to use TD3 for a custom MDP I wrote and ran into an error when I tried to make it work for MDPs with more than one action dimension. I am able to reproduce the same bug with this code:

using ReinforcementLearning
using StableRNGs
using Flux
using Flux.Losses
using IntervalSets

function RL.Experiment(
    ::Val{:JuliaRL},
    ::Val{:TD3},
    ::Val{:Pendulum},
    ::Nothing;
    seed = 123,
)
    rng = StableRNG(seed)
    inner_env = PendulumEnv(T = Float32, rng = rng)
    A = action_space(inner_env)
    low = A.left
    high = A.right
    ns = length(state(inner_env))

    env = ActionTransformedEnv(
        inner_env;
        action_mapping = x -> low + (x + 1) * 0.5 * (high - low),
    )
    init = glorot_uniform(rng)

    create_actor() = Chain(
        Dense(ns, 30, relu; init = init),
        Dense(30, 30, relu; init = init),
        Dense(30, 1, tanh; init = init),
    ) |> gpu

    create_critic_model() = Chain(
        Dense(ns + 1, 30, relu; init = init),
        Dense(30, 30, relu; init = init),
        Dense(30, 1; init = init),
    ) |> gpu

    create_critic() = TD3Critic(create_critic_model(), create_critic_model())

    agent = Agent(
        policy = TD3Policy(
            behavior_actor = NeuralNetworkApproximator(
                model = create_actor(),
                optimizer = ADAM(),
            ),
            behavior_critic = NeuralNetworkApproximator(
                model = create_critic(),
                optimizer = ADAM(),
            ),
            target_actor = NeuralNetworkApproximator(
                model = create_actor(),
                optimizer = ADAM(),
            ),
            target_critic = NeuralNetworkApproximator(
                model = create_critic(),
                optimizer = ADAM(),
            ),
            γ = 0.99f0,
            ρ = 0.99f0,
            batch_size = 64,
            start_steps = 1000,
            start_policy = RandomPolicy(-1.0..1.0; rng = rng),
            update_after = 1000,
            update_freq = 1,
            policy_freq = 2,
            target_act_limit = 1.0,
            target_act_noise = 0.1,
            act_limit = 1.0,
            act_noise = 0.1,
            rng = rng,
        ),
        trajectory = CircularArraySARTTrajectory(
            capacity = 10_000,
            state = Vector{Float32} => (ns,),
            # action = Float32 => (), # Original line. It assumes a single action dimension; what if we have multiple?
            action = Vector{Float32} => (1), # My change: store the action as a vector of Float32s instead of a single value,
                                             # mirroring the vector used for the state space.
        ),
    )

    stop_condition = StopAfterStep(10_000, is_show_progress=!haskey(ENV, "CI"))
    hook = TotalRewardPerEpisode()
    Experiment(agent, env, stop_condition, hook, "# Play Pendulum with TD3")
end
using Plots
ex = E`JuliaRL_TD3_Pendulum`
run(ex)
plot(ex.hook.rewards)

Note that this code is taken directly from https://juliareinforcementlearning.org/docs/experiments/experiments/Policy%20Gradient/JuliaRL_TD3_Pendulum/#JuliaRL_TD3_Pendulum, except for one change to the trajectory object; see the comments in the code.
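
For reference, here is a sketch of how the relevant pieces would look for a genuinely multi-dimensional action space (na is a hypothetical action dimension; the layer sizes and trajectory entry are illustrative assumptions, not tested code):

na = 2  # hypothetical number of action dimensions

create_actor() = Chain(
    Dense(ns, 30, relu; init = init),
    Dense(30, 30, relu; init = init),
    Dense(30, na, tanh; init = init),      # one output per action dimension
) |> gpu

create_critic_model() = Chain(
    Dense(ns + na, 30, relu; init = init), # the critic consumes state and action concatenated
    Dense(30, 30, relu; init = init),
    Dense(30, 1; init = init),
) |> gpu

# ...and in the trajectory:
#     action = Vector{Float32} => (na,),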

The error I get is:

DimensionMismatch("mismatch in dimension 2 (expected 64 got 1)")

It surfaces about 40 frames deep in the stack trace, during Zygote automatic differentiation:

Stacktrace:
  [1] _cs
    @ ./abstractarray.jl:1626 [inlined]
  [2] _cshp
    @ ./abstractarray.jl:1616 [inlined]
  [3] _cshp
    @ ./abstractarray.jl:1623 [inlined]
  [4] _cat_size_shape
    @ ./abstractarray.jl:1602 [inlined]
  [5] cat_size_shape(dims::Tuple{Bool}, X::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, tail::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Base ./abstractarray.jl:1600
  [6] _cat_t(::Val{1}, ::Type{Float32}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{Any, N} where N)
    @ Base ./abstractarray.jl:1646
  [7] cat_t(::Type{Float32}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::Vararg{Any, N} where N; dims::Val{1})
    @ Base ./abstractarray.jl:1643
  [8] _cat
    @ ./abstractarray.jl:1782 [inlined]
  [9] #cat#129
    @ ./abstractarray.jl:1781 [inlined]
 [10] vcat
    @ ./abstractarray.jl:1787 [inlined]
 [11] rrule
    @ ~/.julia/packages/ChainRules/3yDBX/src/rulesets/Base/array.jl:283 [inlined]
 [12] rrule
    @ ~/.julia/packages/ChainRulesCore/RbX5a/src/rules.jl:134 [inlined]
 [13] chain_rrule
    @ ~/.julia/packages/Zygote/H6vD3/src/compiler/chainrules.jl:216 [inlined]
 [14] macro expansion
    @ ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0 [inlined]
 [15] _pullback
    @ ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:9 [inlined]
 [16] _pullback
    @ ~/.julia/packages/ReinforcementLearningZoo/mCTvc/src/algorithms/policy_gradient/td3.jl:8 [inlined]
 [17] _pullback(::Zygote.Context, ::TD3Critic, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [18] _apply
    @ ./boot.jl:804 [inlined]
 [19] adjoint
    @ ~/.julia/packages/Zygote/H6vD3/src/lib/lib.jl:200 [inlined]
 [20] _pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
 [21] _pullback
    @ ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/policies/q_based_policies/learners/approximators/neural_network_approximator.jl:27 [inlined]
 [22] _pullback(::Zygote.Context, ::ReinforcementLearningCore.var"##_#115", ::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ::NeuralNetworkApproximator{TD3Critic, ADAM}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [23] _apply(::Function, ::Vararg{Any, N} where N)
    @ Core ./boot.jl:804
 [24] adjoint
    @ ~/.julia/packages/Zygote/H6vD3/src/lib/lib.jl:200 [inlined]
 [25] _pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
 [26] _pullback
    @ ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/policies/q_based_policies/learners/approximators/neural_network_approximator.jl:27 [inlined]
 [27] _pullback(::Zygote.Context, ::NeuralNetworkApproximator{TD3Critic, ADAM}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [28] _pullback
    @ ~/.julia/packages/ReinforcementLearningZoo/mCTvc/src/algorithms/policy_gradient/td3.jl:167 [inlined]
 [29] _pullback(::Zygote.Context, ::ReinforcementLearningZoo.var"#222#227"{TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, NeuralNetworkApproximator{TD3Critic, ADAM}, CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface2.jl:0
 [30] pullback(f::Function, ps::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:352
 [31] gradient(f::Function, args::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/H6vD3/src/compiler/interface.jl:75
 [32] update!(p::TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, batch::NamedTuple{(:state, :action, :reward, :terminal, :next_state), Tuple{Matrix{Float32}, Matrix{Float32}, Vector{Float32}, Vector{Bool}, Matrix{Float32}}})
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/mCTvc/src/algorithms/policy_gradient/td3.jl:166
 [33] update!(p::TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, traj::CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Float32, Vector{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Bool, Vector{Bool}}}}}, #unused#::ActionTransformedEnv{typeof(identity), var"#2#3"{Float64, Float64}, PendulumEnv{ClosedInterval{Float64}, Float32, StableRNGs.LehmerRNG}}, #unused#::PreActStage)
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/mCTvc/src/algorithms/policy_gradient/td3.jl:140
 [34] (::Agent{TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Float32, Vector{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Bool, Vector{Bool}}}}}})(stage::PreActStage, env::ActionTransformedEnv{typeof(identity), var"#2#3"{Float64, Float64}, PendulumEnv{ClosedInterval{Float64}, Float32, StableRNGs.LehmerRNG}}, action::Float64)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/policies/agents/agent.jl:78
 [35] _run(policy::Agent{TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Float32, Vector{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Bool, Vector{Bool}}}}}}, env::ActionTransformedEnv{typeof(identity), var"#2#3"{Float64, Float64}, PendulumEnv{ClosedInterval{Float64}, Float32, StableRNGs.LehmerRNG}}, stop_condition::StopAfterStep{ProgressMeter.Progress}, hook::TotalRewardPerEpisode)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/core/run.jl:29
 [36] run(policy::Agent{TD3Policy{NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, NeuralNetworkApproximator{Chain{Tuple{Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(tanh), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ADAM}, NeuralNetworkApproximator{TD3Critic, ADAM}, RandomPolicy{ClosedInterval{Float64}, StableRNGs.LehmerRNG}, StableRNGs.LehmerRNG}, CircularArraySARTTrajectory{NamedTuple{(:state, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularArrayBuffer{Float32, 2, Matrix{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Float32, Vector{Float32}}, CircularArrayBuffers.CircularVectorBuffer{Bool, Vector{Bool}}}}}}, env::ActionTransformedEnv{typeof(identity), var"#2#3"{Float64, Float64}, PendulumEnv{ClosedInterval{Float64}, Float32, StableRNGs.LehmerRNG}}, stop_condition::StopAfterStep{ProgressMeter.Progress}, hook::TotalRewardPerEpisode)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/core/run.jl:10
 [37] run(x::Experiment; describe::Bool)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/core/experiment.jl:56
 [38] run(x::Experiment)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/s9XPF/src/core/experiment.jl:55
 [39] top-level scope
    @ ~/DeepCorrectionCAS/MinErrorReproducibility.jl:86
 [40] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [41] top-level scope
    @ REPL[1]:1
in expression starting at /home/tyleri/DeepCorrectionCAS/MinErrorReproducibility.jl:86
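
For what it's worth, the shape clash itself is easy to reproduce outside of TD3. The following sketch only illustrates the vcat failure mode seen in frames [5] through [17] above; the exact array shapes are an assumption inferred from batch_size = 64 and the error message:

s = rand(Float32, 3, 64)     # state batch: (ns, batch_size)
a = rand(Float32, 1, 1, 64)  # action batch carrying an extra singleton dimension
vcat(s, a)                   # throws DimensionMismatch("mismatch in dimension 2 (expected 64 got 1)")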

My package versions look like:

  [7d9f7c33] Accessors v0.1.9
  [c7e460c6] ArgParse v1.1.4
  [fbb218c0] BSON v0.3.5
  [336ed68f] CSV v0.10.4
  [9de3a189] CircularArrayBuffers v0.1.10
  [d842c3ba] CommonRLInterface v0.3.1
  [a93c6f00] DataFrames v1.3.2
  [b4f34e82] Distances v0.10.7
  [31c24e10] Distributions v0.25.53
  [da5c29d0] EllipsisNotation v1.5.0
  [587475ba] Flux v0.12.10
  [86223c79] Graphs v1.6.0
  [bb4c363b] GridInterpolations v1.1.2
  [7073ff75] IJulia v1.23.3
  [c601a237] Interact v0.10.4
  [8197267c] IntervalSets v0.5.4
  [4138dd39] JLD v0.13.1
  [033835bb] JLD2 v0.4.22
  [872c559c] NNlib v0.8.4
  [3b7a836e] PGFPlots v3.4.2
  [f3bd98c0] POMDPLinter v0.1.1
  [08074719] POMDPModelTools v0.3.12
  [182e52fb] POMDPPolicies v0.4.2
  [a93abf59] POMDPs v0.9.4
  [d96e819e] Parameters v0.12.3
  [91a5bcdd] Plots v1.27.5
  [92933f4c] ProgressMeter v1.7.2
  [438e738f] PyCall v1.93.1
  [d330b81b] PyPlot v2.10.0
  [3cdcf5f2] RecipesBase v1.2.1
  [158674fc] ReinforcementLearning v0.10.0
  [e575027e] ReinforcementLearningBase v0.9.7
  [d607f57d] ReinforcementLearningZoo v0.5.10
  [860ef19b] StableRNGs v1.0.0
  [2913bbd2] StatsBase v0.33.16
  [bd369af6] Tables v1.7.0
  [899adc3e] TensorBoardLogger v0.1.19
  [b8865327] UnicodePlots v2.10.3
  [0f1e0344] WebIO v0.8.17
  [e88e6eb3] Zygote v0.6.37
  [8bb1440f] DelimitedFiles

I am using Julia 1.6.2.

Any help in making TD3 work with multi-dimensional action spaces would be appreciated. Is this a bug or a user error?

tyleringebrand changed the title from "Issue with TD3 for multi-dimensional action spaces" to "Bug: Issue with TD3 for multi-dimensional action spaces" on Apr 25, 2022
@findmyway
Member

        action = Vector{Float32} => (1), # my change, makes it into a vector of float32s instead of a single value. 

I vaguely remember it should be (na,) here. But I haven't tried your example here.

@tyleringebrand
Author

        action = Vector{Float32} => (1), # my change, makes it into a vector of float32s instead of a single value. 

I vaguely remember it should be (na,) here. But I haven't tried your example here.

na in this case is an integer equal to 1, so that gives the same result. I will try the inner comma, though.

@tyleringebrand
Author

No luck with the inner comma. I did confirm that ns is an Int64, so I assume na would be as well, which means writing 1 directly should behave the same as assigning 1 to na and then using na.
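
(As a small aside on the syntax, and purely for illustration rather than as an explanation of the failure: in Julia, (1) is just the integer 1, while (1,) is a one-element tuple, so the two spellings are different objects even though they describe the same size.)

julia> (1) isa Int
true

julia> (1,) isa Tuple{Int}
true

julia> na = 1; (na,) == (1,)
true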

@findmyway
Member

Thanks for the feedback, I'll take a look into it later tonight.

@findmyway
Member

Hi @tyleringebrand,

I think the bug comes from the following line:

I should fix it soon. Thanks again for reporting it.

findmyway added a commit to findmyway/ReinforcementLearning.jl that referenced this issue May 1, 2022
@findmyway
Member

@all-contributors please add @tyleringebrand for bug

@allcontributors
Contributor

@findmyway

I've put up a pull request to add @tyleringebrand! 🎉

@tyleringebrand
Author

Thanks @findmyway! I tested it on my custom MDP and everything works as expected. For anyone who runs into this before the patch is released (I assume version v0.11), you can get the bug fix using "add ReinforcementLearningZoo#findmyway-patch-7" in the package manager.
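
(For anyone following along, a minimal sketch of installing that branch via the Pkg API, equivalent to the Pkg-REPL command quoted above; the branch name is taken from the comment, everything else is standard Pkg usage:)

# In the Pkg REPL (press ] at the julia> prompt):
#   pkg> add ReinforcementLearningZoo#findmyway-patch-7
# Or, equivalently, from a script:
using Pkg
Pkg.add(PackageSpec(name = "ReinforcementLearningZoo", rev = "findmyway-patch-7"))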
