Using PPOPolicy with custom environment with action masking in ReinforcementLearning.jl

In order to use action masking with the PPOPolicy, I think I have to use a MaskedPPOTrajectory. The legal_actions_mask gets updated in the update! function with the result of legal_action_space_mask(env), which, as far as I understand, should always be an AbstractArray of Bool with the same length as the action_space. Now I get the following error:

LoadError: MethodError: no method matching CircularArrayBuffers.CircularArrayBuffer(::Array{Bool, 97}, ::Int64, ::Int64, ::Bool)
Closest candidates are:
  CircularArrayBuffers.CircularArrayBuffer(::Array{T, N}, ::Int64, ::Int64, ::Int64) where {T, N} at /home/mfg/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:12
  CircularArrayBuffers.CircularArrayBuffer(::AbstractArray{T, N}) where {T, N} at /home/mfg/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:27
Stacktrace:
  [1] (CircularArrayBuffers.CircularArrayBuffer{Bool, N} where N)(::Bool, ::Vararg{Integer, N} where N)
    @ CircularArrayBuffers ~/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:24
  [2] (::ReinforcementLearningCore.var"#51#52"{Int64})(x::Pair{DataType, Vector{Bool}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:58
  [3] map(f::ReinforcementLearningCore.var"#51#52"{Int64}, t::Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}})
    @ Base ./tuple.jl:215
  [4] map(::Function, ::NamedTuple{(:state, :legal_actions_mask, :action), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}}})
    @ Base ./namedtuple.jl:197
  [5] CircularArrayTrajectory(; capacity::Int64, kwargs::Base.Iterators.Pairs{Symbol, Pair{DataType, B} where B, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:state, :legal_actions_mask, :action), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}}}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:57
  [6] (CircularArraySLARTTrajectory{var"#s90"} where var"#s90"<:(NamedTuple{(:state, :legal_actions_mask, :action, :reward, :terminal), var"#s54"} where var"#s54"<:(Tuple{var"#s5", var"#s4", var"#s3", var"#s1", var"#s91"} where {var"#s5"<:CircularArrayBuffers.CircularArrayBuffer, var"#s4"<:CircularArrayBuffers.CircularArrayBuffer, var"#s3"<:CircularArrayBuffers.CircularArrayBuffer, var"#s1"<:CircularArrayBuffers.CircularArrayBuffer, var"#s91"<:CircularArrayBuffers.CircularArrayBuffer})))(; capacity::Int64, state::Pair{DataType, Tuple{Int64, Int64}}, legal_actions_mask::Pair{DataType, Vector{Bool}}, action::Pair{DataType, Tuple{Int64}}, reward::Pair{DataType, Tuple{Int64}}, terminal::Pair{DataType, Tuple{Int64}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:194
  [7] (Trajectory{var"#s269"} where var"#s269"<:(NamedTuple{(:action_log_prob, :state, :legal_actions_mask, :action, :reward, :terminal), var"#s268"} where var"#s268"<:(Tuple{var"#s267", var"#s266", var"#s262", var"#s261", var"#s260", var"#s200"} where {var"#s267"<:CircularArrayBuffers.CircularArrayBuffer, var"#s266"<:CircularArrayBuffers.CircularArrayBuffer, var"#s262"<:CircularArrayBuffers.CircularArrayBuffer, var"#s261"<:CircularArrayBuffers.CircularArrayBuffer, var"#s260"<:CircularArrayBuffers.CircularArrayBuffer, var"#s200"<:CircularArrayBuffers.CircularArrayBuffer})))(; capacity::Int64, action_log_prob::Pair{DataType, Tuple{Int64}}, kwargs::Base.Iterators.Pairs{Symbol, Pair{DataType, B} where B, NTuple{5, Symbol}, NamedTuple{(:state, :action, :reward, :terminal, :legal_actions_mask), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Vector{Bool}}}}})
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/3LTNB/src/algorithms/policy_gradient/ppo.jl:41

I guess I’m doing it wrong, but I couldn’t find any examples using MaskedPPOTrajectory.

I am using this with a custom environment, and the environment works fine with RandomPolicy and with QBasedPolicy.
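For reference, a rough sketch of the sanity check that already works (MyCustomEnv stands in for my actual environment type, which I haven’t posted here):

using ReinforcementLearning

env = MyCustomEnv()                                     # placeholder for the custom environment
hook = TotalRewardPerEpisode()
run(RandomPolicy(), env, StopAfterEpisode(10), hook)    # RandomPolicy samples from legal_action_space(env)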

I am using Julia 1.6.3.

Any help appreciated :slight_smile:

Hi @mfg ,

Could you share the code that initializes the Trajectory part?

It should be something like:

julia> trajectory = MaskedPPOTrajectory(;
                   capacity = UPDATE_FREQ,
                   state = Matrix{Float32} => (ns, N_ENV),
                   action = Vector{Int} => (N_ENV,),
                   legal_actions_mask = Vector{Bool} => (na, N_ENV),
                   action_log_prob = Vector{Float32} => (N_ENV,),
                   reward = Vector{Float32} => (N_ENV,),
                   terminal = Vector{Bool} => (N_ENV,),
               )
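The placeholders there would typically be derived from the vectorized environment, for example (just a sketch, assuming a MultiThreadEnv called env; the same names appear in the full example further down):

N_ENV = 8                               # number of parallel environments
UPDATE_FREQ = 32                        # how many steps to collect before each policy update
ns = length(state(env[1]))              # length of the state vector of one sub-environment
na = length(action_space(env[1]))       # number of actions = length of the legal action mask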

Thank you for the answer :slight_smile:

I thought I had to initialize the legal_actions_mask and had legal_actions_mask = Vector{Bool} => legal_action_space_mask(env) there :see_no_evil:

Now I’m back at the problem I had before (when using PPOTrajectory instead of MaskedPPOTrajectory): the policy selects an action which should have been masked. Is there anything special I have to do in my custom environment besides implementing legal_action_space and legal_action_space_mask?

I think only the following two extra methods need to be defined:
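Something along these lines (a sketch only — YourEnv and its mask field are placeholders for your own environment type):

RLBase.ActionStyle(::YourEnv) = FULL_ACTION_SET            # mark the env as exposing a full action set plus a mask
RLBase.legal_action_space_mask(env::YourEnv) = env.mask    # Bool vector with one entry per action in action_space(env)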

Okay, that’s what I have. I will look into it more and ask again if I have more questions :slight_smile:

Thanks for your help!

It seems as if the mask simply isn’t used. It also errors out on the first try; maybe I need some kind of initialization somewhere?

After the error occurs (I’m just including the source file at the REPL), legal_action_space(env) and legal_action_space_mask(env) show the expected values. There has to be something special with regard to PPOPolicy that I don’t see.

It would be nice to see an example using PPOPolicy with an action mask :frowning:

This should be fixed in https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/pull/533 now. (Try to update the dependency first.)

I don’t have a working environment with FULL_ACTION_SET at hand, but the following example should be enough for you to understand how to use it. (Pay special attention to the lines with the comment # !!!)

using ReinforcementLearning
using StableRNGs
using Flux
using Flux.Losses

save_dir = nothing
seed = 123

RLBase.ActionStyle(::CartPoleEnv) = FULL_ACTION_SET  # !!!
RLBase.legal_action_space_mask(::CartPoleEnv) = Bool[1, 1]  # !!!

rng = StableRNG(seed)
N_ENV = 8
UPDATE_FREQ = 32
env = MultiThreadEnv([
    CartPoleEnv(; T=Float32, rng=StableRNG(hash(seed + i))) for i in 1:N_ENV
])
ns, na = length(state(env[1])), length(action_space(env[1]))
RLBase.reset!(env; is_force=true)
agent = Agent(;
    policy=PPOPolicy(;
        approximator=cpu(
            ActorCritic(;
                actor=Chain(
                    Dense(ns, 256, relu; init=glorot_uniform(rng)),
                    Dense(256, na; init=glorot_uniform(rng)),
                ),
                critic=Chain(
                    Dense(ns, 256, relu; init=glorot_uniform(rng)),
                    Dense(256, 1; init=glorot_uniform(rng)),
                ),
                optimizer=ADAM(1e-3),
            ),
        ),
        γ=0.99f0,
        λ=0.95f0,
        clip_range=0.1f0,
        max_grad_norm=0.5f0,
        n_epochs=4,
        n_microbatches=4,
        actor_loss_weight=1.0f0,
        critic_loss_weight=0.5f0,
        entropy_loss_weight=0.001f0,
        update_freq=UPDATE_FREQ,
    ),
    trajectory=MaskedPPOTrajectory(;  # !!!
        capacity=UPDATE_FREQ,
        state=Matrix{Float32} => (ns, N_ENV),
        action=Vector{Int} => (N_ENV,),
        legal_actions_mask=Vector{Bool} => (na, N_ENV),  # !!!
        action_log_prob=Vector{Float32} => (N_ENV,),
        reward=Vector{Float32} => (N_ENV,),
        terminal=Vector{Bool} => (N_ENV,),
    ),
)

stop_condition = StopAfterStep(10_000; is_show_progress=!haskey(ENV, "CI"))
hook = TotalBatchRewardPerEpisode(N_ENV)
run(agent, env, stop_condition, hook)
Progress: 100%|████████████████████████████████████████████████████████████████████| Time: 0:00:17
             ⠀⠀⠀⠀⠀⠀⠀Avg total reward per episode⠀⠀⠀⠀⠀⠀⠀ 
             ┌────────────────────────────────────────┐ 
         200 │⠀⠀⠀⠀⠀⠀⢀⢀⡄⡀⠀⠀⡜⠀⣰⢷⡄⢇⠀⠀⡇⣀⠦⡠⢎⢢⢸⡀⣸⡷⡏⠉⠉⠉⠉⠁⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⠀⠀⡸⣾⠈⡇⠀⢠⠇⣦⢻⢸⢸⠘⢄⢸⢱⠁⡤⡀⡟⢜⣼⣷⢹⢳⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⠀⡰⠁⠁⠀⢱⠀⢸⢰⠙⡎⠸⡼⡀⠈⠁⡎⣰⠁⠱⠇⢸⢻⡏⢸⠘⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⠀⡇⠀⠀⢀⠈⡇⡎⢸⡀⡇⠀⢣⡇⠀⢠⢳⠁⠀⠀⠀⠸⡜⡇⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⢀⠇⠀⡀⣿⠀⢸⢷⠻⡇⡇⠀⢸⢇⠀⢸⡸⠀⠀⠀⠀⠀⡇⣧⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⢸⠀⢰⠙⠏⡆⠈⢸⢸⣷⠁⠀⢸⠘⠱⢪⠃⠀⠀⠀⠀⠀⠇⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⠀⢸⠀⡜⠀⢠⠘⢢⢇⣼⢸⠀⠀⠘⡄⠀⢸⠀⠀⠀⠀⠀⠀⠀⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
   Score     │⠀⠀⠀⠀⢸⡰⠁⡀⣿⠀⠀⢸⣿⠘⠀⠀⠀⡇⠀⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⠀⡤⠏⡇⠀⣷⣿⠀⢠⣸⢿⠀⠀⠀⠀⡇⣄⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⡀⡇⢸⠀⢠⢻⡟⡴⡜⠃⠀⠀⠀⠀⠀⠸⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⠀⡇⡇⡜⠀⢸⠘⠃⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⢸⢸⢣⠃⠀⡎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠀⣜⢾⠇⢀⢰⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             │⠨⠏⢨⠀⡿⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
           0 │⠈⠙⡜⣿⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
             └────────────────────────────────────────┘ 
             ⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀70⠀ 
             ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀Episode⠀⠀⠀⠀⠀⠀

Thank you for taking the time for this :slight_smile:

Is the code example supposed to work as-is? I get the following error when running it:

ERROR: LoadError: BoundsError: attempt to access 8-element BitVector at index [1, 2]
Stacktrace:
 [1] throw_boundserror(A::BitVector, I::Tuple{Int64, Int64})
   @ Base ./abstractarray.jl:651
 [2] checkbounds
   @ ./abstractarray.jl:616 [inlined]
 [3] _setindex!
   @ ./abstractarray.jl:1289 [inlined]
 [4] setindex!
   @ ./abstractarray.jl:1267 [inlined]
 [5] MultiThreadEnv(envs::Vector{CartPoleEnv{Float32, StableRNGs.LehmerRNG}})
   @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/3LTNB/src/algorithms/policy_gradient/multi_thread_env.jl:73
 [6] top-level scope
   @ ~/Documents/SplinterBot/test.jl:15
 [7] include(fname::String)
   @ Base.MainInclude ./client.jl:444
 [8] top-level scope
   @ REPL[1]:1

My own environment hits a similar error: for whatever reason m_batch ends up being a BitVector of size 1x1 (expected 96x1), which obviously results in a BoundsError. Running this in the REPL does not give an error. Hints would be appreciated, but I will experiment more with it later when I have more time.

lol, never mind. I just discovered Debugger.jl (I’m also new to Julia) and saw that even though I did ] update in the REPL, the code being executed does not contain the changes you made … I will fix that and try again.

So I had to delete my ~/.julia folder and re-add the package. Now your example succeeds and I’m getting another error (which is further down the line, so this is progress, I guess :smiley:).

The legal_action_space_mask function of MultiThreadEnv yields the following error:

ERROR: LoadError: TaskFailedException

    nested task error: DimensionMismatch("array could not be broadcast to match destination")
    Stacktrace:
     [1] check_broadcast_shape
       @ ./broadcast.jl:520 [inlined]
     [2] check_broadcast_axes
       @ ./broadcast.jl:523 [inlined]
     [3] instantiate
       @ ./broadcast.jl:269 [inlined]
     [4] materialize!(#unused#::Base.Broadcast.DefaultArrayStyle{1}, dest::SubArray{Bool, 1, BitMatrix, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}}, true}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(identity), Tuple{Vector{Bool}}})
       @ Base.Broadcast ./broadcast.jl:894
     [5] materialize!
       @ ./broadcast.jl:891 [inlined]
     [6] (::ReinforcementLearningZoo.var"#133#134"{MultiThreadEnv{SplinterlandsEnv, Vector{Vector{Int64}}, Vector{Int64}, Space{Vector{Base.OneTo{Int64}}}, Space{Vector{Vector{Int64}}}, BitMatrix}, Int64, Int64})()
       @ ReinforcementLearningZoo ./threadingconstructs.jl:169

...and 7 more exceptions.

Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base ./task.jl:369
  [2] macro expansion
    @ ./task.jl:388 [inlined]
  [3] legal_action_space_mask(env::MultiThreadEnv{SplinterlandsEnv, Vector{Vector{Int64}}, Vector{Int64}, Space{Vector{Base.OneTo{Int64}}}, Space{Vector{Vector{Int64}}}, BitMatrix})
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/0eUX3/src/algorithms/policy_gradient/multi_thread_env.jl:135
  [4] prob
    @ ~/.julia/packages/ReinforcementLearningZoo/0eUX3/src/algorithms/policy_gradient/ppo.jl:174 [inlined]
  [5] (::Agent{PPOPolicy{ActorCritic{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, ADAM}, Distributions.Categorical{P, Ps} where {P<:Real, Ps<:AbstractVector{P}}, Random._GLOBAL_RNG}, Trajectory{NamedTuple{(:action_log_prob, :state, :legal_actions_mask, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 3}, CircularArrayBuffers.CircularArrayBuffer{Bool, 3}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Bool, 2}}}}})(env::MultiThreadEnv{SplinterlandsEnv, Vector{Vector{Int64}}, Vector{Int64}, Space{Vector{Base.OneTo{Int64}}}, Space{Vector{Vector{Int64}}}, BitMatrix})
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/0eUX3/src/algorithms/policy_gradient/ppo.jl:189
  [6] _run(policy::Agent{PPOPolicy{ActorCritic{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, ADAM}, Distributions.Categorical{P, Ps} where {P<:Real, Ps<:AbstractVector{P}}, Random._GLOBAL_RNG}, Trajectory{NamedTuple{(:action_log_prob, :state, :legal_actions_mask, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 3}, CircularArrayBuffers.CircularArrayBuffer{Bool, 3}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Bool, 2}}}}}, env::MultiThreadEnv{SplinterlandsEnv, Vector{Vector{Int64}}, Vector{Int64}, Space{Vector{Base.OneTo{Int64}}}, Space{Vector{Vector{Int64}}}, BitMatrix}, stop_condition::StopAfterEpisode{ProgressMeter.Progress}, hook::TotalBatchRewardPerEpisode)
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/0eUX3/src/algorithms/policy_gradient/run.jl:17
  [7] run(policy::Agent{PPOPolicy{ActorCritic{Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, Chain{Tuple{Dense{typeof(relu), Matrix{Float32}, Vector{Float32}}, Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, ADAM}, Distributions.Categorical{P, Ps} where {P<:Real, Ps<:AbstractVector{P}}, Random._GLOBAL_RNG}, Trajectory{NamedTuple{(:action_log_prob, :state, :legal_actions_mask, :action, :reward, :terminal), Tuple{CircularArrayBuffers.CircularArrayBuffer{Float32, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 3}, CircularArrayBuffers.CircularArrayBuffer{Bool, 3}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Int64, 2}, CircularArrayBuffers.CircularArrayBuffer{Bool, 2}}}}}, env::MultiThreadEnv{SplinterlandsEnv, Vector{Vector{Int64}}, Vector{Int64}, Space{Vector{Base.OneTo{Int64}}}, Space{Vector{Vector{Int64}}}, BitMatrix}, stop_condition::StopAfterEpisode{ProgressMeter.Progress}, hook::TotalBatchRewardPerEpisode)
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/core/run.jl:10

This happens because:

julia> selectdim(env.legal_action_space_mask, 1, 1)
8-element view(::BitMatrix, 1, :) with eltype Bool:
 0
 0
 0
 0
 0
 0
 0
 0

julia> legal_action_space_mask(env[1])
96-element Vector{Bool}:
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 1
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 0

I’m not sure what exactly this function is supposed to do. It seems to me that its purpose is to store the actual legal_action_space_mask of each sub-environment inside env.legal_action_space_mask of the MultiThreadEnv, but (assuming that is indeed what the function is for) env.legal_action_space_mask already has the correct form without calling that function.

[I should definitely read up on how to use local packages and try things out myself before asking and taking someone else’s time.]

It seems like a misconfiguration of the MaskedPPOTrajectory. Your environment has an action space of length 96, but the allocated legal_action_space_mask is only of length 8. So I guess you may need to set na to 96 in the trajectory definition from my earlier reply (#2)?


You don't need to remove the whole `~/.julia` folder. Just enter pkg mode in the Julia REPL with `]` and then run `up` to update the dependencies.
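For reference, a sketch of such a pkg-mode session (prompt shown for a Julia 1.6 default environment):

# press ] at the julia> prompt to enter pkg mode
(@v1.6) pkg> up                            # update all registered dependencies
(@v1.6) pkg> st ReinforcementLearning      # verify which version is actually installed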

Well, I use ns, na = length(state(env[1])), length(action_space(env[1])), and running it in the REPL yields:

julia> ns, na = length(state(env[1])), length(action_space(env[1]))
(1, 96)

So this should be good.

I am using 8 (identical) environments in the MultiThreadEnv, and the allocated legal_action_space_mask is of length 8. When I call selectdim(env.legal_action_space_mask, 1, 6) I see the expected value of [1, 1, 1, 1, 1, 1, 1, 1], because the sixth element of legal_action_space_mask is 1 in every one of these envs. It seems that selectdim(env.legal_action_space_mask, 2, X) should be used instead, as this actually yields the correct legal_action_space_mask. I suspect the code should use N = ndims(env.states[1]), or, if the envs are allowed to have different mask lengths, this should be moved into the loop and become N = ndims(env.states[i])?
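To illustrate the dimension mix-up with a toy mask (3 actions, 2 environments — made-up sizes, not my real ones):

M = Bool[1 0; 0 1; 1 1]       # a (na, N_ENV) mask matrix, one column per environment
selectdim(M, 1, 1)            # 2-element view: the first *action* across all envs (what the code does now)
selectdim(M, 2, 1)            # 3-element view: the full mask of the first *env* (what I would expect)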

As for the update: if up is shorthand for update, that is exactly what I did, but Debugger.jl showed me that the new code wasn’t being used. To me this looks like a caching issue or something similar (deleting ~/.julia/ is a drastic measure, I know :smiley:). The same happens when I want to change the ReinforcementLearning source: I did ] dev ReinforcementLearning, and while my changes do lead to Julia printing an info message that ReinforcementLearning is being precompiled, when I run the code I can see in Debugger.jl that my changes are not there.

It would be easier for me to debug if you could provide the definition of your environment.

If that is the case, then the legal_action_space_mask of the MultiThreadEnv should be of size (96, 8); the last dimension is the number of environments.

That’s kind of strange. If the instructions in Tips for Developers · ReinforcementLearning.jl don’t work for you, could you create an issue and describe what you have tried in detail?

Sure, here is the code and here is the JSON used. The (rather primitive) goal is to learn not to exceed the mana cap :smiley:

That sounds like an interesting game, though I do not fully understand the rules yet :laughing:

So if I understand it correctly, the state of the game is just a scalar, and you use a vector of length 1 to store it. There are at least two problems here:

  1. The state and the state_space do not align:

julia> state(env[1]) in state_space(env[1])
false

  2. Deep RL algorithms like the PPO you use here usually expect the state of the environment to be a tensor, so you may consider using an embedding to represent each state, as sketched below.
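To illustrate the embedding idea, Flux.onehot turns a scalar state into a Bool vector (toy values here; the diff below uses env.state[1] and 1:DoneState instead):

using Flux
v = Flux.onehot(3, 1:5)       # one-hot encoding of the value 3 over the range 1:5
collect(v)                    # Bool[0, 0, 1, 0, 0]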

Following are some suggested changes to your environment:

 RLBase.action_space(::SplinterlandsEnv) = Base.OneTo(length(Cards) + 1)
-RLBase.state_space(::SplinterlandsEnv) = [i for i in 1:DoneState]
+RLBase.state_space(::SplinterlandsEnv) = Space([(false, true) for i in 1:DoneState ])
 
-RLBase.state(env::SplinterlandsEnv) = env.state
+RLBase.state(env::SplinterlandsEnv) = Flux.onehot(env.state[1], 1:DoneState)
 RLBase.reward(env::SplinterlandsEnv) = env.usedMana > env.manaCap ? -10 : 1

 function RLBase.legal_action_space(env::SplinterlandsEnv)
@@ -101,6 +101,8 @@ function RLBase.legal_action_space(env::SplinterlandsEnv)
             return White
         elseif env.deck[1] in sBlack
             return Black
+        else
+            []
         end
     else
         if env.deck[1] in sRed
@@ -113,6 +115,8 @@ function RLBase.legal_action_space(env::SplinterlandsEnv)
             return setdiff(cWhite, env.deck)
         elseif env.deck[1] in sBlack
             return setdiff(cBlack, env.deck)
+        else
+            []
         end
     end
 end

-stop_condition = StopAfterEpisode(8)
-# run(agent, env, stop_condition, hook)
\ No newline at end of file
+stop_condition = StopAfterStep(1000)   # You can't use StopAfterEpisode in a MultiThreadEnv
+run(agent, env, stop_condition, hook)

I also found that the DoneState is not
