To use action masking with the PPOPolicy, I think I have to use a MaskedPPOTrajectory. The legal_actions_mask gets updated in the update! function with the result of legal_action_space_mask(env), which, as far as I understand, should always be an AbstractArray of Bool with the same length as the action space. However, I get the following error:
```
LoadError: MethodError: no method matching CircularArrayBuffers.CircularArrayBuffer(::Array{Bool, 97}, ::Int64, ::Int64, ::Bool)
Closest candidates are:
  CircularArrayBuffers.CircularArrayBuffer(::Array{T, N}, ::Int64, ::Int64, ::Int64) where {T, N} at /home/mfg/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:12
  CircularArrayBuffers.CircularArrayBuffer(::AbstractArray{T, N}) where {T, N} at /home/mfg/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:27
Stacktrace:
  [1] (CircularArrayBuffers.CircularArrayBuffer{Bool, N} where N)(::Bool, ::Vararg{Integer, N} where N)
    @ CircularArrayBuffers ~/.julia/packages/CircularArrayBuffers/hxJJh/src/CircularArrayBuffers.jl:24
  [2] (::ReinforcementLearningCore.var"#51#52"{Int64})(x::Pair{DataType, Vector{Bool}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:58
  [3] map(f::ReinforcementLearningCore.var"#51#52"{Int64}, t::Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}})
    @ Base ./tuple.jl:215
  [4] map(::Function, ::NamedTuple{(:state, :legal_actions_mask, :action), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}}})
    @ Base ./namedtuple.jl:197
  [5] CircularArrayTrajectory(; capacity::Int64, kwargs::Base.Iterators.Pairs{Symbol, Pair{DataType, B} where B, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:state, :legal_actions_mask, :action), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Vector{Bool}}, Pair{DataType, Tuple{Int64}}}}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:57
  [6] (CircularArraySLARTTrajectory{var"#s90"} where var"#s90"<:(NamedTuple{(:state, :legal_actions_mask, :action, :reward, :terminal), var"#s54"} where var"#s54"<:(Tuple{var"#s5", var"#s4", var"#s3", var"#s1", var"#s91"} where {var"#s5"<:CircularArrayBuffers.CircularArrayBuffer, var"#s4"<:CircularArrayBuffers.CircularArrayBuffer, var"#s3"<:CircularArrayBuffers.CircularArrayBuffer, var"#s1"<:CircularArrayBuffers.CircularArrayBuffer, var"#s91"<:CircularArrayBuffers.CircularArrayBuffer})))(; capacity::Int64, state::Pair{DataType, Tuple{Int64, Int64}}, legal_actions_mask::Pair{DataType, Vector{Bool}}, action::Pair{DataType, Tuple{Int64}}, reward::Pair{DataType, Tuple{Int64}}, terminal::Pair{DataType, Tuple{Int64}})
    @ ReinforcementLearningCore ~/.julia/packages/ReinforcementLearningCore/Gdaf7/src/policies/agents/trajectories/trajectory.jl:194
  [7] (Trajectory{var"#s269"} where var"#s269"<:(NamedTuple{(:action_log_prob, :state, :legal_actions_mask, :action, :reward, :terminal), var"#s268"} where var"#s268"<:(Tuple{var"#s267", var"#s266", var"#s262", var"#s261", var"#s260", var"#s200"} where {var"#s267"<:CircularArrayBuffers.CircularArrayBuffer, var"#s266"<:CircularArrayBuffers.CircularArrayBuffer, var"#s262"<:CircularArrayBuffers.CircularArrayBuffer, var"#s261"<:CircularArrayBuffers.CircularArrayBuffer, var"#s260"<:CircularArrayBuffers.CircularArrayBuffer, var"#s200"<:CircularArrayBuffers.CircularArrayBuffer})))(; capacity::Int64, action_log_prob::Pair{DataType, Tuple{Int64}}, kwargs::Base.Iterators.Pairs{Symbol, Pair{DataType, B} where B, NTuple{5, Symbol}, NamedTuple{(:state, :action, :reward, :terminal, :legal_actions_mask), Tuple{Pair{DataType, Tuple{Int64, Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Tuple{Int64}}, Pair{DataType, Vector{Bool}}}}})
    @ ReinforcementLearningZoo ~/.julia/packages/ReinforcementLearningZoo/3LTNB/src/algorithms/policy_gradient/ppo.jl:41
```
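For reference, this is roughly how I construct the trajectory (simplified; `ns` and `n_env` are placeholders for my state size and number of environments, and judging by the `Array{Bool, 97}` in the error my mask has 97 entries):

```julia
trajectory = MaskedPPOTrajectory(;
    capacity = 32,
    state = Matrix{Float32} => (ns, n_env),
    # I pass the current mask values here, which is what seems to blow up:
    legal_actions_mask = Vector{Bool} => legal_action_space_mask(env),
    action = Vector{Int} => (n_env,),
    action_log_prob = Vector{Float32} => (n_env,),
    reward = Vector{Float32} => (n_env,),
    terminal = Vector{Bool} => (n_env,),
)
```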
I guess I'm doing something wrong, but I couldn't find any examples that use MaskedPPOTrajectory.
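Looking at trajectory.jl:57-58 in the stacktrace, it seems the second element of each pair is a size tuple that gets splatted into a CircularArrayBuffer constructor, so I now suspect the mask entry has to be the *shape* of the mask rather than the mask itself. Something like this (untested; 97 is my action-space size):

```julia
# Suspected fix: give the shape of the mask, not its current values.
# Presumably (97, n_env) when running multiple environments.
legal_actions_mask = Vector{Bool} => (97,)
```

Is that the intended usage?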
I am using this with a custom environment, and the same environment works fine with RandomPolicy and with QBasedPolicy.
I am on Julia 1.6.3.
Any help is appreciated!