Hello, let it be known that I know little about Julia and even less about Reinforcement Learning.
I’ve written an environment so trivial it is meaningless, just to understand how everything fits together. It’s a four-player game with players named 1, 2, 3, 4. Each episode starts at player 1 and goes through player 4. Each action simply sets env.last_action to the player who just acted. The reward is whatever is in env.last_action. is_terminated evaluates to true after the fourth player has played. Note that it’s a TERMINAL_REWARD game, and since the game is meaningless there is no need to understand its logic. Both state and action exist but are worthless Int64s. The full code is at the bottom of this message.
I then throw four agents at the game.
So, questions:
- While running this for the first time I got the MethodError below, after all four players had played. I’m guessing that when is_terminated fires, the approximator is called with (Int, NoOp) arguments instead of (Int, Int) - probably because there is no real action for the step after the episode ends (so it’s a NoOp)? Is this how it’s supposed to work, or am I already wrong at this point?
Note that to make it work I provided a (::Int64, ::NoOp) method that just replaces the NoOp with a different Int64, on the assumption that it needs an Int64 that maps to “end of game” or something like that.
ERROR: LoadError: MethodError: no method matching (::TabularQApproximator{Matrix{Float64}, Flux.Optimise.InvDecay})(::Int64, ::NoOp)
Closest candidates are:
(::TabularQApproximator{T, O} where {T<:(AbstractArray{Float64, N} where N), O})(::Int64) at /home/thae/julia/packages/ReinforcementLearningCore/B6qPK/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:32
(::TabularQApproximator{T, O} where {T<:(AbstractArray{Float64, N} where N), O})(::Int64, ::Int64) at /home/thae/julia/packages/ReinforcementLearningCore/B6qPK/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:33
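For reference, the workaround I added is the single method below (it is also in the full listing at the bottom); it simply treats the NoOp as if it were action 3:

(app::TabularQApproximator)(s::Int, a::NoOp) = app.table[3, s]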
- After adding that fix (I assume it’s a hack) for the issue above (it is reproduced in the code below), I ran the code. Note that there are some printlns; here is the output they produce:
Action for player 1
reward(2)
reward(2)
reward(2)
reward(2)
Action for player 2
reward(3)
reward(3)
reward(3)
reward(3)
Action for player 3
reward(4)
reward(4)
reward(4)
reward(4)
Action for player 4
reward(1)
reward(1)
reward(1)
reward(1)
I assume this means that each of the four agents polls for the reward of the given player after each action. However:
- Why does it poll for rewards after each action if I am using TERMINAL_REWARD? Shouldn’t it just peek at the rewards once, at the end of the episode?
- The number in parentheses is the player whose reward is being polled. Notice that after player 1’s action it is player 2’s reward that gets polled. Why is that? I’m guessing the problem is that at the end of the action I move the current player to the next one, so reward() gets called with the future player. How am I supposed to handle incrementing the player correctly?
- In this code there is no meaningful reward logic, but I want to assign a separate reward to each player at the end of the episode - I’d expect to see calls to reward() for each of the four players after the last step, not only for a single player. Do I misunderstand how TERMINAL_REWARD works? (There is a rough sketch of what I mean right after this list.)
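To make that last point concrete, here is a rough, untested sketch of what I have in mind (FooEnv2 and the payoff values are made up purely for illustration): keep one reward slot per player on the environment, fill the slots in on the terminal step, and let reward(env, player) read from them instead of from last_action. It also assigns the rewards before advancing current_player, which is how I imagined handling the second bullet.

using ReinforcementLearning

# Sketch only - FooEnv2 and the payoffs below are placeholders.
mutable struct FooEnv2 <: AbstractEnv
    last_action::Int
    current_player::Int
    rewards::Vector{Float64}   # one entry per player
end

FooEnv2() = FooEnv2(0, 1, zeros(4))

function (env::FooEnv2)(action::Int, player)
    env.last_action = player
    if player == 4
        # terminal step: give each player their own (placeholder) payoff
        env.rewards .= [1.0, 2.0, 3.0, 4.0]
    end
    # only advance to the next player after the rewards are in place
    env.current_player = player == 4 ? 1 : player + 1
end

# the reward is looked up per player, independent of whose turn is next
RLBase.reward(env::FooEnv2, player) = env.rewards[player]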
Thank you!
using ReinforcementLearning
mutable struct FooEnv <: AbstractEnv
    last_action::Int
    current_player::Int
end
# reset! must mutate env in place; returning a new FooEnv would leave env unchanged
function RLBase.reset!(env::FooEnv)
    env.last_action = 0
    env.current_player = 1
end
RLBase.players(env::FooEnv) = collect(1:4)
RLBase.current_player(env::FooEnv) = env.current_player
RLBase.action_space(env::FooEnv, p) = [1,2,3]
RLBase.state_space(env::FooEnv) = Base.OneTo(2)
RLBase.state(env::FooEnv) = 1
# Workaround for the MethodError above: treat the end-of-episode NoOp as action 3
(app::TabularQApproximator)(s::Int, a::NoOp) = app.table[3, s]
function (env::FooEnv)(action::Int, player)
    println("Action for player $player")
    env.last_action = player
    env.current_player += 1
    if env.current_player > 4
        env.current_player = 1
    end
end
RLBase.is_terminated(env::FooEnv) = env.last_action == 4
function RLBase.reward(env::FooEnv, player)
    println("reward($player)")
    env.last_action
end
RLBase.NumAgentStyle(::FooEnv) = MultiAgent(4)
RLBase.DynamicStyle(::FooEnv) = SEQUENTIAL
RLBase.StateStyle(::FooEnv) = Observation{Int}()
RLBase.RewardStyle(::FooEnv) = TERMINAL_REWARD
RLBase.UtilityStyle(::FooEnv) = GENERAL_SUM
RLBase.ChanceStyle(::FooEnv) = STOCHASTIC
env = FooEnv(0, 1)
#RLBase.test_runnable!(env)
function build_agent()
    approximator = TabularQApproximator(;
        n_state = length(state_space(env)),
        n_action = length(action_space(env)),
    )
    policy = QBasedPolicy(
        learner = MonteCarloLearner(;
            approximator = approximator,
            kind = EVERY_VISIT,
        ),
        explorer = EpsilonGreedyExplorer(0.01, warmup_steps = 10000, step = 1),
    )
    agent = Agent(
        policy = policy,
        trajectory = VectorSARTTrajectory(; state = Int, action = Union{Int64,NoOp}, reward = Int, terminal = Bool),
    )
    agent
end
multiagent = MultiAgentManager(
    NamedPolicy(1 => build_agent()),
    NamedPolicy(2 => build_agent()),
    NamedPolicy(3 => build_agent()),
    NamedPolicy(4 => build_agent()),
)
run(multiagent, env, StopAfterEpisode(1))