Understanding MultiAgent in ReinforcementLearning.jl

Hello, let it be known that I know little about Julia and even less about Reinforcement Learning.

I’ve written an environment so trivial it is meaningless, just to understand how everything fits together. It’s a four-player game for players named 1, 2, 3, 4. Each episode starts at player 1 and goes through player 4. Each action simply updates env.last_action to the last player. The reward is whatever is in env.last_action (which stores the last player’s number). is_terminated evaluates to true after the fourth player has played. Notice it’s a TERMINAL_REWARD game, and it’s a meaningless game, so there’s no need to understand the logic. The full code is at the bottom of this message. Both state and action exist but are worthless Int64s.

I then throw four agents at the game.

So, questions:

  1. While running this for the first time I got a MethodError like the one below after all four players had played. I’m guessing that when is_terminated happens, the approximator is called with (Int, NoOp) parameters instead of (Int, Int), probably because the future state does not exist (and is a “NoOp”)? Is this how it’s supposed to work, or am I already wrong at this point?

Notice that to make it work I provided a (::Int64, ::NoOp) version of this method that just replaces the NoOp with a different Int64, assuming it needs an Int64 that maps to “end of game” or something like that (see the snippet right after the error message).

ERROR: LoadError: MethodError: no method matching (::TabularQApproximator{Matrix{Float64}, Flux.Optimise.InvDecay})(::Int64, ::NoOp)
Closest candidates are:
  (::TabularQApproximator{T, O} where {T<:(AbstractArray{Float64, N} where N), O})(::Int64) at /home/thae/julia/packages/ReinforcementLearningCore/B6qPK/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:32
  (::TabularQApproximator{T, O} where {T<:(AbstractArray{Float64, N} where N), O})(::Int64, ::Int64) at /home/thae/julia/packages/ReinforcementLearningCore/B6qPK/src/policies/q_based_policies/learners/approximators/tabular_approximator.jl:33
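
For reference, here is the method I added (it also appears in the full listing at the bottom):

# My workaround: when the "action" is a NoOp, just read an arbitrary row of the
# Q-table, pretending NoOp maps to some "end of game" action index.
(app::TabularQApproximator)(s::Int, a::NoOp) = app.table[3, s]
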
  2. After adding what I assume is a hack to fix the issue above (reproduced in the full code below), I ran the code. Notice that there are some printlns. Here’s their output:
Action for player 1
reward(2)
reward(2)
reward(2)
reward(2)
Action for player 2
reward(3)
reward(3)
reward(3)
reward(3)
Action for player 3
reward(4)
reward(4)
reward(4)
reward(4)
Action for player 4
reward(1)
reward(1)
reward(1)
reward(1)

I assume this means that each of the four agents is polling for the reward of the given player after each action. However:

  • Why does it poll for rewards after each action if I am using TERMINAL_REWARD? Shouldn’t it just peek into the rewards once at the end of the episode?

  • The number there is the player whose reward is being polled. Notice that after player 1’s action the reward that is being polled is player 2’s. Why is this so? I’m guessing the problem is that at the end of the action I am moving the current player to the next one, so reward() gets called with the future player. How am I supposed to handle incrementing the player correctly?

  • In this code there’s no meaningful reward logic but I want to assign a separate reward for each player at the end of the episode - I’d expect to see calls to reward() for each of the four players after the last step, not only for a single player. Do I misunderstand how TERMINAL_REWARD works?

Thank you!

using ReinforcementLearning

mutable struct FooEnv <: AbstractEnv
  last_action::Int
  current_player::Int
end

function RLBase.reset!(env::FooEnv)
  # mutate the env in place; returning a new FooEnv here would leave `env` untouched
  env.last_action = 0
  env.current_player = 1
end

RLBase.players(env::FooEnv) = collect(1:4)
RLBase.current_player(env::FooEnv) = env.current_player

RLBase.action_space(env::FooEnv, p) = [1,2,3]

RLBase.state_space(env::FooEnv) = Base.OneTo(2)
RLBase.state(env::FooEnv) = 1
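
# Workaround for question 1: handle the NoOp that MultiAgentManager supplies in place of a real action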
(app::TabularQApproximator)(s::Int, a::NoOp) = app.table[3, s]

function (env::FooEnv)(action::Int, player)
  println("Action for player $player")
  env.last_action = player
  env.current_player += 1
  if env.current_player > 4
    env.current_player = 1
  end
end

RLBase.is_terminated(env::FooEnv) = env.last_action == 4

function RLBase.reward(env::FooEnv, player)
  println("reward($player)")
  env.last_action
end

RLBase.NumAgentStyle(::FooEnv) = MultiAgent(4)
RLBase.DynamicStyle(::FooEnv) = SEQUENTIAL
RLBase.StateStyle(::FooEnv) = Observation{Int}()
RLBase.RewardStyle(::FooEnv) = TERMINAL_REWARD
RLBase.UtilityStyle(::FooEnv) = GENERAL_SUM
RLBase.ChanceStyle(::FooEnv) = STOCHASTIC

env = FooEnv(0, 1)

#RLBase.test_runnable!(env)

function build_agent()
  approximator = TabularQApproximator(
                         ;n_state = length(state_space(env)),
                         n_action = length(action_space(env)),
                     )
  policy = QBasedPolicy(
             learner = MonteCarloLearner(;
                     approximator=approximator,
                     kind=EVERY_VISIT,
                 ),
             explorer = EpsilonGreedyExplorer(0.01,  warmup_steps=10000, step=1)
         )

  agent = Agent(
    policy = policy,
    trajectory = VectorSARTTrajectory(; state=Int, action=Union{Int64,NoOp}, reward=Int, terminal=Bool)
  )
  agent
end

multiagent = MultiAgentManager(
    NamedPolicy(1 => build_agent()),
    NamedPolicy(2 => build_agent()),
    NamedPolicy(3 => build_agent()),
    NamedPolicy(4 => build_agent()),
)


run(multiagent, env, StopAfterEpisode(1))

Hello @thae ,

These are all good questions. The sad truth is that MARL-related algorithms are not well explored in RL.jl. I’ll explain MultiAgentManager a little bit first.

The most naive way to solve MARL problems is to transform them into single-agent RL problems so that we can reuse many existing solutions (though doing so is problematic in many cases). That’s the basic idea behind MultiAgentManager in RL.jl, and it relies on several assumptions:

  • All the players act in turn; only one player takes an action at each step.
  • After each step, all the players can observe the env independently, including state and reward.
  • For players who are not the current player, an action of NO_OP is assumed to have been taken.
  • The typical trajectory in a two-player environment is like this:
S1_p1, A1_p1, R1_p1, T1, S2_p1, NO_OP, R2_p1, T2, ..., St_p1, At_p1, Rt_p1, Tt
S1_p2, NO_OP, R1_p2, T1, S2_p2, A2_p2, R2_p2, T2, ..., St_p2, At_p2, Rt_p2, Tt
S -> State
A -> Action
R -> Reward
T -> Terminal
t -> time step
p1 -> player 1
p2 -> player 2

If any of the above assumptions doesn’t match what you expect, then you’d better write your own multi-agent manager.
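
To make the trajectory bookkeeping above concrete, here is a tiny self-contained toy written just for illustration (it is not the actual MultiAgentManager code): every player records an entry at every step, and the non-active players record NO_OP as their action.

function toy_multi_agent_episode(n_players::Int, n_steps::Int)
  trajectories = Dict(p => NamedTuple[] for p in 1:n_players)
  for t in 1:n_steps
    active = mod1(t, n_players)              # players act in turn: 1, 2, ..., n, 1, ...
    for p in 1:n_players
      a = (p == active) ? rand(1:3) : :NO_OP # only the active player picks a real action
      push!(trajectories[p], (state = t, action = a, reward = 0))
    end
  end
  return trajectories
end

toy_multi_agent_episode(2, 4)  # compare the result with the two-player trajectory above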

Now let me answer your questions one by one.

As I illustrated above, that NoOp is the ACTION taken by the non-active players (in the view of MultiAgentManager). So you can simply set the element type of actions in the trajectory to Union{Int, NoOp}. (You did that right!)
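
That is, exactly the element type you already declared in your trajectory:

trajectory = VectorSARTTrajectory(; state=Int, action=Union{Int64,NoOp}, reward=Int, terminal=Bool)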

Yes, it assumes the env is of StepReward style and does not treat TerminalReward style envs specially. (We should definitely support that!)

You should implement and call reward(env, player) to fetch the reward of a specific player. Otherwise, by default, reward(env) only returns the reward of the current player.
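
For your FooEnv, a minimal sketch could look like the following; the rewards field is a hypothetical addition (a vector the env would fill in once the last player has acted), not something in your current struct:

# Sketch only: assumes a hypothetical `rewards::Vector{Float64}` field was added
# to FooEnv and is filled in when the episode terminates.
function RLBase.reward(env::FooEnv, player)
  is_terminated(env) ? env.rewards[player] : 0.0
end

With that in place, each of the four agents can fetch its own reward at the end of the episode instead of all of them reading the current player's value.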

This has nothing to do with TERMINAL_REWARD style environments. I think my answer above already addresses this question?

If this is quite counterintuitive, please file an issue :grinning: at Issues · JuliaReinforcementLearning/ReinforcementLearning.jl · GitHub


Thanks a lot! That answers most or all of my doubts. The typical two-player trajectory example was particularly useful.