For POMDPs.jl I have a large state space (more than 2,000 possible states). From certain states I cannot take some actions; that is, the legal actions are state-dependent.
I have checked the documentation (link here). For TabularTDLearning, it seems I cannot use a function to determine the possible actions for a state, so instead I gave a very bad reward to impossible actions and made the next state equal to the previous state. But since the state space is quite big, the solver doesn't find the optimal policy.
Do you think it is the right way?
Bump, and I will ping @zsunberg for you.
Current implementations in TabularTDLearning don’t support action masking, and a PR would be very welcome.
ReinforcementLearning.jl probably has better support there. POMDPs.jl (and other solvers in the ecosystem) do support state-dependent legal actions; you just need to define `actions(mdp, s)` for your problem to return the legal action set for each state.
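For example, a minimal sketch of that pattern (the `GridMDP` type, its state/action spaces, and the legality rule are made up for illustration; only the `POMDPs.actions` methods are the actual interface):

```julia
using POMDPs

# Hypothetical problem type, purely for illustration.
struct GridMDP <: MDP{Int, Symbol} end

# Full action space, used by solvers that enumerate all actions.
POMDPs.actions(::GridMDP) = (:up, :down, :left, :right)

# State-dependent legal actions: return only the actions allowed in state s.
# Toy rule: suppose :left is illegal in odd-numbered states.
POMDPs.actions(::GridMDP, s::Int) =
    isodd(s) ? (:up, :down, :right) : (:up, :down, :left, :right)
```

Solvers that respect state-dependent actions will query the two-argument method instead of the full action space.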
Thank you for your answer, but let me ask you something more specific about my problem.
I have a huge state space and action space. From most states the agent cannot take most of the actions, so when the agent takes an ‘‘impossible action’’ it receives a very bad reward and goes back to the beginning. In very simple problems it finds the solution, but in others it does not.
I tried to create a function that, given a state, returns all the possible actions, and then apply the Q-learning solver.
Is it possible?
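To make it concrete, the shape of what I tried looks roughly like this (the sizes, the `possible_actions` rule, and the helper names are just placeholders from my own code, not from any solver):

```julia
# Toy Q-table: 4 states x 3 actions, indexed as Q[state, action].
Q = zeros(4, 3)

# Function mapping a state to the indices of its legal actions.
possible_actions(s::Int) = isodd(s) ? [1, 3] : [1, 2, 3]

# Greedy action selection restricted to the legal actions of state s.
greedy_action(Q, s) = argmax(a -> Q[s, a], possible_actions(s))
```

Is there a way to plug something like `possible_actions` into the Q-learning solver directly?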