Help I am new: Is this MDP right?

Hi, first a comment on formatting code: you can enclose your complete MWE in triple backticks and it will show as code on Discourse.
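For example, typing

```julia
struct MyMDP <: MDP{Int,Int} end
```

into the editor (with the backtick lines included literally) renders the enclosed lines as a syntax-highlighted code block.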

That said, your own definition

struct MyMDP <: MDP{Int,Int} end

indeed does not contain a member indices, which these functions

POMDPs.stateindex(m::MyMDP, s) = m.indices[s]
POMDPs.actionindex(m::MyMDP, a) = m.indices[a]

try to access. How did you get the idea this could work?
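(By the way, if your states and actions really are just the integers 0 to 3, you may not need a dictionary at all. A minimal sketch, assuming exactly those values, could compute the 1-based indices directly:

# Minimal sketch, assuming states and actions are exactly 0, 1, 2, 3:
POMDPs.stateindex(m::MyMDP, s) = s + 1   # 0,1,2,3 → 1,2,3,4
POMDPs.actionindex(m::MyMDP, a) = a + 1

POMDPs.jl only requires that these return an integer index between 1 and the size of the state or action space.)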

Hello,

First of all, thank you so much for your quick response, and sorry about the formatting.

I tried to follow the documentation. How can I define the member indices?

Thanks!

Thanks. Your example looks a bit like this one. I tried

struct MyMDP <: MDP{Int,Int}
    indices::Dict{Int, Int}
    MyMDP() = new(Dict{Int, Int}())
end

but it looks to me like you have to fill the index dictionary with some state/action-to-index pairs, like

Dict{Int, Int}(0=>? ...)

?
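Maybe something along these lines, mapping each of the four states (and actions) to a 1-based index, is what is intended (just a guess):

struct MyMDP <: MDP{Int,Int}
    indices::Dict{Int, Int}
    # map states/actions 0,1,2,3 to 1-based indices 1,2,3,4
    MyMDP() = new(Dict{Int, Int}(0 => 1, 1 => 2, 2 => 3, 3 => 4))
end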

Hi @goerch,

I’ve tried what you said:

using POMDPs
using POMDPModelTools: Deterministic, Uniform, SparseCat
using TabularTDLearning
using POMDPPolicies
using POMDPModels
struct MyMDP <: MDP{Int,Int}
    indices::Dict{Int, Int}
    MyMDP() = new(Dict(0 => 1, 1 => 2, 2 => 2, 3 => 4))
end

mdp = MyMDP()

POMDPs.actions(m::MyMDP) = [0, 1, 2, 3]
POMDPs.states(m::MyMDP) = [0, 1, 2, 3]
POMDPs.discount(m::MyMDP) = 0.95
POMDPs.stateindex(m::MyMDP, s) = m.indices[s]
POMDPs.actionindex(m::MyMDP, a) = m.indices[a]
POMDPs.initialstate(m::MyMDP) = Uniform([0, 1, 2, 3])

function POMDPs.transition(m::MyMDP, s, a)
    if s == 0 && a == 0
        return SparseCat([0, 1, 2, 3], [1, 0, 0, 0])
    elseif s == 0 && a == 1
        return SparseCat([0, 1, 2, 3], [0.7, 0.3, 0, 0])
    elseif s == 0 && a == 2
        return SparseCat([0, 1, 2, 3], [0.2, 0.5, 0.3, 0])
    elseif s == 0 && a == 3
        return SparseCat([0, 1, 2, 3], [0, 0.2, 0.5, 0.3])
    elseif s == 1 && a == 0
        return SparseCat([0, 1, 2, 3], [0.7, 0.3, 0, 0])
    elseif s == 1 && a == 1
        return SparseCat([0, 1, 2, 3], [0.2, 0.5, 0.3, 0])
    elseif s == 1 && a == 2
        return SparseCat([0, 1, 2, 3], [0, 0.2, 0.5, 0.3])
    elseif s == 2 && a == 0
        return SparseCat([0, 1, 2, 3], [0.2, 0.5, 0.3, 0])
    elseif s == 2 && a == 1
        return SparseCat([0, 1, 2, 3], [0, 0.2, 0.5, 0.3])
    elseif s == 3 && a == 0
        return SparseCat([0, 1, 2, 3], [0, 0.2, 0.5, 0.3])
    end
end

function POMDPs.reward(m::MyMDP, s, a)
    if s == 0 && a == 0
        return -45
    elseif s == 0 && a == 1
        return -40
    elseif s == 0 && a == 2
        return -50
    elseif s == 0 && a == 3
        return -70
    elseif s == 1 && a == 0
        return -14
    elseif s == 1 && a == 1
        return -44
    elseif s == 1 && a == 2
        return -54
    elseif s == 1 && a == 3
        return -74
    elseif s == 2 && a == 0
        return -8
    elseif s == 2 && a == 1
        return -38
    elseif s == 3 && a == 0
        return -12
    end
end

q_learning_solver = QLearningSolver(n_episodes=10,
                                    learning_rate=0.8,
                                    exploration_policy=EpsGreedyPolicy(mdp, 0.5),
                                    verbose=false);
q_learning_policy = solve(q_learning_solver, mdp);

But I get “ERROR: ArgumentError: Sampler for this object is not defined” for

q_learning_policy = solve(q_learning_solver, mdp);

I have also tried

exppolicy = EpsGreedyPolicy(mdp, 0.01)
solver = QLearningSolver(exppolicy, learning_rate=0.1, n_episodes=50, max_episode_length=50, eval_every=50, n_eval_traj=100)
policy = solve(solver, mdp)

which I found here. However, for

solver = QLearningSolver(exppolicy, learning_rate=0.1, n_episodes=50, max_episode_length=50, eval_every=50, n_eval_traj=100)

it says: ERROR: MethodError: no method matching QLearningSolver(::EpsGreedyPolicy{POMDPPolicies.var"#17#18"{Float64}, Random._GLOBAL_RNG, Vector{Int64}}; learning_rate=0.1, n_episodes=50, max_episode_length=50, eval_every=50, n_eval_traj=100)

I do not understand what is happening. Thank you for your help.

I instrumented transition and reward

function POMDPs.transition(m::MyMDP, s, a)
    if s == 0 && a == 0
        return SparseCat([0,1,2,3], [1,0,0,0])
[...]
    else
        @assert false
    end    
end

function POMDPs.reward(m::MyMDP, s, a)
    if s == 0 && a == 0
        return -45
[...]
    else
        @assert false
    end        
end    

and get

ERROR: AssertionError: false

in transition, so it seems this function is incomplete? For the missing (s, a) cases it falls through and implicitly returns nothing, which the solver cannot sample a next state from; that would also explain the original “ERROR: ArgumentError: Sampler for this object is not defined”.
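As for the MethodError with QLearningSolver: the positional form seems to come from an older example. Judging from the keyword constructor, I would pass the exploration policy as a keyword argument (untested sketch, reusing your own parameter values):

exppolicy = EpsGreedyPolicy(mdp, 0.01)
solver = QLearningSolver(exploration_policy=exppolicy,  # keyword, not positional
                         learning_rate=0.1,
                         n_episodes=50,
                         max_episode_length=50,
                         eval_every=50,
                         n_eval_traj=100)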

Hello,

I have also tried:

function POMDPs.transition(m::ClinicalTrialsMDP, s, a)
    if s == 0 && a == 0 
        return SparseCat([0,1,2,3], [1,0,0,0])
    elseif s == 0 && a == 1 
        return SparseCat([0,1,2,3], [0.7, 0.3,0,0])
    elseif s == 0 && a == 2 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 0 && a == 3 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])

    elseif s == 1 && a == 0 
        return SparseCat([0,1,2,3], [0.7, 0.3,0,0])
    elseif s == 1 && a == 1 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 1 && a == 2 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 1 && a == 3 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
        
    elseif s == 2 && a == 0 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 2 && a == 1 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 2 && a == 2 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 2 && a == 3 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])

    elseif s == 3 && a == 0 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 1 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 2 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 3 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    end
end 


function POMDPs.reward(m::ClinicalTrialsMDP, s, a)
    if s == 0 && a == 0 
        return -45
    elseif s == 0 && a == 1 
        return -40
    elseif s == 0 && a == 2 
        return -50
    elseif s == 0 && a == 3 
        return -70
    
    elseif s == 1 && a == 0 
        return -14
    elseif s == 1 && a == 1 
        return -44
    elseif s == 1 && a == 2 
        return -54
    elseif s == 1 && a == 3 
        return -74   

    elseif s == 2 && a == 0 
        return -8
    elseif s == 2 && a == 1 
        return -38
    
    elseif s == 3 && a == 0 
        return -12
    end
end 

But it is still not working.

The following variation

using POMDPs
using POMDPModelTools
using QuickPOMDPs: QuickPOMDP
using TabularTDLearning
using POMDPPolicies
using POMDPModels
using Parameters, Random

struct MyMDP <: MDP{Int,Int} 
    indices::Dict{Int, Int}
    MyMDP() = new(Dict{Int, Int}(0=>1,1=>2,2=>3,3=>4))
end

mdp = MyMDP()

POMDPs.actions(m::MyMDP) = [0,1,2,3]
POMDPs.states(m::MyMDP) = [0,1,2,3]
POMDPs.discount(m::MyMDP) = 0.95
POMDPs.stateindex(m::MyMDP, s) = m.indices[s]
POMDPs.actionindex(m::MyMDP, a) = m.indices[a]
POMDPs.initialstate(m::MyMDP) = Uniform([0,1,2,3])

function POMDPs.transition(m::MyMDP, s, a)
    if s == 0 && a == 0 
        return SparseCat([0,1,2,3], [1,0,0,0])
    elseif s == 0 && a == 1 
        return SparseCat([0,1,2,3], [0.7, 0.3,0,0])
    elseif s == 0 && a == 2 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 0 && a == 3 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])

    elseif s == 1 && a == 0 
        return SparseCat([0,1,2,3], [0.7, 0.3,0,0])
    elseif s == 1 && a == 1 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 1 && a == 2 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 1 && a == 3 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
        
    elseif s == 2 && a == 0 
        return SparseCat([0,1,2,3], [0.2, 0.5,0.3,0])
    elseif s == 2 && a == 1 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 2 && a == 2 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 2 && a == 3 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])

    elseif s == 3 && a == 0 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 1 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 2 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    elseif s == 3 && a == 3 
        return SparseCat([0,1,2,3], [0, 0.2,0.5,0.3])
    else
        @show "transition", s, a
        return Uniform([0, 1, 2, 3])
    end    
end

function POMDPs.reward(m::MyMDP, s, a)
    if s == 0 && a == 0
        return -45
    elseif s == 0 && a == 1
        return -40
    elseif s == 0 && a == 2
        return -50
    elseif s == 0 && a == 3
        return -70
    elseif s == 1 && a == 0
        return -14
    elseif s == 1 && a == 1
        return -44
    elseif s == 1 && a == 2
        return -54
    elseif s == 1 && a == 3
        return -74  
    elseif s == 2 && a == 0
        return -8
    elseif s == 2 && a == 1
        return -38
    elseif s == 3 && a == 0
        return -12
    else
        @show "reward", s, a
        return 0
    end        
end    

q_learning_solver = QLearningSolver(n_episodes=10,
                                learning_rate=0.8,
                                exploration_policy=EpsGreedyPolicy(mdp, 0.5),
                                verbose=false);
q_learning_policy = solve(q_learning_solver, mdp);    

does something for me without showing errors. I don’t know if it is what you intended it to do.
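To see what it learned, you can query the policy for each state with the generic action function from POMDPs.jl:

# Show the greedy action the learned policy picks in each state:
for s in states(mdp)
    @show s, action(q_learning_policy, s)
end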

Hello,

Thank you so much for your help! It works, but I first need to interpret the results. Can you explain why, in transition, you use:

    else
        @show "transition", s, a
        return Uniform([0, 1, 2, 3])
    end

and in reward

    else
        @show "reward", s, a
        return 0
    end

I only tried to make the functions well defined, i.e. returning some data in every case (which seems to be required). It at least showed that some cases in reward were not defined…

Hello,

I understand, but doing it the way you did means that every transition is possible. What if I am in state 2 (s == 2) and can only take actions 0 and 1 (a == 0, a == 1)?

If I have understood your code correctly, when you do:

        @show "transition", s, a
        return Uniform([0, 1, 2, 3])
    end
end

Every transition is possible, right?
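Right: the Uniform fallback only makes the function total, it does not model which actions are legal. If only actions 0 and 1 should be available in state 2, POMDPs.jl also lets you define a state-dependent action space, roughly like this (a sketch; whether the Q-learning solver consults it is something you would have to check):

# Sketch: only actions 0 and 1 are legal in state 2, all four elsewhere.
POMDPs.actions(m::MyMDP, s) = s == 2 ? [0, 1] : [0, 1, 2, 3]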