DQN Reinforcement Learning Agent is Not Learning

Hello. I'm relatively new to Julia; I have only been coding in it for the last three months. I'm trying to create a DQN RL agent that finds the optimal path while also choosing the maximum speed for doing so. However, it just won't learn, and I really can't see why. Can someone please help me? I will add the script below.

using ReinforcementLearning
using Flux #Needed for the neural network functionality
using Plots
using DelimitedFiles #Needed to read all the txt files
using PolygonOps
using Random
using Intervals

#GeoBoundariesManipulation
include(joinpath(pwd(),"GeoBoundariesManipulation.jl"));
using .GeoBoundariesManipulation

#My problem's parameters
struct ShippingEnvParams
    gridworld_dims::Tuple{Int64,Int64} #Gridworld dimensions
    velocities::Vector{Int64} #available velocities from 6 knots to 20 knots
    acceleration::Vector{Int64} #available acceleration per step: -2, 0, 2
    heading::Vector{CartesianIndex{2}} #all heading manoeuvres
    punishment::Int64 #punishment per ordinary step
    out_of_grid_punishment::Int64 #punishment for going towards an island or out of grid bounds
    StartingPoint::CartesianIndex{2}
    GoalPoint::CartesianIndex{2}
    all_polygons::Vector{Vector{Tuple{Float64,Float64}}} #all the boundaries
end

function ShippingEnvParams(;
    gridworld_dims = (50,50),
    velocities = Vector((6:2:20)), 
    acceleration = Vector((-2:2:2)), 
    heading = [CartesianIndex(0,1);CartesianIndex(0,-1);CartesianIndex(-1,0);CartesianIndex(-1,1);CartesianIndex(-1,-1);CartesianIndex(1,-1);CartesianIndex(1,1);CartesianIndex(1,0)], 
    punishment = -5, 
    out_of_grid_punishment = -10, 
    StartingPoint = GeoBoundariesManipulation.GoalPointToCartesianIndex((-6.733535,61.997345),gridworld_dims[1],gridworld_dims[2]),
    EndingPoint = GeoBoundariesManipulation.GoalPointToCartesianIndex((-6.691500,61.535580),gridworld_dims[1],gridworld_dims[2]),
    AllPolygons = GeoBoundariesManipulation.load_files("finalboundaries") 
    )
    ShippingEnvParams(
        gridworld_dims,
        velocities,
        acceleration,
        heading,
        punishment,
        out_of_grid_punishment,
        StartingPoint,
        EndingPoint,
        AllPolygons
    )
end

###ENVIRONMENT CONSTRUCTION
#Instance
mutable struct ShippingEnv <: AbstractEnv
    params::ShippingEnvParams
    action_space::Base.OneTo{Int64}
    #observation_space::Space{Vector{UnitRange{Int64}}} #state_space
    observation_space::Space{Vector{Interval{Int64,Closed,Closed}}}
    state::Vector{Int64} #state: (position,velocity)
    action::Int64 #action: (heading_angle,acceleration)
    done::Bool #checks if agent has reached its goal
    position::CartesianIndex{2}
    time::Float64
    velocity::Int64
    distance::Float64
    reward::Union{Nothing,Float64} 
end


function ShippingEnv()
    params1 = ShippingEnvParams()
    env = ShippingEnv(
        params1,
        Base.OneTo(length(params1.heading)*length(params1.acceleration)),
        #Space((1:length(params1.heading),1:length(params1.acceleration))), #Space: (1-number of heading options, 1-number of acceleration options)
        #Space([1..params.gridworld_dims[1]*params.gridworld_dims[2],minimum(params.velocities)..maximum(params.velocities)]),
        Space([0..1,0..1]),
        #Space([1:(params1.gridworld_dims[1]*params1.gridworld_dims[2]),(1:length(params1.velocities))]), #(1-number of grid tiles, 1-number of velocity options)
        [LinearIndices((params1.gridworld_dims[1],params1.gridworld_dims[2]))[params1.StartingPoint],1],
        rand(1:length(params1.heading)*length(params1.acceleration)), #put a random action
        false,
        params1.StartingPoint,
        0.0,
        params1.velocities[1],
        0.0,
        0.0
    )
    reset!(env)
    env
end

function state_normalization(m::ShippingEnv)
    max_st_position = m.params.gridworld_dims[1]*m.params.gridworld_dims[2]
    min_st_position = 1
    max_st_velocity = length(m.params.velocities)
    min_st_velocity = 1

    position = (m.state[1] - min_st_position)/(max_st_position-min_st_position)
    velocity = (m.state[2] - min_st_velocity)/(max_st_velocity-min_st_velocity)
    return [position,velocity]
end

#Minimal interfaces implemented
RLBase.action_space(m::ShippingEnv) = m.action_space
RLBase.state_space(m::ShippingEnv) = m.observation_space
RLBase.reward(m::ShippingEnv) = m.done ? 0.0 : m.reward
RLBase.is_terminated(m::ShippingEnv) = m.done 
RLBase.state(m::ShippingEnv) = state_normalization(m::ShippingEnv)
#Random.seed!(m::ShippingEnv,seed) = Random.seed!(m.rng,seed)

function RLBase.reset!(m::ShippingEnv)
    m.position = m.params.StartingPoint
    m.velocity = m.params.velocities[1]
    m.done = false
    m.time = 0
    m.distance = 0
    #nothing
end

#Action Space Map Parameters: Object Construction
struct as_map_params
    nheading::Int64
    nacceleration::Int64
    nvelocities::Int64
end

function as_map_params(;
    shipping_env_params = ShippingEnvParams(),
    nheading = length(shipping_env_params.heading),
    nacceleration = length(shipping_env_params.acceleration),
    nvelocities = length(shipping_env_params.velocities)
    )
    as_map_params(
        nheading,
        nacceleration,
        nvelocities
    )
end

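#as_map builds the lookup table that maps an integer action to a tuple of indices:
#it takes the Cartesian product of the heading, acceleration and velocity index ranges and
#flattens the nested tuples, so all_actions[a] is a (heading, acceleration, velocity) index tuple.
#Only the first two entries are used when the environment is stepped below.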
function as_map(;map_params = as_map_params())
    arr_heading = collect(1:map_params.nheading)
    arr_acceleration = collect(1:map_params.nacceleration)
    arr_velocities = collect(1:map_params.nvelocities)
    arr = [arr_heading, arr_acceleration, arr_velocities]

    temp_arr = collect(Base.product(arr[1],arr[2]))
    i = 3
    while i <= length(arr)
        temp_arr = collect(Base.product(temp_arr,arr[i]))
        i += 1
    end

    final_arr = vec(temp_arr)

    function remove_internal_tuples(vect)
        d = []
        d_internal = []
        while first(vect)[1] isa Tuple
            d = []
            for i in 1:length(vect)
                empty!(d_internal)
                for ii in 1:length(first(vect[i]))
                    push!(d_internal,first(vect[i])[ii])
                end
                for iii in 2:length(vect[i])
                    push!(d_internal, vect[i][iii])
                end
                append!(d,d_internal)
            end
            vect = []
            push!(vect,d)
        end
    
        function number_to_tuples(vect)
            t = []
            for i in 1:3:length(vect)
                push!(t,(vect[i],vect[i+1],vect[i+2]))
            end
            return t
        end
    
        return number_to_tuples(vect[1])
    end

    all_actions = remove_internal_tuples(final_arr)

    return all_actions
end


global all_actions = as_map();
#Function defining what happens every time an action is made
function (m::ShippingEnv)(a::Int64)
    nextstep(m,all_actions[a][1],all_actions[a][2])
end

function nextstep(m::ShippingEnv, head_action, acceleration)
    heading = m.params.heading[head_action]
    r = m.params.punishment #default punishment for an ordinary step
    m.position += heading
    dist_covered = sqrt(heading[1]^2 + heading[2]^2)
    m.distance += dist_covered
    next_state_norm = (m.position[1]/m.params.gridworld_dims[1],m.position[2]/m.params.gridworld_dims[2]) #normalized coordinates for the inanypolygon check
    #Check if next state is out of bounds and assign appropriate punishment
    if m.position[1]<1 || m.position[1]>m.params.gridworld_dims[1] || m.position[2]<1 || m.position[2]>m.params.gridworld_dims[2] || inanypolygon(next_state_norm, m.params.all_polygons)
        r = m.params.out_of_grid_punishment #replace punishment
        m.position -= heading
        m.distance -= dist_covered
    end

    #Update velocity only if velocity + acceleration stays within the velocity bounds
    current_acceleration = m.params.acceleration[acceleration] #actual acceleration value (-2, 0 or 2)
    if (m.velocity + current_acceleration) > minimum(m.params.velocities) && (m.velocity + current_acceleration < maximum(m.params.velocities))
        m.velocity += current_acceleration #only applied when the new velocity stays strictly between the minimum and maximum velocity
    end
    
    m.time = dist_covered/m.velocity
    #m.reward = r -m.time
    m.reward = r
    m.done = m.position == m.params.GoalPoint
    m.state = [LinearIndices((m.params.gridworld_dims[1],m.params.gridworld_dims[2]))[m.position],first(findall(x->x==m.velocity,m.params.velocities))]
end

#DQN agent
function agent_construction(;
    hidden1=40, 
    hidden2=50, 
    updhor = 2)

    agent = Agent(
        policy=QBasedPolicy(
            #DQNLearner will be used because of the option for double dqn.
            learner=DQNLearner( 
                approximator=NeuralNetworkApproximator(
                    model = Chain(
                        Dense(length(state(env)),hidden1,sigmoid),
                        Dense(hidden1,hidden2,sigmoid),
                        Dense(hidden2,length(action_space(env)),sigmoid)
                    ),
                    optimizer = ADAM(0.001), 
                ),
                target_approximator = NeuralNetworkApproximator(
                    model = Chain(
                        Dense(length(state(env)),hidden1,sigmoid),
                        Dense(hidden1,hidden2,sigmoid),
                        Dense(hidden2,length(action_space(env)),sigmoid)
                    ),
                    optimizer = ADAM(0.001), 
                ),
                loss_func = Flux.huber_loss,
                γ = 0.99f0, #discount rate
                batch_size = 50, #mini batch_size
                update_horizon = updhor, #G = r .+ γ^n .* (1 .- t) .* q′
                #---min_replay_history
                #number of transitions that should be made before updating the approximator
                #it is the replay_start_size = the count of experiences (frames) to add to replay buffer before starting training
                min_replay_history = 25, 
                update_freq = 4, #the frequency of updating the approximator
                #---target_update_freq 
                #how frequently we sync model weights from the main DQN network to the target DQN network
                #(how many frames in between syncing) 
                target_update_freq = 100, 
                stack_size = nothing, #use the recent stack_size frames to form a stacked state
                traces = SARTS, #current state, action, reward, terminal, next state
                rng = Random.GLOBAL_RNG,
                is_enable_double_DQN = true #enable double dqn, enabled by default
            ),
            explorer = EpsilonGreedyExplorer(;
                kind = :linear, 
                step = 1, #record the current step
                ϵ_init = 0.99, #initial epsilon
                warmup_steps = 0, #the number of steps to use ϵ_init
                decay_steps = 0, #the number of steps for epsilon to decay from ϵ_init to ϵ_stable
                ϵ_stable = 0.1, #the epsilon after warmup_steps + decay_steps
                is_break_tie = true, #if set to true, randomly select among actions that share the maximum value
                rng = Random.GLOBAL_RNG, #set the internal rng
                is_training = true #when not in training mode, step is not updated and ϵ is set to 0
            )
        ),
        #A trajectory is the sequence of what has happened over a set of consecutive time steps
        trajectory=CircularArraySARTTrajectory(;
            capacity = 200,
            state = Vector{Float64} => (length(state(env)),),
            # action = Int => (),
            # reward = Float32 => (),
            # terminal = Bool => (),
        ) #when using a NN approximator, use a CircularArraySARTTrajectory instead of a VectorSARTTrajectory
    )

    return agent
end

#Customized hook
Base.@kwdef mutable struct customized_hook <: AbstractHook
    velocity::Vector{Int64} = []
    velocity_total::Vector{Vector{Int64}} = []
    position::Vector{CartesianIndex{2}} = []
    position_total::Vector{Vector{CartesianIndex{2}}} = []
    reward::Vector{Float64} = []
    reward_total::Vector{Vector{Float64}} = []
end

(h::customized_hook)(::PostActStage,agent,env) = 
begin 
    push!(h.velocity,env.velocity)
    push!(h.position,env.position)
    push!(h.reward,env.reward)
end

(h::customized_hook)(::PostEpisodeStage,agent,env) = 
begin 
    h.velocity_total = vcat(h.velocity_total,[h.velocity])
    h.position_total = vcat(h.position_total,[h.position])
    h.reward_total = vcat(h.reward_total,[h.reward])
end

(h::customized_hook)(::PreEpisodeStage,agent,env) = 
begin 
    h.velocity = []
    h.position = []
    h.reward = []
end

#Construct the environment and the agent
env = ShippingEnv()
agent = agent_construction(;hidden1=40,hidden2=50,updhor=1)


#Run Experiment
hook = customized_hook();
stop_condition =  StopAfterEpisode(50, is_show_progress = true)
ex = Experiment(agent, env, stop_condition, hook, "#Test")
RLBase.test_runnable!(env)
@time run(ex)

I can try to have a look tomorrow. Is GeoBoundariesManipulation.jl still the same as the one you linked in the Slack channel a while ago?

Yes, GeoBoundariesManipulation.jl is the same. I tried today to run it with a TD learner, just to check whether there was anything wrong with the environment setup. The TD agent managed to find the desired output, so I suppose whatever goes wrong has to do with the DQN agent. Thank you, Albin!

Okay, that is good; then it is likely just something with the DQN that needs tuning.

I tried running your environment, though with 10 times as many episodes, and got something that looks like it is learning a little at least. Though I'm not sure what reward you expect?
[plot: mean reward per episode]

Have you done any hyperparameter tuning? I think DQN can be quite sensitive, especially to some of the parameters.


Did you change anything else except for the number of episodes? Yesterday, I ran it with 1000 episodes and the agent did not seem to learn.

Well, yes, I have tried the following changes, but still saw no improvement:

  • ADAM(0.01), ADAM(0.001)
  • batch_size: Minimum value I tried was 20 and maximum 200.
  • update_horizon: Minimum = 1, Maximum = 3
  • min_replay_history: It was always either equal to the batch size or something like 25% or 50% of it.
  • target_update_frequency: Minimum = 4, Maximum = 100
  • trajectory capacity: Minimum = 4, Maximum = 1000

The agent is getting -5 reward for every step it takes, so it should eventually learn to follow the optimal path.

No other changes.

Things I can think of that it could be:

  • Just lucky RNG on my side; since you use the global RNG, we won't have the same seed (see the seeding sketch after this list).
  • That you have made changes in GeoBoundariesManipulation.jl since the version you posted in Slack.
  • That we have different versions of Julia or packages where something might have changed.
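
To rule out the RNG point, a minimal sketch (the seed value 1 is arbitrary) would be to seed the global RNG once at the top of the script, since both the DQNLearner and the EpsilonGreedyExplorer are constructed with Random.GLOBAL_RNG:

using Random

#Make runs repeatable: the learner and the explorer both draw from Random.GLOBAL_RNG
Random.seed!(1)

#Alternatively, construct an explicit RNG, e.g. rng = MersenneTwister(1),
#and pass it as the rng keyword instead of GLOBAL_RNG.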

My versions

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

(testrl) pkg> st
      Status `/tmp/testrl/Project.toml`
  [a93c6f00] DataFrames v1.3.3
  [587475ba] Flux v0.12.10
  [d8418881] Intervals v1.5.0
  [647866c9] PolygonOps v0.1.2
  [158674fc] ReinforcementLearning v0.10.0
  [2913bbd2] StatsBase v0.33.16

I have made no changes. Also, GeoBoundariesManipulation.jl is mainly used just to load the boundary data, so it doesn't really affect anything.

Also, it seems that we have the same version. These are the rewards for the 1000 episodes I ran yesterday.

[plot: episode rewards from the 1000-episode run]

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

(konmeso_thesis) pkg> st
      Status `C:\Users\flaskounas\Desktop\konmeso_thesis\Project.toml`
  [336ed68f] CSV v0.10.2
  [a93c6f00] DataFrames v1.3.2
  [587475ba] Flux v0.12.9
  [186bb1d3] Fontconfig v0.4.0
  [28b8d3ca] GR v0.62.1
  [d8418881] Intervals v1.5.0
  [91a5bcdd] Plots v1.24.3
  [647866c9] PolygonOps v0.1.2
  [158674fc] ReinforcementLearning v0.10.0
  [e575027e] ReinforcementLearningBase v0.9.7
  [2913bbd2] StatsBase v0.33.16
  [f3b207a7] StatsPlots v0.14.33
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [10745b16] Statistics

I set up an environment with the exact versions and also set the seed at the start, so if you extract this into a folder and start Julia with that environment (either julia --project=. if you use a terminal, or change the environment in VS Code), it should run exactly the same as for me. First instantiate the environment (to make sure all the same versions are installed) by running ]instantiate, and then run the file using e.g. include("RL-New-Env.jl"). I haven't done a full run again with this seed; it is running right now and will be done in an hour or so, but based on a quick shorter run I assume it will work. Give it a try and see if it looks any different for you.

https://www.dropbox.com/s/lq80c8rip5eyjsd/testrl.zip?dl=0

I realise that I seem to have included a file without the changes I mentioned, and also forgot one of the data directories.

I updated the files in the old link, but here it is again
https://www.dropbox.com/s/lq80c8rip5eyjsd/testrl.zip?dl=0
For me this created the plot below; still not a lot of learning, but it seems like a small improvement in average episode reward over time at least.
[plot: mean reward per episode]

So you did not change anything other than adding the line Random.seed!(1)?

I think that should be the only substantial change. I did make some small changes in the GeoBoundariesManipulation.jl file to allow a non-Windows machine to get the correct file paths, but that shouldn't change anything.

Maybe it could be the plotting. I plot the mean reward over each episode; if you plot the sum, the episode reward will vary a lot depending on the length of the episode.
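
With the customized_hook from your script, both views can be plotted from the collected reward_total field, roughly like this (just a sketch, assuming hook is the hook object used in the run):

using Plots, Statistics

#Mean reward per episode (what I plotted) versus summed reward per episode
p1 = plot(mean.(hook.reward_total), label = "mean reward per episode")
p2 = plot(sum.(hook.reward_total), label = "summed reward per episode")
plot(p1, p2, layout = (2, 1))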

We had slightly different versions of some packages, but not something that should make a difference. So I don’t really know what it could be.

Does it work for you now? If it does, can you try a few different seeds and see whether it is just a bit unstable in the learning, and whether some seeds give more of the result you saw before?

I ran it for 1000 episodes. It doesn't seem to be getting any better; it levels off and does not learn any further.
[plot: episode rewards over 1000 episodes]

Okay, but now you have something that looks slightly more reasonable at least. Do you know what average reward you expect from a good agent? Is it very far away?

It could be worth trying some hyperparameter optimization, either just manually or you could use something like

Yes, I know that a good agent should complete an episode in under 50 steps. I have tried tuning the hyperparameters manually, but without any result. I'll try using the package, but I'm really worried that it won't make any difference.
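
(With the -5 step punishment, that corresponds to a summed episode reward of roughly 50 × (-5) = -250, i.e. an average of about -5 per step, ignoring any -10 out-of-bounds penalties.)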

If you have tried some manual tuning it might not make any difference; that package just helps automate the task.
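
In case it helps, a crude manual sweep could look something like the sketch below. It reuses agent_construction, ShippingEnv and customized_hook from your script and scores each trial by the average per-episode mean reward; the swept values and the episode count are arbitrary, and knobs like the ADAM learning rate or the batch_size would first need to be exposed as extra keyword arguments of agent_construction.

using Statistics

results = []
for hidden in (40, 80), updhor in (1, 2, 3)
    global env = ShippingEnv() #fresh environment per trial (agent_construction reads the global env)
    agent = agent_construction(; hidden1 = hidden, hidden2 = hidden, updhor = updhor)
    hook = customized_hook()
    run(agent, env, StopAfterEpisode(200, is_show_progress = false), hook)
    score = mean(mean.(hook.reward_total)) #average of the per-episode mean rewards
    push!(results, (hidden = hidden, updhor = updhor, score = score))
end
sort!(results, by = r -> r.score, rev = true) #best configuration first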

Something I often find helpful when debugging RL is to implement some reasonable agent that I can run against my environment, just to verify that the environment is not incorrectly implemented by checking that I can reach some expected reward with it. Do you have some simple heuristic which could allow you to implement something like that?

You would just need to create your policy struct and implement one method for it, and then you should be able to use it as a drop-in replacement for the Agent in your code.
Something like

struct MyAgent <: AbstractPolicy
    # Can keep some internal state here if needed
end

function (agent::MyAgent)(env)
    s = state(env)
    a = ... # Probably some function of s
    return a
end
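
Once the ... is filled in with an actual heuristic, it should be runnable against the environment just like the learned agent, e.g. something along these lines (using the hook from your script):

heuristic = MyAgent()
hook = customized_hook()
run(heuristic, env, StopAfterEpisode(10), hook)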

You also mentioned that you had tried other methods than DQN; did they reach any better scores?

Yes, I have tried a TD learner. The result is the expected one: the path is finished in approximately 45 steps, and the agent does that using the maximum available speed.

That's a really nice idea for checking the environment; I'll keep it in mind and try it out just for educational purposes. Thank you for sharing it with me. However, I suppose the TD learner would not have reached the desired result if there were anything wrong with the environment.

I will try a more detailed hyperparameter tuning tomorrow, in case it's more sensitive than I initially thought. I will write down the results and share them with you.

Yeah, if the TD learner manages it, it does seem like something is amiss with the DQN agent specifically. I have run the DQN agent on some classical environments and it seems to learn (though it is a bit more unstable than some other algorithms). So I don't think there is necessarily something wrong with it, just that it is a bit unstable, and depending on the dynamics of your environment this might make it prone to getting stuck with suboptimal policies.

My best bet would be to do a little more hyperparameter search, then, if you want to get better results out of DQN specifically.