# DQN Reinforcement Learning Agent is Not Learning

Hello. I’m relatively new to Julia; I have only been coding in it for the last 3 months. I’m trying to create a DQN RL agent that finds the optimal path while also choosing the maximum speed along it. However, it just won’t learn and I really can’t see why. Can someone please help me? The script is below.

``````
using ReinforcementLearning
using Flux #Needed for all the Neural Networks functionalities
using Plots
using DelimitedFiles #Needed to read all the txt files
using PolygonOps
using Random
using Intervals

#GeoBoundariesManipulation
include(joinpath(pwd(),"GeoBoundariesManipulation.jl"));
using .GeoBoundariesManipulation

#My problem's parameters
struct ShippingEnvParams
gridworld_dims::Tuple{Int64,Int64} #Gridworld dimensions
velocities::Vector{Int64} #available velocities from 6 knots to 20 knots
acceleration::Vector{Int64} #available acceleration per step: -2, 0, 2
punishment::Int64 #punishment per ordinary step
out_of_grid_punishment::Int64 #punishment for going towards an island or out of grid bounds
StartingPoint::CartesianIndex{2}
GoalPoint::CartesianIndex{2}
all_polygons::Vector{Vector{Tuple{Float64,Float64}}} #all the boundaries
end

function ShippingEnvParams(;
gridworld_dims = (50,50),
velocities = Vector((6:2:20)),
acceleration = Vector((-2:2:2)),
punishment = -5,
out_of_grid_punishment = -10,
StartingPoint = GeoBoundariesManipulation.GoalPointToCartesianIndex((-6.733535,61.997345),gridworld_dims[1],gridworld_dims[2]),
EndingPoint = GeoBoundariesManipulation.GoalPointToCartesianIndex((-6.691500,61.535580),gridworld_dims[1],gridworld_dims[2]),
)
ShippingEnvParams(
gridworld_dims,
velocities,
acceleration,
punishment,
out_of_grid_punishment,
StartingPoint,
EndingPoint,
AllPolygons
)
end

###ENVIRONMENT CONSTRUCTION
#Instance
mutable struct ShippingEnv <: AbstractEnv
params::ShippingEnvParams
action_space::Base.OneTo{Int64}
#observation_space::Space{Vector{UnitRange{Int64}}} #state_space
observation_space::Space{Vector{Interval{Int64,Closed,Closed}}}
state::Vector{Int64} #state: (position,velocity)
done::Bool #checks if agent has reached its goal
position::CartesianIndex{2}
time::Float64
velocity::Int64
distance::Float64
reward::Union{Nothing,Float64}
end

function ShippingEnv()
params1 = ShippingEnvParams()
env = ShippingEnv(
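params1,
Base.OneTo(length(all_actions)), #action_space (assumed: one action index per entry of all_actions)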
#Space([1..params.gridworld_dims[1]*params.gridworld_dims[2],minimum(params.velocities)..maximum(params.velocities)]),
Space([0..1,0..1]),
#Space([1:(params1.gridworld_dims[1]*params1.gridworld_dims[2]),(1:length(params1.velocities))]), #(1-number of grid tiles, 1-number of velocity options)
[LinearIndices((params1.gridworld_dims[1],params1.gridworld_dims[2]))[params1.StartingPoint],1],
false,
params1.StartingPoint,
0.0,
params1.velocities[1],
0.0,
0.0
)
reset!(env)
env
end

function state_normalization(m::ShippingEnv)
max_st_position = m.params.gridworld_dims[1]*m.params.gridworld_dims[2]
min_st_position = 1
max_st_velocity = length(m.params.velocities)
min_st_velocity = 1

position = (m.state[1] - min_st_position)/(max_st_position-min_st_position)
velocity = (m.state[2] - min_st_velocity)/(max_st_velocity-min_st_velocity)
return [position,velocity]
end

#Minimal interfaces implemented
RLBase.action_space(m::ShippingEnv) = m.action_space
RLBase.state_space(m::ShippingEnv) = m.observation_space
RLBase.reward(m::ShippingEnv) = m.done ? 0.0 : m.reward
RLBase.is_terminated(m::ShippingEnv) = m.done
RLBase.state(m::ShippingEnv) = state_normalization(m::ShippingEnv)
#Random.seed!(m::ShippingEnv,seed) = Random.seed!(m.rng,seed)

function RLBase.reset!(m::ShippingEnv)
m.position = m.params.StartingPoint
m.velocity = m.params.velocities[1]
m.done = false
m.time = 0
m.distance = 0
#nothing
end

#Action Space Map Parameters: Object Construction
struct as_map_params
nacceleration::Int64
nvelocities::Int64
end

function as_map_params(;
shipping_env_params = ShippingEnvParams(),
nacceleration = length(shipping_env_params.acceleration),
nvelocities = length(shipping_env_params.velocities)
)
as_map_params(
nacceleration,
nvelocities
)
end

function as_map(;map_params = as_map_params())
arr_acceleration = collect(1:map_params.nacceleration)
arr_velocities = collect(1:map_params.nvelocities)

temp_arr = collect(Base.product(arr[1],arr[2]))
i = 3
while i <= length(arr)
temp_arr = collect(Base.product(temp_arr,arr[i]))
i += 1
end

final_arr = vec(temp_arr)

function remove_internal_tuples(vect)
d = []
d_internal = []
while first(vect)[1] isa Tuple
d = []
for i in 1:length(vect)
empty!(d_internal)
for ii in 1:length(first(vect[i]))
push!(d_internal,first(vect[i])[ii])
end
for iii in 2:length(vect[i])
push!(d_internal, vect[i][iii])
end
append!(d,d_internal)
end
vect = []
push!(vect,d)
end

function number_to_tuples(vect)
t = []
for i in 1:3:length(vect)
push!(t,(vect[i],vect[i+1],vect[i+2]))
end
return t
end

return number_to_tuples(vect[1])
end

all_actions = remove_internal_tuples(final_arr)

return all_actions
end

global all_actions = as_map();
#Function defining what happens every time an action is made
function (m::ShippingEnv)(a::Int64)
nextstep(m,all_actions[a][1],all_actions[a][2])
end

r = m.params.punishment #initialized punishment if everything's okay
m.distance += dist_covered
next_state_norm = (m.position[1]/m.params.gridworld_dims[1],m.position[2]/m.params.gridworld_dims[2]) #normalized for the inanypolygon check
#Check if next state is out of bounds and assign appropriate punishment
if m.position[1]<1 || m.position[1]>m.params.gridworld_dims[1] || m.position[2]<1 || m.position[2]>m.params.gridworld_dims[2] || inanypolygon(next_state_norm, m.params.all_polygons)
r = m.params.out_of_grid_punishment #replace punishment
m.distance -= dist_covered
end

#Checking if velocity+acceleration is out of velocities' bounds
current_acceleration = m.params.acceleration[acceleration] #actual acceleration
if (m.velocity + current_acceleration) > minimum(m.params.velocities) && (m.velocity + current_acceleration < maximum(m.params.velocities))
m.velocity += current_acceleration #the acceleration input is an index 1-3 that maps to -2, 0 or +2
end

m.time = dist_covered/m.velocity
#m.reward = r -m.time
m.reward = r
m.done = m.position == m.params.GoalPoint
m.state = [LinearIndices((m.params.gridworld_dims[1],m.params.gridworld_dims[2]))[m.position],first(findall(x->x==m.velocity,m.params.velocities))]
end

#DQN agent
function agent_construction(;
hidden1=40,
hidden2=50,
updhor = 2)

agent = Agent(
policy=QBasedPolicy(
#DQNLearner will be used because of the option for double dqn.
learner=DQNLearner(
approximator=NeuralNetworkApproximator(
model = Chain(
Dense(length(state(env)),hidden1,sigmoid),
Dense(hidden1,hidden2,sigmoid),
Dense(hidden2,length(action_space(env)),sigmoid)
),
),
target_approximator = NeuralNetworkApproximator(
model = Chain(
Dense(length(state(env)),hidden1,sigmoid),
Dense(hidden1,hidden2,sigmoid),
Dense(hidden2,length(action_space(env)),sigmoid)
),
),
loss_func = Flux.huber_loss,
γ = 0.99f0, #discount rate
batch_size = 50, #mini batch_size
update_horizon = updhor, #G = r .+ γ^n .* (1 .- t) .* q′
#---min_replay_history
#number of transitions that should be made before updating the approximator
#it is the replay_start_size = the count of experiences (frames) to add to replay buffer before starting training
min_replay_history = 25,
update_freq = 4, #the frequency of updating the approximator
#---target_update_freq
#how frequently we sync model weights from the main DQN network to the target DQN network
#(how many frames in between syncing)
target_update_freq = 100,
stack_size = nothing, #use the recent stack_size frames to form a stacked state
traces = SARTS, #current state, action, reward, terminal, next state
rng = Random.GLOBAL_RNG,
is_enable_double_DQN = true #enable double dqn, enabled by default
),
explorer = EpsilonGreedyExplorer(;
kind = :linear,
step = 1, #record the current step
ϵ_init = 0.99, #initial epsilon
warmup_steps = 0, #the number of steps to use ϵ_init
decay_steps = 0, #the number of steps for epsilon to decay from ϵ_init to ϵ_stable
ϵ_stable = 0.1, #the epsilon after warmup_steps + decay_steps
is_break_tie = true, #randomly select an action of the same maximum values if set to true.
rng = Random.GLOBAL_RNG, #set the internal rng
is_training = true #training mode; when false, step is not updated and epsilon is set to 0.
)
),
#A trajectory is the sequence of what has happened over a set of continuous timestamps
trajectory=CircularArraySARTTrajectory(;
capacity = 200,
state = Vector{Float64} => (length(state(env)),),
# action = Int => (),
# reward = Float32 => (),
# terminal = Bool => (),
) #when using NN you have to change from VectorSARTTrajectory to CircularArraySARTTrajectory
)

return agent
end

#Customized hook
Base.@kwdef mutable struct customized_hook <: AbstractHook
velocity::Vector{Int64} = []
velocity_total::Vector{Vector{Int64}} = []
position:: Vector{CartesianIndex{2}} = []
position_total:: Vector{Vector{CartesianIndex{2}}} = []
reward::Vector{Float64} = []
reward_total::Vector{Vector{Float64}} =[]
end

(h::customized_hook)(::PostActStage,agent,env) =
begin
push!(h.velocity,env.velocity)
push!(h.position,env.position)
push!(h.reward,env.reward)
end

(h::customized_hook)(::PostEpisodeStage,agent,env) =
begin
h.velocity_total = vcat(h.velocity_total,[h.velocity])
h.position_total = vcat(h.position_total,[h.position])
h.reward_total = vcat(h.reward_total,[h.reward])
end

(h::customized_hook)(::PreEpisodeStage,agent,env) =
begin
h.velocity = []
h.position = []
h.reward = []
end

#Environment defining
env = ShippingEnv()
agent = agent_construction(;hidden1=40,hidden2=50,updhor=1)

#Run Experiment
hook = customized_hook();
stop_condition =  StopAfterEpisode(50, is_show_progress = true)
ex = Experiment(agent, env, stop_condition, hook, "#Test")
RLBase.test_runnable!(env)
@time run(ex)
``````
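For reference, the explorer comments in the script describe a linear ϵ schedule. Below is a minimal sketch of that schedule in plain Julia, written from those comments rather than taken from the library source; the function name `linear_epsilon` is made up for illustration.

```julia
# Sketch of a linear epsilon schedule as described by the explorer comments:
# hold ϵ_init during warmup, decay linearly over decay_steps, then stay at ϵ_stable.
# `linear_epsilon` is a hypothetical helper, not a ReinforcementLearning.jl function.
function linear_epsilon(step; ϵ_init, ϵ_stable, warmup_steps, decay_steps)
    if step <= warmup_steps
        return ϵ_init
    elseif step >= warmup_steps + decay_steps
        return ϵ_stable
    else
        frac = (step - warmup_steps) / decay_steps
        return ϵ_init + frac * (ϵ_stable - ϵ_init)
    end
end

# Settings from the script: no warmup and no decay.
posted = linear_epsilon(1; ϵ_init = 0.99, ϵ_stable = 0.1, warmup_steps = 0, decay_steps = 0)

# Same settings, but with a hypothetical decay over 1000 steps.
halfway = linear_epsilon(500; ϵ_init = 0.99, ϵ_stable = 0.1, warmup_steps = 0, decay_steps = 1000)

println("posted settings, step 1: ", posted)
println("decay over 1000 steps, step 500: ", halfway)
```

Read this way, `warmup_steps = 0` together with `decay_steps = 0` puts ϵ at `ϵ_stable = 0.1` from the first step, so `ϵ_init = 0.99` would never be used. If the library follows the same rule, a nonzero `decay_steps` may be worth trying to get more exploration early on.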

I can try to have a look tomorrow. Is `GeoBoundariesManipulation.jl` still the same as the one you linked in the Slack channel a while ago?

Yes, GeoBoundariesManipulation.jl is the same. Today I tried running it with a TD learner, just to check whether there’s anything wrong with the environment setup. The TD agent managed to find the desired output, so I suppose whatever is going wrong has to do with the DQN agent. Thank you Albin!

Okay, that is good, then it is likely only something with DQN that needs tuning.

I tried running your environment, though with ten times as many episodes, and got something that looks like it is learning a little at least, though I’m not sure what your expected reward is.

Have you done any hyperparameter tuning? I think DQN can be quite sensitive to some of the parameters in particular.

Did you change anything else except for the number of episodes? Yesterday, I ran it with 1000 episodes and the agent did not seem to learn.

Well, yes, I tried the following changes but still saw no result.

• batch_size: Minimum value I tried was 20 and maximum 200.
• update_horizon: Minimum = 1, Maximum = 3
• min_replay_history: It was always either equal to the batch size or something like 25% or 50% of it.
• target_update_freq: Minimum = 4, Maximum = 100
• trajectory capacity: Minimum = 4, Maximum = 1000

The agent is getting -5 reward for every step it takes, so it should eventually learn to follow the optimal path.
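As a worked example of what that reward implies for the Q-values: with -5 per step and the script's discount of 0.99, every return is negative. The sketch below computes the discounted return of an illustrative 45-step path and compares it with the range of `sigmoid`, which the posted network applies to its last `Dense` layer. This is a sanity check under those assumptions, not a confirmed diagnosis; `discounted_return` is a made-up helper name.

```julia
# Discounted return of an episode that pays `r` on each of `n` steps.
# `discounted_return` is a hypothetical helper for this illustration.
discounted_return(r, γ, n) = sum(r * γ^k for k in 0:n-1)

# Logistic sigmoid, defined here so the example needs no packages.
sigmoid(x) = 1 / (1 + exp(-x))

# Return of an illustrative 45-step path at -5 per step, γ = 0.99.
g = discounted_return(-5.0, 0.99, 45)
println("45-step discounted return: ", round(g, digits = 2))

# Sigmoid maps every input strictly into (0, 1).
outputs = sigmoid.(-20.0:0.5:20.0)
println("sigmoid output range: ", extrema(outputs))
```

Since the targets are negative and a sigmoid output layer can only produce values between 0 and 1, the network as posted cannot fit them. Leaving the last layer without an activation (a linear output) may be worth trying.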

No other changes.

Things I can think of that it could be:

• just lucky RNG on my side; since you use the global RNG, we won’t have the same seed.
• that you have made changes in `GeoBoundariesManipulation.jl` since the version you posted in Slack?
• that we have different versions of Julia or packages where something might have changed.
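To rule out the first point, both runs can be seeded. The sketch below uses only the `Random` standard library to show that the same seed reproduces the same draws, either through the global RNG or through a dedicated RNG object. Whether the `rng` keyword arguments in the script accept such an object in the same way is an assumption here.

```julia
using Random

# Seeding the global RNG makes subsequent draws repeatable.
Random.seed!(1)
a = rand(1:100, 5)
Random.seed!(1)
b = rand(1:100, 5)
println("global RNG, same seed, same draws: ", a == b)

# A dedicated RNG object avoids relying on global state.
rng1 = MersenneTwister(1)
rng2 = MersenneTwister(1)
c = rand(rng1, 3)
d = rand(rng2, 3)
println("MersenneTwister, same seed, same draws: ", c == d)
```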

My versions

``````
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = code

(testrl) pkg> st
Status `/tmp/testrl/Project.toml`
[a93c6f00] DataFrames v1.3.3
[587475ba] Flux v0.12.10
[d8418881] Intervals v1.5.0
[647866c9] PolygonOps v0.1.2
[158674fc] ReinforcementLearning v0.10.0
[2913bbd2] StatsBase v0.33.16
``````

I have made no changes. Also, GeoBoundariesManipulation.jl is mainly used just to load the boundary data, so it doesn’t really affect anything.

Also, it seems that we have the same version. These are the rewards for 1000 episodes I ran yesterday.

``````
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, haswell)
Environment:
JULIA_EDITOR = code

(konmeso_thesis) pkg> st
[336ed68f] CSV v0.10.2
[a93c6f00] DataFrames v1.3.2
[587475ba] Flux v0.12.9
[186bb1d3] Fontconfig v0.4.0
[28b8d3ca] GR v0.62.1
[d8418881] Intervals v1.5.0
[91a5bcdd] Plots v1.24.3
[647866c9] PolygonOps v0.1.2
[158674fc] ReinforcementLearning v0.10.0
[e575027e] ReinforcementLearningBase v0.9.7
[2913bbd2] StatsBase v0.33.16
[f3b207a7] StatsPlots v0.14.33
[8bb1440f] DelimitedFiles
[37e2e46d] LinearAlgebra
[10745b16] Statistics
``````

I set up an environment with the exact package versions and also set the seed at the start. Extract this into a folder and start Julia with that environment, either with `julia --project=.` if you use a terminal or by changing the environment in VS Code. First instantiate the environment (which makes sure the same versions are installed) by running `]instantiate`, then run the file with e.g. `include("RL-New-Env.jl")`. It should then run exactly as it does for me. (I haven’t done a full run with this seed yet; one is running right now and will be done in an hour or so, but based on a quick shorter run I assume it will work again.) Give it a try and see if it looks any different for you.

I realise that I seem to have included a file without the changes I mentioned, and also forgot one of the data directories.

I updated the files in the old link, but here it is again

For me this created the plot below, still not a lot of learning but it seems like a small improvement in average episode reward over time at least.

So you did not change anything other than adding the line `Random.seed!(1)`?

I think that should be the only substantial line. I made some small changes in the `GeoBoundariesManipulation.jl` file to allow a non-Windows machine to get the correct file paths, but that shouldn’t change anything.

Maybe it could be the plotting. I plot the mean reward over each episode; if you plot the sum, the episode reward would vary a lot depending on the length of the episode.
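The difference between the two plots can be sketched with made-up numbers. The data below is invented for illustration and only mimics the shape of `hook.reward_total` from the script (one vector of per-step rewards per episode).

```julia
using Statistics

# Made-up per-step rewards for three episodes of different lengths:
# 200 ordinary steps, 60 ordinary steps, and 40 ordinary steps plus 5 penalized ones.
reward_total = [
    fill(-5.0, 200),
    fill(-5.0, 60),
    vcat(fill(-5.0, 40), fill(-10.0, 5)),
]

episode_sum  = sum.(reward_total)     # total reward per episode
episode_mean = mean.(reward_total)    # mean per-step reward per episode
episode_len  = length.(reward_total)  # number of steps per episode

println("sum:    ", episode_sum)
println("mean:   ", round.(episode_mean, digits = 2))
println("length: ", episode_len)
```

With a constant per-step penalty, the mean stays at -5 no matter how long the episode is and only moves when the -10 penalty is hit, so it says little about path length. The sum, or simply the episode length, shows more directly whether the agent is reaching the goal in fewer steps.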

We had slightly different versions of some packages, but not something that should make a difference. So I don’t really know what it could be.

Does it work for you now? If it does, can you try a few different seeds and see whether the learning is just a bit unstable, with some seeds giving results more like the one you saw before?

I ran it for 1000 episodes. It doesn’t seem to be getting any better; the reward stays steady and the agent is not learning any further.

Okay, but now you have something that looks slightly more reasonable at least. Do you know what average reward you expect from a good agent? Is it very far away?

It could be worth trying some hyperparameter optimization, either just manually or you could use something like

Yes, I know that a good agent should complete an episode in under 50 steps. I have tried tuning hyperparameters manually but without any result. I’ll try using the package but I’m really worried that it won’t make any difference.

If you have tried some manual tuning, it might not make any difference; that package just helps automate the task.

Something I often find helpful when debugging RL is to implement some reasonable agent that I can run against my environment, just to verify that the environment is not incorrectly implemented, by checking that I can reach some expected reward with it. Do you have some simple heuristic that would let you implement something like that?

You would just need to create your policy struct and implement one method for it, and then you should be able to use it as a drop-in replacement for the `Agent` in your code.
Something like:

``````
struct MyAgent <: AbstractPolicy
    # Can keep some internal state here if needed
end

function (agent::MyAgent)(env)
    s = state(env)
    a = ... # Probably some function of s
    return a
end
``````
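One possible heuristic is a greedy walk toward the goal. The sketch below runs it on a hypothetical stand-in for the real environment: an open grid with no islands, four moves, and the -5 step reward from the script. The names `MOVES`, `greedy_action` and `run_baseline` are made up, and the start and goal cells are arbitrary. Wrapping `greedy_action` in a policy struct like the one above is left out.

```julia
# Hypothetical four-direction move set on an open grid (no islands, no speed).
const MOVES = [CartesianIndex(-1, 0), CartesianIndex(1, 0),
               CartesianIndex(0, -1), CartesianIndex(0, 1)]

# Greedy heuristic: choose the move index that minimizes the Manhattan
# distance to the goal after the move.
function greedy_action(pos::CartesianIndex{2}, goal::CartesianIndex{2})
    dist(p) = abs(p[1] - goal[1]) + abs(p[2] - goal[2])
    return argmin(i -> dist(pos + MOVES[i]), eachindex(MOVES))
end

# Follow the heuristic until the goal is reached, accumulating the step reward.
function run_baseline(start, goal; step_reward = -5.0, max_steps = 1_000)
    pos, total, steps = start, 0.0, 0
    while pos != goal && steps < max_steps
        pos += MOVES[greedy_action(pos, goal)]
        total += step_reward
        steps += 1
    end
    return steps, total
end

steps, total = run_baseline(CartesianIndex(1, 1), CartesianIndex(10, 20))
println("steps: ", steps, ", total reward: ", total)
```

On an open grid this gives the shortest path length as a reference value. In the real environment the heuristic would also need to avoid the polygons, so its score is only a rough benchmark for what a trained agent should reach.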

You also mentioned that you had tried methods other than DQN; did they reach any better scores?

Yes, I have tried TD learner. The result is the expected one. The path is finished in approximately 45 steps and the agent does that using the maximum available speed.

That’s a really nice idea for checking the environment. I’ll keep it in mind. I’ll try it out just for educational purposes. Thank you for sharing with me. However, I suppose that the TD learner would not get to the desired result if there was anything wrong with the environment.

I will try a more detailed hyperparameter tuning tomorrow in case it’s more sensitive than I initially thought. I will write down the results and share them with you.

Yeah, if the TD learner manages it, it does indeed seem like something is amiss with the DQN agent. I have run the DQN agent on some classical environments and it seems to learn (though a bit more unstably than some other algorithms). So I don’t think there is necessarily something wrong with it, just that it is a bit unstable. And depending on the dynamics of your environment, this might make it prone to getting stuck with suboptimal policies.

My best bet would be to do a little more hyperparameter search, then, if you want to get better results out of DQN specifically.
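A manual search can be organized as a small grid. The sketch below enumerates combinations within the ranges mentioned earlier in the thread and picks the best one according to a scoring function. Here `score` is a hypothetical placeholder with an arbitrary formula so that the example runs on its own; in practice it would build the agent with the given settings, run the experiment, and return something like the mean episode length over the last episodes.

```julia
# Candidate values, chosen within the ranges mentioned earlier in the thread.
grid = Iterators.product(
    (20, 50, 200),   # batch_size
    (1, 2, 3),       # update_horizon
    (4, 50, 100),    # target_update_freq
    (200, 1000),     # trajectory capacity
)

# Hypothetical placeholder objective (lower is better). Replace with a function
# that trains the agent with these settings and returns a real measurement.
score(batch_size, update_horizon, target_update_freq, capacity) =
    abs(batch_size - 50) + abs(update_horizon - 2) +
    abs(target_update_freq - 50) / 10 + capacity / 1000

# Evaluate every combination and keep the one with the lowest score.
configs = collect(grid)
best = argmin(cfg -> score(cfg...), configs)

println("combinations tried: ", length(configs))
println("best configuration: ", best)
```

Because DQN results vary between seeds, each configuration would ideally be scored over several seeds rather than a single run.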