I’m trying to use the above-named ReinforcementLearning.jl experiment to model training a DQN that calls out to another program, which introduces some lag time. I can, of course, run the experiment, but I can’t quite figure out where to insert a small sleep() call to simulate the external call. My first instinct is to override the _run() function, but I’m a Julia noob and can’t quite figure it out.
My next thought is to wrap everything in an Agent() and put the sleep() in there but attempts have so far been fruitless…
I would want to put the lag time in each step. In the real project environment, the call out to the external program is the step… So, add the hook as a parameter to the learner? I’m still really hazy on how all these parts work.
Adding a sleep will really just make the program run slower, since the environment and the agent are both executed at discrete steps. So I think what you are asking for is an emulated delay somewhere in the loop between the environment (observing the state), the agent (selecting an action from the state), and back to the environment (enacting that action), which makes control more difficult.
This could probably be done with some wrapper to either the state or the action where you implement a buffer holding values for some time-steps before they are used, thus introducing a delay.
See here for examples, e.g. how the StateTransformedEnv is implemented, and create a StateDelayedEnv, something like this:
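using ReinforcementLearning

# Sketch only: assumes the callable-env API (`env(action)`) that StateTransformedEnv uses.
struct StateDelayedEnv{P,E<:AbstractEnv} <: AbstractEnvWrapper
    env::E
    buffer::P  # past states, oldest first
end

# Pre-fill with copies of the initial state; a circular buffer of the desired length may be a better fit.
StateDelayedEnv(env; delay=1) =
    StateDelayedEnv(env, [copy(state(env)) for _ in 0:delay])

function (env::StateDelayedEnv)(args...; kwargs...)
    env.env(args...; kwargs...)
    popfirst!(env.buffer)                    # drop the oldest state
    push!(env.buffer, copy(state(env.env)))  # push the newest state
end

# The agent sees the oldest buffered state, i.e. one that is `delay` steps old.
# (A full version would also refill the buffer on reset!.)
RLBase.state(env::StateDelayedEnv, args...; kwargs...) = first(env.buffer)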
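To see what such a buffer does on its own, independent of ReinforcementLearning.jl, here is a minimal self-contained sketch of a fixed-length delay line. The names `DelayLine` and `push_observe!` are made up for illustration and are not part of any package; the integers stand in for environment states.

```julia
# Hypothetical helper, not part of ReinforcementLearning.jl:
# a FIFO that hands back the value pushed `delay` steps earlier.
struct DelayLine{T}
    buffer::Vector{T}
end

# Pre-fill with `delay` copies of the initial value so early reads are defined.
DelayLine(initial::T, delay::Integer) where {T} =
    DelayLine{T}([deepcopy(initial) for _ in 1:delay])

# Store the newest value and return the oldest one still in the buffer.
function push_observe!(d::DelayLine{T}, newest::T) where {T}
    push!(d.buffer, deepcopy(newest))
    return popfirst!(d.buffer)
end

line = DelayLine(0, 2)                        # delay of two steps
seen = [push_observe!(line, s) for s in 1:5]  # true states are 1, 2, 3, 4, 5
println(seen)                                 # prints [0, 0, 1, 2, 3]
```

The agent would be choosing actions from `seen` while the environment has already moved on by two steps, which is what makes control harder. The `deepcopy` is there because some environments mutate their state array in place, so storing the array itself would silently update every buffered entry.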
TYVM for your help! I’ll have to think this through a little more, maybe. The “harder to control” aspect is more a side-effect than a desired outcome. The more complete story is, we are investigating using a DQN in our RL problem but a very rough back-of-an-envelope calculation shows that training will take ~4 months (in Python). I had the idea to work in Julia to speed that up and the lead researcher has asked me to create a simulation in Julia using CartPole and DQN that has a 1 second lag representing a rough approximation of the request-response time so we can judge what training time will be like.
I like the idea of just adding a hook like DoEveryNEpisode using sleep(1) as the function and n=1. This seems fairly simple and straightforward to insert, and being a Julia newbie, simple and straightforward is good. I’m assuming I can insert the DoEveryNEpisode hook at the bottom of the function where the TotalRewardPerEpisode hook is in the JuliaRL_BasicDQN_CartPole experiment. I’m probably being way too naive here, but that’s how we learn…
though with that said, I don’t really understand what the benefit of this would be.
If you want to simulate an environment that has a delay in it, thus making control harder because the state information is old when the action is selected or similar, this will not achieve that. That would be more the env wrapper approach I mentioned in the previous post.
If, on the other hand, you just want to simulate that training is slower for your real problem because communicating the data and updated weights takes a little time or something similar, this could work. But it seems far easier to just measure the time without the sleep, count the number of episodes, and add one second per episode to the measured time afterwards. Maybe I’m missing something here, but if you add a 1-second delay to each episode, which does not affect training and just slows down code execution, this wouldn’t really achieve anything more than running at full speed and adding the delay for each episode afterwards, right?
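The "measure, then add" estimate can be written down in a few lines. This is only a sketch of the arithmetic: every number below is a made-up placeholder, and it assumes each external call blocks for the full lag with nothing running in parallel.

```julia
# Hypothetical numbers for illustration only; substitute your own measurements.
measured_seconds = 120.0  # wall-clock time of the run without any sleep
n_calls = 10_000          # steps (or episodes) that would each hit the external program
lag_seconds = 1.0         # assumed round-trip time per call

# Total time if every call blocks for `lag` seconds and nothing overlaps.
estimated_seconds(measured, calls, lag) = measured + calls * lag

total = estimated_seconds(measured_seconds, n_calls, lag_seconds)
println("Estimated training time: ", round(total / 3600; digits=2), " hours")
```

Whether `n_calls` counts steps or episodes matters a lot here: a lag applied at every step adds far more time than one applied once per episode.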
Thanks for the reply. I found ComposedHook() today and I think it does what I need.
The last part of the story is that my boss has been running a ton of Python (PyTorch) RL experiments and he’s apparently found some odd bottlenecks that are causing our current panic. I thought maybe Julia could be an answer, and when I suggested it, my boss suggested putting in the 1-second lag to simulate the return from the environment. (Also, don’t tell my boss, but I’ve been a bit bored and looking for a reason to learn Julia…)
Anyway, thanks for the help. It’s great to see a helpful community, unlike some languages I’ve used in the past!