I am trying to replicate some results from a PyTorch model. I believe layers in Flux have their weights initialized quite differently from PyTorch layers. For example, Flux layers start with zero bias, while PyTorch layers initialize their biases with small random values by default.
Does anyone know how I can initialize Flux weights the same way PyTorch does?
I suspect that as Julia and Flux grow in popularity, I won't be the only one wanting to do this.
The PyTorch layer docs generally specify how each parameter is initialized. It’s also a single click to see the source, which shows you exactly which functions are being called. I believe the utility functions in Flux (Utility Functions · Flux) already implement almost all of torch.nn.init (torch.nn.init — PyTorch 1.7.1 documentation).
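For example, a minimal sketch (assuming a Flux version around 0.11 whose Dense still takes the initW keyword):

using Flux

W = Flux.kaiming_uniform(64, 128)   # 64×128 matrix drawn from U(-sqrt(6/128), sqrt(6/128))
layer = Dense(128, 64, relu, initW = Flux.kaiming_uniform)

# Note: PyTorch's nn.Linear calls kaiming_uniform_ with a = sqrt(5), which changes the
# gain, so passing Flux.kaiming_uniform with its default gain won't reproduce it exactly.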
Edit: Actually, this post isn’t quite right. The initW part is correct, but as far as I can tell there is no way to use Flux.kaiming_uniform to initialize the biases the same way PyTorch does.
I came up with this function to initialize the weights the same way PyTorch does:
using Flux

# PyTorch's nn.Linear draws both weights and biases from U(-1/sqrt(in), 1/sqrt(in));
# (rand(...) .- 0.5) .* (2/sqrt(in)) reproduces that range.
function Linear(in, out, activation)
    Dense(in, out, activation,
          initW = (_dims...) -> Float32.((rand(out, in) .- 0.5) .* (2 / sqrt(in))),
          initb = (_dims...) -> Float32.((rand(out) .- 0.5) .* (2 / sqrt(in))))
end
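A quick note on the scaling: rand is uniform on (0, 1), so subtracting 0.5 and multiplying by 2/sqrt(in) gives U(-1/sqrt(in), 1/sqrt(in)), which is exactly PyTorch's U(-sqrt(k), sqrt(k)) with k = 1/in_features for nn.Linear. The wrapper then drops in anywhere you'd use Dense, for example:

layer = Linear(64, 32, relu)   # behaves like Dense(64, 32, relu), just with PyTorch-style init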
At least, that’s how it works for PyTorch’s Linear layers. You can easily verify this by creating a PyTorch Linear layer and looking at the minimum and maximum weight and bias values.
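On the Flux side the same sanity check looks like this (a sketch, assuming an older Flux whose Dense stores its parameters as W and b; newer versions use weight and bias):

layer = Linear(64, 32, relu)   # the wrapper from above; in = 64

1 / sqrt(64)                   # 0.125, the expected bound
extrema(layer.W)               # ≈ (-0.125, 0.125)
extrema(layer.b)               # within (-0.125, 0.125)

# torch.nn.Linear(64, 32).weight and .bias should show the same range.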
I never did manage to replicate the particular Q-learning algorithm I was trying to. I eventually saw that the RMSProp implementations in PyTorch and Flux are different: PyTorch’s RMSProp has a smoothing parameter and Flux’s doesn’t. I don’t know if that alone was the cause of my problems.
Full story: I successfully replicated a simpler Q-learning algorithm, and was getting metrics very similar to the implementation I was trying to replicate. Then I tried this more complicated algorithm and wasn’t able to get similar results with similar hyper-parameters. I was able to get both algorithms to work in Flux, ultimately, but it sometimes required different hyper-parameters.