Initializing Flux weights the same as PyTorch?

DevJac · February 4, 2021, 11:09pm

I am trying to replicate some results from a PyTorch model. I believe layers in Flux have their weights initialized quite differently than in PyTorch. For example, Flux layers have 0 bias to begin with, but PyTorch layers do have some bias by default.

Does anyone know how I can initialize Flux weights the same way PyTorch does?

I suspect that as Julia and Flux grow in popularity, I won’t be the only one wanting to do this.

ToucheSir · February 5, 2021, 1:26am

The PyTorch layer docs will generally specify how each parameter is initialized. It’s also a single click to see the source, which shows you exactly which functions are being called. I believe the functions listed under Utility Functions · Flux already cover almost all of torch.nn.init (PyTorch 1.7.1 documentation).

DevJac · February 5, 2021, 1:45am

Dense(512, 128, initW=(dims...) -> Flux.kaiming_uniform(dims...; gain=sqrt(1/3)), initb=(dims...) -> Flux.kaiming_uniform(dims...; gain=sqrt(1/3)))

This seems to initialize the weights similarly to PyTorch. I don’t know why PyTorch uses a gain of sqrt(1/3), but that’s what the source seems to show.

See: pytorch/linear.py at master · pytorch/pytorch · GitHub

In the future you can just do:

Dense(512, 128, initW=Flux.kaiming_uniform(gain=sqrt(1/3)), initb=Flux.kaiming_uniform(gain=sqrt(1/3)))

That will compile now, but it won’t work. You’ll have to wait for my Flux PR to be merged before this shorter line works: Fix layer init functions kwargs getting overwritten by DevJac · Pull Request #1499 · FluxML/Flux.jl · GitHub

Edit: Actually, this post isn’t quite right. The initW is correct, but there is no way to use Flux.kaiming_uniform to initialize the biases the same way as PyTorch, as far as I can tell.
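On the gain of sqrt(1/3): as far as I can tell from the linked linear.py, PyTorch calls kaiming_uniform_ on the weight with a = sqrt(5), and the leaky-ReLU gain formula then gives sqrt(1/3). A short derivation, assuming the usual Kaiming-uniform bound of gain · sqrt(3 / fan_in), where fan_in is the layer's input size:

```latex
\begin{aligned}
\text{gain}  &= \sqrt{\frac{2}{1 + a^2}} = \sqrt{\frac{2}{1 + 5}} = \sqrt{\tfrac{1}{3}}, \qquad a = \sqrt{5} \\
\text{bound} &= \text{gain} \cdot \sqrt{\frac{3}{\text{fan\_in}}}
              = \sqrt{\tfrac{1}{3}} \cdot \sqrt{\frac{3}{\text{fan\_in}}}
              = \frac{1}{\sqrt{\text{fan\_in}}}
\end{aligned}
```

So the weights end up uniform on [-1/sqrt(fan_in), 1/sqrt(fan_in)]. PyTorch appears to draw the bias from that same interval, using the fan_in of the weight matrix. That would explain the problem with initb: an init function that only receives the bias dimensions (out) has no way to know in.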

DevJac · February 5, 2021, 5:05am

I came up with this function to initialize the weights the same way PyTorch does:

function Linear(in, out, activation)
    Dense(in, out, activation,
          initW=(_dims...) -> Float32.((rand(out, in).-0.5).*(2/sqrt(in))),
          initb=(_dims...) -> Float32.((rand(out).-0.5).*(2/sqrt(in))))
end

At least, for PyTorch’s Linear layers that’s how it works. You can easily verify this by creating a PyTorch Linear layer and looking at the minimum and maximum weight and bias values.
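The same min/max check can be done on the Julia side. Below is a minimal sketch in plain Julia, with no Flux dependency, that draws the arrays the way the Linear function above does and confirms they stay within ±1/sqrt(in). The name torch_linear_params is a hypothetical helper for illustration, not part of Flux or PyTorch.

```julia
# Hypothetical helper (not a Flux function): sample a weight matrix and a bias
# vector uniformly from [-1/sqrt(in), 1/sqrt(in)], the same range the
# Linear function above produces.
function torch_linear_params(in::Integer, out::Integer)
    bound = 1f0 / sqrt(Float32(in))
    # rand(Float32, ...) is uniform on [0, 1); shift and scale it to [-bound, bound).
    W = (rand(Float32, out, in) .- 0.5f0) .* (2f0 * bound)
    b = (rand(Float32, out) .- 0.5f0) .* (2f0 * bound)
    return W, b
end

W, b = torch_linear_params(512, 128)
bound = 1f0 / sqrt(512f0)

# Every sample must lie inside the interval.
@assert all(abs.(W) .<= bound)
@assert all(abs.(b) .<= bound)

# With 128 * 512 weight samples the extremes should sit close to ±bound.
println("bound       = ", bound)
println("extrema(W)  = ", extrema(W))
println("extrema(b)  = ", extrema(b))
```

If your Flux version accepts explicit arrays in the Dense constructor, something like Dense(W, b, activation) should then build the layer directly from these, but I have not checked that against every Flux release.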

DevJac · February 9, 2021, 10:04pm

I never did manage to replicate the particular Q-learning algorithm I was working on. I eventually saw that the RMSProp implementations of PyTorch and Flux are different: PyTorch’s RMSProp has a smoothing parameter and Flux’s doesn’t. I don’t know if that alone was the cause of my problems.

Full story: I successfully replicated a simpler Q-learning algorithm, and was getting metrics very similar to the implementation I was trying to replicate. Then I tried this more complicated algorithm and wasn’t able to get similar results with similar hyper-parameters. I was able to get both algorithms to work in Flux, ultimately, but it sometimes required different hyper-parameters.
