I closed a similar topic I opened about an hour ago by mistake, so here I am trying again with a clearer example. The issue is that the same LayerNorm layer shows a large performance difference between PyTorch and Flux, and I don't know whether that is expected.
If anybody could clarify this, or tell me where to look for an answer, I would be grateful. Here are two minimal working examples and the resulting benchmarks (both run in Pluto, in case that affects the results):
In PyTorch (via PythonCall):
using PythonCall, BenchmarkTools
torch = pyimport("torch")
np = pyimport("numpy")
l = torch.nn.LayerNorm(768)
# Wrap a Julia array as a torch tensor via numpy: batch of 2, 128 tokens, 768 features
v = torch.tensor(np.array(rand(Float32, 2, 128, 768)))
@benchmark l(v)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 97.483 μs … 3.871 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 135.234 μs ┊ GC (median): 0.00%
Time (mean ± σ): 138.385 μs ± 42.160 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
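The Flux side of the comparison would look roughly like this (a minimal sketch, assuming Flux's feature-major layout so the 768 features go in the first dimension; `m` and `x` are just placeholder names):

In Flux:
using Flux, BenchmarkTools
m = Flux.LayerNorm(768)          # normalises over the leading feature dimension
x = rand(Float32, 768, 128, 2)   # feature-major layout: 768 features, 128 tokens, batch of 2
@benchmark m(x)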
Ok, thanks! I didn’t go through the code because I thought there was no way I could understand it, but I guess I could have. Then I guess that’s just how things are!
Thanks so much for the explanation, I would never have understood the point on my own. I will look into the code more closely, since it is so straightforward to read, even if I probably won't fully understand everything that is going on.
If I may ask, what would you say is a good starting point in the code for understanding the Flux design?