Lux.jl Chain of Dense layers sees no speedup going from 32 bits to 16 bits

using Lux, BenchmarkTools, Random
const rng  = Random.default_rng()
julia> let model = Chain(
                  Dense(23, 30, Lux.relu),
                  Dense(30, 25, Lux.relu),
                  Dense(25, 20, Lux.relu)
              )
              ps, st = Lux.setup(rng, model)
              @btime $model(inputs, $ps, $st)[1] setup=(inputs=rand(Float32, 23))
           end;
  364.164 ns (12 allocations: 1.17 KiB)

julia> let model = Chain(
                  Dense(23, 30, Lux.relu),
                  Dense(30, 25, Lux.relu),
                  Dense(25, 20, Lux.relu)
              )
              ps, st = Lux.setup(rng, model)
              @btime $model(inputs, $ps, $st)[1] setup=(inputs=rand(Float16, 23))
           end;
  400.045 ns (13 allocations: 1.33 KiB)

For reference, 64 bits is about 2x slower, at 772.561 ns (12 allocations: 1.77 KiB).

(Julia 1.8-rc3)
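
One caveat on the Float16 run above: Lux.setup initializes the parameters in Float32 by default, so only the inputs were Float16 there and the matmuls still promote to Float32. A rough, untested sketch that also converts the parameters (it uses Functors.fmap from Functors.jl, which Lux depends on; ps16 is just my name for the converted parameters):

julia> using Functors

julia> let model = Chain(
                  Dense(23, 30, Lux.relu),
                  Dense(30, 25, Lux.relu),
                  Dense(25, 20, Lux.relu)
              )
              ps, st = Lux.setup(rng, model)
              # convert every parameter array to Float16, leaving everything else untouched
              ps16 = Functors.fmap(x -> x isa AbstractArray ? Float16.(x) : x, ps)
              @btime $model(inputs, $ps16, $st)[1] setup=(inputs=rand(Float16, 23))
           end;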

Julia currently doesn’t use native Float16 operations even when they are available, and it adds some overhead due to the conversions to and from Float32. If you were memory bound you might see some improvement, but I doubt it. In addition, I believe very few x86 chips have Float16 operations, so your mileage may vary. See Hardware Float16 on A64fx · Issue #40216 · JuliaLang/julia · GitHub.
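
If you want to see those conversions on your own machine, one quick, purely illustrative check is to inspect the native code for a Float16 add; on x86 without hardware Float16 arithmetic you should typically see vcvtph2ps / vcvtps2ph instructions around a single-precision add:

julia> using InteractiveUtils  # provides code_native (already loaded in the REPL)

julia> code_native(+, (Float16, Float16))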

That’s a bit unfortunate; if we could squeeze out another ~2x from here we could compete with FPGAs for some real-time ML applications…

It’s still possible that we just need to find a CPU with better single-core performance.

If your hardware has Float16 operations, you could try building Julia from source with the line PM->add(createDemoteFloat16Pass()); in aotcompile.cpp commented out, to check whether there is an improvement to be had there. If your hardware does support it, there might be a 2x improvement.

If those network sizes are representative of what you want to do, then SimpleChains.jl (GitHub - PumasAI/SimpleChains.jl) should give you a significant speedup.

Jesus, that’s fast.

julia> using SimpleChains

julia> let model = SimpleChain(
                static(23),                 # input dimension (optional)
                TurboDense{true}(relu, 30), # dense layer with bias that maps to 30 outputs and applies `relu` activation
                TurboDense{true}(relu, 25),
                TurboDense{true}(relu, 20)
              )
              p = SimpleChains.init_params(model);
              @btime $model(inputs, $p) setup=(inputs=rand(Float32, 23))
           end;
  77.845 ns (0 allocations: 0 bytes)