Flux multi-cpu parallelism?

I’ve hit a wall with trying to optimize my code…

I’m trying to type-II maximum likelihood training for exact Gaussian process using Flux to get the hang of things. As much as I’ve tried to optimize my code, I’m struggling to get performance that is comparable to PyTorch on my laptop CPU. For example, training in PyTorch takes about ~75 seconds while training with Flux takes about ~400 seconds. After checking my code avoids obvious problems (I hope), I’ve tried the following:

  • Setting the JULIA_NUM_THREADS environment variable to the number of available cores.

  • Installing MKL

  • Setting the number of BLAS threads to 12

The most expensive operation in the “forward pass” of the GP is sped up by ~ a factor of 2 when increasing the number of BLAS threads but the training time is increased by ~ 20 seconds.

This makes me think that perhaps the problem is coming from the fact that Flux is not taking advantage of my multiple cpus. After reading posts like this one, I wonder if Flux isn’t intended for this use case? (But maybe this is coming from my own ignorance of Julia+Flux).

I should note that on a previous question it was pointed out that log and exp are slower than they should be but this will be fixed in 1.6. I don’t think that the problems with these functions should be causing the slow down I’m seeing.

Any help that someone could provide would be greatly appreciated here! I’d love to recommend Julia across the board to my lab but I want to make sure I can produce code with comparable performance to PyTorch on problems like the ones we work on first.

For anyone interested, I’ve attached my code below. Note run times were calculated by running

> main(400, 5000)
> @time main(400, 5000) 

from the REPL.

using LinearAlgebra
using Flux 
using Flux.Optimise: update!
using Plots

"""
Computes the distance between each coordinate for each point in x
Returns a matrix whose dimension is n+1 where n is the dimension of x
"""
function difference(x1::AbstractArray{T,2}, x2::AbstractArray{T,2}) where T
    s1, s2 = size(x1)
    t1, t2 = size(x2)
    return reshape(x1, (s1, s2, 1)) .- reshape(x2, (t1, 1, t2)) 
end

"""
Squared exponential kernel with length scale ℓ and σf = exp(logσf)
"""
struct sekernel{T <: AbstractArray}
    logℓ::T
    logσf::T
end
Flux.@functor sekernel
sekernel(ℓ::AbstractFloat, σf::AbstractFloat) = sekernel([ℓ], [σf])

function get_natural_params(k::sekernel)
    return exp(k.logℓ[1]), exp(k.logσf[1])
end

function (k::sekernel)(x1::AbstractArray{T,2}, x2::AbstractArray{T,2}) where T 
    ℓ, σf = get_natural_params(k)
    sqrd = dropdims(sum(difference(x1, x2).^2, dims=1), dims=1)
    return σf^2 .* exp.(-1 / (2 * ℓ^2) * sqrd)
end

"""
Gaussian process container 
x : training inputs, f : training targets, k : kernel struct, logσn : i.i.d noise 
"""
struct GP{U,S <: AbstractArray,T <: AbstractArray,I <: AbstractArray}
    x::S
    f::T
    k::U
    logσn::I
end
Flux.@functor GP
function GP(x::AbstractArray, f::AbstractArray, logσn::AbstractFloat, k)
    return GP(x, f, k, [logσn])
end
function get_natural_params(g::GP)
    return exp(g.logσn[1])
end
Flux.trainable(g::GP) = (g.logσn, g.k)
function (g::GP)(xs::AbstractArray)
    σn = get_natural_params(g)
    μ, Σ = infer(g.f, g.k(g.x, g.x) + I .* σn^2, g.k(g.x, xs), g.k(xs, xs))
    return (μ, Σ)
end

""" 
Function for performing inference with a GP 
"""
function infer(f::AbstractArray{T}, # check this (maybe just AbstractArray{T}?)
                  K11::AbstractArray{T,2}, 
                  K12::AbstractArray{T,2}, 
                  K22::AbstractArray{T,2}) where T
    C = cholesky(Symmetric(K11)) # adding 1e-12 for numerical stability
    μ = K12' * (C \ f)
    Σ = K22 - K12' * (C \ K12)
    return μ, Σ
end

""" 
Computes the log_mll for a Gaussian process 
"""
function log_mll(f::AbstractArray{T}, K11::AbstractArray{T,2}) where T
    C = cholesky(Symmetric(K11))
    return -0.5 * f' * (C \ f) - 0.5 * logdet(C) - length(f) / 2.0 * log(2.0 * π)
end
function log_mll(g::GP)
    σn= get_natural_params(g)
    return log_mll(g.f, g.k(g.x,g.x) + I .* σn^2)
end

"""
type-ii maximum likelihood training of GP 
"""
function train_gp!(model::GP, epochs::Integer, lr=1e-1)
    opt = Flux.Optimise.ADAM(lr)
    ps = Flux.params(model)
    for i in 1:epochs
        gs = gradient(ps) do
            return -1.0*log_mll(model)
        end
        update!(opt, ps, gs)
        if i % 1000 == 0
            @info "Epoch $i | len $(exp.(model.k.logℓ))  | σf $(exp.(model.k.logσf)) | σn $(exp.(model.logσn)) | lml $(log_mll(model))"
        end
    end
end


function main(num_data, epochs)
    x = collect(LinRange(0., 10., num_data))
    y_true = x + sin.(x) * 5 .- 10.0 
    y = y_true + randn(length(x)) * 2.0

    model = GP(reshape(x, (1, num_data)), y, -4.0, sekernel(0.1, -4.0))	

    train_gp!(model, epochs)

    xt = collect(LinRange(0., 10., 500))
    μ, Σ = model(reshape(xt,(1,500)) )
    σ = 2 * sqrt.(diag(Σ))

    p = scatter(x, y, label="Measurements", alpha=0.5, xlabel="x", ylabel="y")
    p = plot!(x, y_true, linestyle=:dash, label="Generating Func.", linewidth=2)
    p = plot!(xt, μ, ribbon=σ, label="Exact GP", linewidth=2)
end
1 Like

There are couple of things at play here:

  1. Flux and PyTorch are equivalent in that both frameworks execute all non-accelerated (e.g. not BLAS, NNPACK, CuDNN) code single-threaded and shell out to ops that are usually multi-threaded. There can be gains from running non-BLAS Flux code on multiple threads/processes, but that should (in theory) not be required to match a normal PyTorch training loop and isn’t even the same programming model (i.e. the PyTorch equivalent would require a GIL-less Python).
  2. Current Flux incurs a large time penalty for the TTFG (time to first gradient) on both the forward and backwards pass. Unless you’re only calling gradient once per REPL session/script, the recommended way to benchmark is to “pre-warming” the forward pass by inserting a model(<dummy data>) call before the training loop you want to benchmark. This won’t reduce absolute execution times, but it will help isolate the parts you can control.
  3. exp and derived functions like tanh are the biggest bottleneck for Flux on CPU right now. IIRC the performance delta is under 10% for a normal dense relu network. I’d have a look at some of the “fast exp” options mentioned in Flux vs pytorch cpu performance and see if those work for you. Also worth a run through a profiler to see what other bottlenecks may be present.
  4. Flux and PyTorch performance are far closer on GPU. If you intend to train and experiment on GPU anyhow, it’s worth testing there as well.
  5. Are both Flux and PyTorch using Float64? If the latter is on Float32 by default, that could also be a performance advantage.

I don’t want to sugar-coat things: Flux does not currently benchmark well on CPU for “relatively linear” (i.e. non-SciML) models, and the “Zygote era” has certainly not helped. On the bright side, here are a couple improvements you can look forward to in the not-so-distant future:

  1. A new-and-improved AD system that should dramatically improve upon Zygote’s performance (the biggest piece of TTFG).
  2. CPU-specific tuning in Flux and NNlib. For example, https://github.com/FluxML/NNlib.jl/pull/191 (mentioned in your previous thread) has since landed in an NNlib release.
5 Likes

Good news: Julia 1.6 should be about 5x faster for exp and 3x faster for tanh.

Also LoopVectorization should give another 2x speedup over that or so since it provides hand vectorized versions with slightly loser error tolerances (2-3 ULP instead of <1.5 ULP).

4 Likes

That relied on @inline and @simd ivdep. @simd ivdep had to be applied on the user side, and @inline on the function definition itself. However, the @inline was removed.

BetterExp.exp is still @inline, so we can see a large performance difference between it and Base.exp when using @simd ivdep:

julia> using BenchmarkTools, BetterExp

julia> function simdforeach!(f, y, x)
           @inbounds @simd ivdep for i ∈ eachindex(x,y)
               y[i] = f(x[i])
           end
       end
simdforeach! (generic function with 1 method)

julia> x256 = randn(256) .*= 10; y256 = similar(x256);

julia> x257 = randn(257) .*= 10; y257 = similar(x257);

julia> x255 = randn(255) .*= 10; y255 = similar(x255);

julia> @btime simdforeach!(exp, $y255, $x255)
  934.000 ns (0 allocations: 0 bytes)

julia> @btime simdforeach!(BetterExp.exp, $y255, $x255)
  170.070 ns (0 allocations: 0 bytes)

julia> @btime simdforeach!(exp, $y256, $x256)
  936.483 ns (0 allocations: 0 bytes)

julia> @btime simdforeach!(BetterExp.exp, $y256, $x256)
  154.595 ns (0 allocations: 0 bytes)

julia> @btime simdforeach!(exp, $y257, $x257)
  940.107 ns (0 allocations: 0 bytes)

julia> @btime simdforeach!(BetterExp.exp, $y257, $x257)
  158.750 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.6.0-DEV.1491
Commit ba06f439d1 (2020-11-14 02:48 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, tigerlake)

BetterExp.exp is around 6x faster than Base.exp on Julia 1.6 on my computer (when using @simd ivdep).

For comparison, using LoopVectorization on Julia 1.5:

julia> using LoopVectorization, BenchmarkTools

julia> x256 = randn(256) .*= 10; y256 = similar(x256);

julia> x257 = randn(257) .*= 10; y257 = similar(x257);

julia> x255 = randn(255) .*= 10; y255 = similar(x255);

julia> @btime @avx @. $y255 = exp($x255);
  99.572 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $y256 = exp($x256);
  98.998 ns (0 allocations: 0 bytes)

julia> @btime @avx @. $y257 = exp($x257);
  104.570 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.4-pre.0
Commit 599ecd8210 (2020-11-10 10:50 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, tigerlake)
1 Like

I’d forgotten how much the @inline helped. Hopefully that can be added to the Base version at some point. There was a worry that it would lead to slower real-world code due to register pressure etc. Is there a good way to benchmark this across a wide range of user applications so I can make the 1 line PR to re-add @inline? It was removed mainly because the @inline was slightly controversial and I wanted to get the hard part merged, but now that it’s merged, I think that with a little data we might be able to re-add the @inline

For LLVM to be able to vectorize the function, it must be inlined. That is why it is crucial.
Because of the lookup table, we also need @simd ivdep for vectorization, otherwise LLVM fails to prove that the table doesn’t alias.

So the ideal thing would be to make it so that inlining is much more aggressive inside @simd ivdep for loops – aggressive enough to inline exp. Then the function can inline when there’s the chance for massive benefit (i.e., when inlining can enable vectorization), but won’t inline otherwise.

If we test random unrolled code, inlining doesn’t help nearly as much:

julia> f_inline(x::NTuple{N}) where {N} = ntuple(@inline(i -> BetterExp.exp(x[i])), Val(N))
f_inline (generic function with 1 method)

julia> f_noinline(x::NTuple{N}) where {N} = ntuple((i -> BetterExp.exp(x[i])), Val(N))
f_noinline (generic function with 1 method)

julia> x3 = ntuple(_ -> 10randn(), Val(3));

julia> x4 = ntuple(_ -> 10randn(), Val(4));

julia> x5 = ntuple(_ -> 10randn(), Val(5));

julia> x8 = ntuple(_ -> 10randn(), Val(8));

julia> @btime f_inline($x3)
  11.543 ns (0 allocations: 0 bytes)
(0.0001799829216836168, 438.0048703195676, 0.026253159432962376)

julia> @btime f_noinline($x3)
  12.367 ns (0 allocations: 0 bytes)
(0.0001799829216836168, 438.0048703195676, 0.026253159432962376)

julia> @btime f_inline($x4)
  13.022 ns (0 allocations: 0 bytes)
(2.895112921784915e-5, 3.3413971489423987e-9, 1.1891815171456396e-6, 0.009728278736790536)

julia> @btime f_noinline($x4)
  16.422 ns (0 allocations: 0 bytes)
(2.895112921784915e-5, 3.3413971489423987e-9, 1.1891815171456396e-6, 0.009728278736790536)

julia> @btime f_inline($x5)
  16.000 ns (0 allocations: 0 bytes)
(7.091591273637825e-10, 4.076330221332849e-9, 6.081974920592948e8, 9.138314997534115e-7, 2.331481510779992e-6)

julia> @btime f_noinline($x5)
  20.699 ns (0 allocations: 0 bytes)
(7.091591273637825e-10, 4.076330221332849e-9, 6.081974920592948e8, 9.138314997534115e-7, 2.331481510779992e-6)

julia> @btime f_inline($x8)
  25.700 ns (0 allocations: 0 bytes)
(0.0002007185053474872, 0.5034954054504044, 0.02126212211489744, 0.0038127283883166246, 791725.4233061415, 1565.597292172173, 1.8374960817981415e10, 1223.294964858633)

julia> @btime f_noinline($x8)
  33.785 ns (0 allocations: 0 bytes)
(0.0002007185053474872, 0.5034954054504044, 0.02126212211489744, 0.0038127283883166246, 791725.4233061415, 1565.597292172173, 1.8374960817981415e10, 1223.294964858633)

The SLP vectorizer fails to SIMD the unrolled code.
I experimented with @simd ivdep loops like

function f_inline_simd(x::NTuple{N}) where {N}
    y = ntuple(_ -> 0.0, Val(N))
    @inbounds @simd ivdep for n ∈ 1:N
        y = Base.setindex(y, BetterExp.exp(x[n]), n)
    end
    y
 end
using StaticArrays
function f_inline_simd_v2(x::NTuple{N}) where {N}
    y = MVector{N,Float64}(undef)
    @inbounds @simd ivdep for n ∈ 1:N
        y[n] = BetterExp.exp(x[n])
    end
    y.data
end

I think these failed because of the control flow.
The GPL-ed implementation in GLIBCVectorMath is simpler:

julia> using StaticArrays, BetterExp, GLIBCVectorMath, BenchmarkTools

julia> x8 = ntuple(_ -> 10randn(), Val(8));

julia> function f_inline_simd_v2(x::NTuple{N}) where {N}
           y = MVector{N,Float64}(undef)
           @inbounds @simd ivdep for n ∈ 1:N
               y[n] = BetterExp.exp(x[n])
           end
           y.data
       end
f_inline_simd_v2 (generic function with 1 method)

julia> function f_inline_simd_v3(x::NTuple{N}) where {N}
           y = MVector{N,Float64}(undef)
           @inbounds @simd ivdep for n ∈ 1:N
               y[n] = GLIBCVectorMath.gexp(x[n])
           end
           y.data
       end
f_inline_simd_v3 (generic function with 1 method)

julia> @btime f_inline_simd_v2($(Ref(x8))[])
  27.192 ns (0 allocations: 0 bytes)
(20.618118362539537, 1.2218788833081655, 12.885978759844296, 3279.062301568345, 8261.353401512362, 110815.80909553879, 10.479053765988528, 0.03200963068220368)

julia> @btime f_inline_simd_v3($(Ref(x8))[])
  4.820 ns (0 allocations: 0 bytes)
(20.61811836253954, 1.2218788833081655, 12.885978759844297, 3279.062301568345, 8261.353401512362, 110815.80909553879, 10.479053765988528, 0.03200963068220367)

julia> f_inline_v4(x::NTuple{N}) where {N} = ntuple(@inline(i -> GLIBCVectorMath.gexp(x[i])), Val(N))
f_inline_v4 (generic function with 1 method)

julia> @btime f_inline_v4($(Ref(x8))[])
  4.810 ns (0 allocations: 0 bytes)
(20.61811836253954, 1.2218788833081655, 12.885978759844297, 3279.062301568345, 8261.353401512362, 110815.80909553879, 10.479053765988528, 0.03200963068220367)

8 exps in 4.8 ns looks pretty good, but and it is SIMD, but the autovectorizer goofed and instead of using a gather, is loading from the table in an incredibly awkward way.
Using explicit SIMD is thus much faster:

julia> using SIMDPirates

julia> sx8 = SVec(Core.VecElement.(x8));

julia> @btime GLIBCVectorMath.gexp($(Ref(sx8))[])
  3.061 ns (0 allocations: 0 bytes)
SVec{8,Float64}<20.61811836253954, 1.2218788833081655, 12.885978759844297, 3279.062301568345, 8261.353401512362, 110815.80909553879, 10.479053765988528, 0.03200963068220367>

Maybe simplifying the control flow will help. Note that I was lazy and didn’t actually implement it for gexp; that implementation is cheating compared to BetterExp and will be incorrect for extreme values. Thus it may not be possible to simplify BetterExp.exp enough to get it to SIMD inside ntuples.

There are probably a lot of isolated calls to exp, where inlining won’t provide much benefit, and could possibly regress compile times, so a comprehensive test suite would be good.
If we get the definition simple enough so that the ntuple will vectorize (like it does for GLIBCVectorMath.gexp, then it may be worth inlining in more cases than just loops. But for now, making inlining in loops more aggressive would probably be the ideal fix.

1 Like

Would it be possible to teach Julia that tuples never alias? that seems it would be possible, and would remove the need for the ivdep

1 Like

LLVM should already know that thanks to TBAA. I recall you made the lookup table a tuple for this reason.

The @code_llvm seems to show the table as a constant (see the @_j_const1 = private unnamed_addr constant [256 x i64]):

; julia> @code_llvm debuginfo=:none dump_module=true exp(1.3)
; ModuleID = 'exp'
source_filename = "exp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@_j_const1 = private unnamed_addr constant [256 x i64] [i64 0, i64 -6142897681236050754, i64 -7250770945965870283, i64 -6102340764187295104, i64 -5926688068634967968, i64 7304899980040453068, i64 -6368016111728951420, i64 -8881012293317012303, i64 -8254999500979600012, i64 -6552626363961172088, i64 -6282397874615617827, i64 -8998055904294033354, i64 7025764149530827720, i64 -6579597711007938593, i64 -7980204547173321853, i64 -8804350596743218919, i64 -7448754392280758001, i64 -7714454019193410427, i64 -8065722004454357001, i64 -6768672491432223253, i64 4625447420914028962, i64 -6507437967993230908, i64 -5890431894142659814, i64 -8268319537450507458, i64 -6588463881359566000, i64 -6800120033530991947, i64 -5804811450238357045, i64 -5872352343580605270, i64 -6457807158549860897, i64 -6264139202367114963, i64 -9060861363044905391, i64 6409016600702337986, i64 -6083955485986434694, i64 8111403855127100352, i64 -6381166394007056919, i64 -7736736493901226705, i64 -8191586632022897803, i64 -7516033227277759972, i64 -6678350199451788077, i64 -8069948950579766165, i64 -7061129063625606742, i64 -6592741095162757082, i64 -6408079866331342564, i64 -6772857755046195636, i64 -8232010316118848298, i64 -7831176193893685223, i64 -9033623501717775210, i64 -8272501334612569391, i64 -7371767541561728456, i64 -8448113947327720438, i64 -8448100004451408321, i64 -7790560478176031115, i64 -9029036357118165795, i64 -7060949263364268247, i64 -6714157997352459392, i64 -6106157914738541713, i64 -8785785521786855711, i64 -6678086686518210798, i64 -5786359712113305201, i64 -6524935764185769170, i64 -8069656110822063349, i64 -7128389424309332599, i64 8197374510970580822, i64 -5862849083290580886, i64 -8722620365124028651, i64 -8551469058473676035, i64 -8461382505746081743, i64 -6786028844707419599, i64 -7844360117898963201, i64 -6114963182057162196, i64 -6259063651402866704, i64 -6272559691648300998, i64 -6565278868775982646, i64 -6700372018822482615, i64 -6866990326028576667, i64 -6691335021213433618, i64 -5961736921781264394, i64 -7691104178333258508, i64 -6880441204378195740, i64 9215438945652695411, i64 -8443160070567349214, i64 -6249891888336449164, i64 -6623675452593766373, i64 -8371056861974650844, i64 -5835514984389988823, i64 -6641643988865956215, i64 -6087685864315377514, i64 -8609686243585508908, i64 5229890865167698221, i64 -5875970320184651215, i64 -7163984275856097193, i64 -6781162727645957578, i64 8341940443189165264, i64 -6119102295816622383, i64 -7366583685646658594, i64 -6074034843053869098, i64 9026550499758189095, i64 -8046579852547473132, i64 -7429570825608635123, i64 -6528834979095860226, i64 -6510804616381387610, i64 8801450124166640110, i64 -6330628572909181375, i64 -6024367703840567570, i64 -6060380362815746175, i64 -8676955564519543513, i64 -6294535135968783858, i64 -6375583659597426479, i64 -7370862863470745638, i64 -6366543788560431922, i64 -9037161965233387523, i64 -6627719717761120309, i64 -7622998743989070807, i64 -6753787479772857854, i64 4343098583035758263, i64 -6803293868782941294, i64 -7231119162382409400, i64 -6231303328905761936, i64 -6483488146515895288, i64 -7041917693342124300, i64 -6469943688130046588, i64 -7474229506805782019, i64 -6276255061881601249, i64 -6735605233889800266, i64 -6762609795605913052, i64 -6316736350258711563, i64 6932870882017454451, i64 -8645063054064708296, i64 -6767044787892700212, i64 8806419991805368126, i64 -5933844273508394306, i64 -7172316809019492795, i64 -5888773506129913298, i64 -6366137610341369116, i64 -6199486920510040957, i64 6473659981986279047, i64 -6077854580833755277, i64 -6415606906510192503, i64 -6366049616388587149, i64 -8050378134830724099, i64 -6347999685386278968, i64 -7244198172700217847, i64 -7464856667543413441, i64 -7834133901593535242, i64 -6127251658798071418, i64 -6771248372730376230, i64 -6982899473543282382, i64 -7946651663102671120, i64 8095188389446241890, i64 -6059607010233432944, i64 -6428883901107791343, i64 -7752923863426464039, i64 -6820660362739921390, i64 -7379088288885685769, i64 -6212637507786127444, i64 -5825309412123780320, i64 -6910658344377871636, i64 -8802151559561370570, i64 -7378995398493706599, i64 -7108760691389474801, i64 -7572612672765910821, i64 -7005140288556099720, i64 -5915250296514853749, i64 -6595274906835327272, i64 -8806523339128080586, i64 -7667093597184316278, i64 -6838412177916690491, i64 -6113313498326224927, i64 -5870099926971726364, i64 -6302426247689430812, i64 -5964636979952332274, i64 -5825006043503628110, i64 -6811274961439797681, i64 8982868384806478498, i64 -6392401237235857479, i64 -6838238041659408884, i64 8969416262126137488, i64 -6522946791140509484, i64 8658707270930211619, i64 -6135597733315612003, i64 -6459837081213606556, i64 -7175889542941067077, i64 -7621725973128794829, i64 -6725489660316480809, i64 -6815541611708144292, i64 7280745317454340178, i64 -7220825332701416616, i64 -6518243586222248360, i64 -7999907604237517377, i64 -8603369639986762054, i64 -6333535058395925639, i64 -7499946937261742220, i64 -7517940855771311187, i64 -6509114003713494274, i64 -6765798591269325211, i64 -7720541065239164590, i64 -5847022917240547846, i64 -6045160541700798032, i64 -6788233664789128903, i64 -7878083902748637910, i64 -5955025933407920570, i64 -6752142082256188957, i64 -6878221829729614263, i64 -5779322421507363042, i64 -9062425393511022126, i64 -5779280051447815674, i64 -6878137089299898872, i64 -6297151408470641860, i64 -9143404986241822052, i64 -6589842552810478293, i64 -7256353794806525830, i64 -7985915373285606622, i64 -7886814561870294709, i64 -6310533014060742350, i64 8191079523077777333, i64 -6193395890803063590, i64 8434317553931620457, i64 -6026718935264717196, i64 -6301416538828323847, i64 -6702214872381543565, i64 -6796768371535279749, i64 -7683955345205966488, i64 -6819242003766867471, i64 -7791997250069421610, i64 -9012450415488555897, i64 6912300261044315859, i64 -6377799784458517609, i64 -7346051188594352011, i64 -7192906224468544891, i64 -6751530822979124460, i64 -7863897231554661280, i64 -6796521358938601897, i64 -6400181769089606818, i64 -5931784523324228330, i64 -6616308720119471832, i64 -6697350504632619503, i64 -8962638046037327882, i64 -6161375944097960587, i64 -7863713406807225213, i64 -6377502270446258369, i64 -6863867707811131175, i64 -7742046249523819922, i64 -6791763279119226896, i64 -7940157671492516592, i64 -7075442966682052057, i64 -6453922577827610327, i64 -7804978761666151785, i64 -5990004343194160525, i64 -7359074796829784768, i64 -6940216133781273352, i64 -6832105780228080109, i64 -5841289834741351288, i64 -8696547905883959080, i64 -5989860371902311149, i64 7129149566058721777, i64 8408196149339104217]

define double @julia_exp_4268(double %0) {
top:
  %1 = fmul contract double %0, 0x40771547652B82FE
  %2 = fadd contract double %1, 0x4338000000000000
  %3 = bitcast double %2 to i64
  %4 = trunc i64 %3 to i32
  %5 = fadd double %2, 0xC338000000000000
  %6 = fmul contract double %5, 0x3F662E42FEF80000
  %7 = fsub contract double %0, %6
  %8 = fmul contract double %5, 0x3D31CF79ABC9E3B4
  %9 = fsub contract double %7, %8
  %10 = ashr i32 %4, 8
  %11 = and i64 %3, 255
  %12 = getelementptr inbounds [256 x i64], [256 x i64]* @_j_const1, i64 0, i64 %11
  %13 = load i64, i64* %12, align 8
  %14 = and i64 %13, 4503599627370495
  %15 = or i64 %14, 4607182418800017408
  %16 = bitcast i64 %15 to double
  %17 = lshr i64 %13, 8
  %18 = or i64 %17, 4323455642275676160
  %19 = bitcast i64 %18 to double
  %20 = fmul contract double %9, 0x3FA5555565BBF98F
  %21 = fadd contract double %20, 0x3FC555557E55EFED
  %22 = fmul contract double %9, %21
  %23 = fadd contract double %22, 0x3FDFFFFFFFFFFFFB
  %24 = fmul contract double %9, %23
  %25 = fadd contract double %24, 0x3FEFFFFFFFFFFFB1
  %26 = fmul double %9, %25
  %27 = fmul contract double %26, %16
  %28 = fadd contract double %27, %19
  %29 = fadd double %28, %16
  %30 = call double @llvm.fabs.f64(double %0)
  %31 = fcmp ugt double %30, 0x4086232BDD7ABCD2
  br i1 %31, label %L44, label %L68

L44:                                              ; preds = %top
  %32 = fcmp ult double %0, 0x40862E42FEFA39EF
  br i1 %32, label %L47, label %L46

L46:                                              ; preds = %L47, %L44
  %merge = phi double [ 0x7FF0000000000000, %L44 ], [ 0.000000e+00, %L47 ]
  ret double %merge

L47:                                              ; preds = %L44
  %33 = fcmp ugt double %0, 0xC0874910D52D3052
  br i1 %33, label %L50, label %L46

L50:                                              ; preds = %L47
  %34 = icmp sgt i32 %4, -13313
  br i1 %34, label %L68, label %L53

L53:                                              ; preds = %L50
  %narrow = add nsw i32 %10, 53
  %35 = zext i32 %narrow to i64
  %36 = shl i64 %35, 52
  %37 = bitcast double %29 to i64
  %38 = add i64 %36, %37
  %39 = bitcast i64 %38 to double
  %40 = fmul double %39, 0x3CA0000000000000
  ret double %40

L68:                                              ; preds = %L50, %top
  %41 = zext i32 %10 to i64
  %42 = shl i64 %41, 52
  %43 = bitcast double %29 to i64
  %44 = add i64 %42, %43
  %45 = bitcast i64 %44 to double
  ret double %45
}

define nonnull {}* @jfptr_exp_4269({}* %0, {}** %1, i32 %2) #0 {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -32768
  %3 = bitcast {}** %1 to double**
  %4 = load double*, double** %3, align 8
  %5 = load double, double* %4, align 8
  %6 = call double @julia_exp_4268(double %5)
  %7 = call noalias nonnull {}* @jl_gc_pool_alloc(i8* %ptls_i8, i32 1400, i32 16) #1
  %8 = bitcast {}* %7 to i64*
  %9 = getelementptr inbounds i64, i64* %8, i64 -1
  store atomic i64 140545523352272, i64* %9 unordered, align 8
  %10 = bitcast {}* %7 to double*
  store double %6, double* %10, align 8
  ret {}* %7
}

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_obj(i8*, i64, {}*) #1

; Function Attrs: nounwind readnone speculatable willreturn
declare double @llvm.fabs.f64(double) #2

; Function Attrs: nounwind readnone speculatable willreturn
declare double @llvm.pow.f64(double, double) #2

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #3

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #3

; Function Attrs: inaccessiblemem_or_argmemonly
declare void @jl_gc_queue_root({}*) #4

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @jl_gc_pool_alloc(i8*, i32, i32) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @jl_gc_big_alloc(i8*, i64) #1

; Function Attrs: allocsize(1)
declare noalias nonnull {}* @julia.gc_alloc_bytes(i8*, i64) #1

attributes #0 = { "thunk" }
attributes #1 = { allocsize(1) }
attributes #2 = { nounwind readnone speculatable willreturn }
attributes #3 = { argmemonly nounwind willreturn }
attributes #4 = { inaccessiblemem_or_argmemonly }
attributes #5 = { nounwind }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 1, !"Debug Info Version", i32 3}

So I don’t know why it would have a problem. Maybe it’s an LLVM bug?
I need to learn more about how LLVM works so I can look into things like this.
That reminds me there was a Julia issue where Keno was going to walk me through an LLVM fix that I need to get back to…

I really appreciate the detailed response! The community is incredible.

  1. Thanks for the great explanation here. Truth be told I’m a bit clueless about where the Python ends and cpp begins in PyTorch.
  2. For a small kernel size (100x100), “pre-warming” the forward pass got my Flux code to perform about as well as my PyTorch code – thanks for the tip.
  3. I tried out the “BetterExp” package but it doesn’t seem to play well with Zygote… I get a “Non-differentiable function” error. Looking through a flame graph it looks like most of the time seems to be taken up by matmul operations so nothing sticks out as unexpected.
  4. Most of what we do is on GPU – so I’m glad to hear we should expect PyTorch performance on GPU (I haven’t had a chance to test it out as our lab requires cuda v9 for some old TensorFlow code so I can’t use the latest version of CUDA.jl or Flux on our GPUs – this will take a bit of messing around on my end).
  5. Both were using Float64 (Gaussian process stuff is less of a hassle with at this precision).

Thanks for being blunt with the performance of Flux on CPU. I’m glad to hear there are some exciting changes coming down the pipe.

One more dumb question: where is the best place to learn what changes are coming to Flux & NNLib?

2 Likes

You can watch the repository on Github and see PRs.

1 Like