OpenBLAS + Flux bottlenecks?

I’ve been reimplementing some of my PyTorch code in Julia as a way to learn the language, and I came across some behaviour I don’t understand when using Flux for gradient-based optimization on a CPU. As a practice exercise I was implementing type-II maximum likelihood for GP regression (so low-dimensional optimization with large linear algebra operations for each loss function evaluation).

If I set the number of OpenBLAS threads to maximize CPU usage (via BLAS.set_num_threads(12) in my case), my Julia program uses all of my computer’s resources while optimizing for a fixed number of iterations, as expected.

If I set the number of OpenBLAS threads to 1, my Julia program uses a single CPU core, as expected, but completes the job in about 80% of the time. (Note that this is still about 15% slower than my PyTorch implementation, which runs on all CPU cores.)

This makes me think I’ve misunderstood something about how best to use Julia + Flux. Are there bottlenecks in the OpenBLAS–Julia interface that are being actively worked on?

I’ve included my code below in case this is the result of a newbie mistake (which it very well might be!) :slight_smile:

using Flux 
using LinearAlgebra
using Plots

"""
Computes the distance between each coordinate for each point in x
Returns a matrix whose dimension is n+1 where n is the dimension of x

Δ(x::AbstractArray{T,2}) -> AbstractArray{T,3}
Δ(x::AbstractArray{T,1}} -> AbstractArray{T,2}
"""
function Δ(x1::AbstractArray{T,2}, x2::AbstractArray{T,2}) where T
    s1, s2 = size(x1)
    t1, t2 = size(x2)
    return reshape(x1, (s1, s2, 1)) .- reshape(x2, (t1, 1, t2)) 
end
function Δ(x1::AbstractArray{T,1}, x2::AbstractArray{T,1}) where T
    n1 = length(x1)
    n2 = length(x2)
    return reshape(Δ(reshape(x1, 1, n1), reshape(x2, 1, n2)), (n1, n2))
end
function Δ(x::AbstractArray)
    return Δ(x, x)
end 

"""
Squared exponential kernel with length scale ℓ and σf = exp(logσf)
"""
struct sekernel{T <: AbstractArray}
    ℓ::T
    logσf::T
end
Flux.@functor sekernel
sekernel(ℓ::AbstractFloat, logσf::AbstractFloat) = sekernel([ℓ], [logσf])

function (k::sekernel)(x1::AbstractArray{T,2}, x2::AbstractArray{T,2}) where T 
    out = exp.(k.logσf).^2 .* exp.(-sum(Δ(x1, x2).^2, dims=1) ./ (2 * k.ℓ.^2))
    return out[1,:,:]
end
function (k::sekernel)(x1::AbstractArray{T,1}, x2::AbstractArray{T,1}) where T
    return k(reshape(x1, 1, length(x1)), reshape(x2, 1, length(x2)))
end
function (k::sekernel)(x::AbstractArray)
    return k(x, x)
end

"""
Gaussian process container 
x : training inputs, f : training targets, k : kernel struct, logσn : i.i.d noise 
"""
struct GP{U,S <: AbstractArray,T <: AbstractArray,I <: AbstractArray}
    x::S
    f::T
    k::U
    logσn::I
end
Flux.@functor GP
function GP(x::AbstractArray, f::AbstractArray, logσn::AbstractFloat, k)
    return GP(x, f, k, [logσn])
end
Flux.trainable(g::GP) = (g.logσn, g.k)
function (g::GP)(xs::AbstractArray)
    μ, Σ = infer(g.f, g.k(g.x) + I * exp.(g.logσn[1]), g.k(g.x, xs), g.k(xs))
    return (μ, Σ)
end

""" 
Function for performing inference with a GP 
"""
function infer(f::AbstractArray{T},
                  K11::AbstractArray{T,2}, 
                  K12::AbstractArray{T,2}, 
                  K22::AbstractArray{T,2}) where T
    C = cholesky(K11 + I * 1e-12) # adding 1e-12 for numerical stability
    K21 = transpose(K12)
    μ = K21 * (C \ f)
    Σ = K22 - K21 * (C \ K12)
    return (μ, Σ)
end

"""
Computes the log_mll for a Gaussian process 
"""
function log_mll(f::AbstractArray{T}, K11::AbstractArray{T,2}) where T
    C = cholesky(K11)
    -0.5 * f' * (C \ f) - 0.5 * logdet(C) - length(f) / 2 * log(2 * π)
end
function log_mll(g::GP)
    return log_mll(g.f, g.k(g.x) + I * exp.(g.logσn[1]))
end

"""
type-ii maximum likelihood training of GP 
"""
function train_gp!(model::GP, epochs::Integer, lr=1e-1)
    opt = Flux.Optimise.ADAM(lr)
    ps = Flux.params(model)
    for i in 1:epochs
        gs = gradient(ps) do 
            return -log_mll(model)
        end
        Flux.Optimise.update!(opt, ps, gs)
        if i % 1000 == 0
            @info "Epoch $i | len $(model.k.ℓ)  | σf $(exp.(model.k.logσf)) | σn $(exp.(model.logσn)) | lml $(log_mll(model))"
        end
    end
end

function main()
    x = collect(LinRange(0., 10., 100))
    y_true = x + sin.(x) * 5 .- 10.0 
    y = y_true + randn(length(x)) * 2.0

    model = GP(x, y, -4.0, sekernel(0.1, -4.0)) 
    train_gp!(model, 5000)

    xt = collect(LinRange(-0., 10., 200))
    μ, Σ = model(xt)
    σ = 2 * sqrt.(diag(Σ))

    p = scatter(x, y, label="Measurements", alpha=0.5, xlabel="x", ylabel="y")
    p = plot!(x, y_true, linestyle=:dash, label="Generating Func.", linewidth=2)
    p = plot!(xt, μ, ribbon=σ, label="Exact GP", linewidth=2)
end

MKL’s threading is much more efficient than OpenBLAS’s, especially when arrays are relatively small.
I recommend using MKL.jl to swap your default BLAS library to MKL, even if you’re on Ryzen.
Some reports suggest recent versions of MKL have great performance on Ryzen, and do make use of AVX2 and FMA. (EDIT: Actually, IIRC, the performance was only good in double precision but bad in single. So if you’re on Ryzen, I’d test Float32 and switch back to OpenBLAS if that hasn’t been fixed.)
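If it helps, here’s a minimal sketch of how to check which backend is active after the swap (this assumes a recent Julia, 1.7+, where MKL.jl replaces the BLAS/LAPACK backend at load time via libblastrampoline):

using MKL              # swaps the default BLAS/LAPACK backend to MKL
using LinearAlgebra

BLAS.get_config()      # should now list MKL among the loaded libraries
BLAS.set_num_threads(12)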

Additionally, if possible, I recommend setting your BLAS to be single-threaded and then splitting your training data into batches. You can then evaluate those batches in parallel using Julia’s threading, along the lines of the sketch below.
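Something like this (a rough sketch only; the batch splitting and the loss function names here are made up, and you’d still need to hook it into your gradient computation):

using LinearAlgebra

BLAS.set_num_threads(1)          # keep each BLAS call on a single core

# `batches` is assumed to be a vector of (x, y) tuples you have split yourself.
function total_loss(batches, loss_fn)
    partial = zeros(length(batches))
    Threads.@threads for i in eachindex(batches)
        x, y = batches[i]
        partial[i] = loss_fn(x, y)   # each batch's linear algebra runs on its own Julia thread
    end
    return sum(partial)
end

Start Julia with multiple threads (e.g. JULIA_NUM_THREADS=12) for this to actually run in parallel.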


Thanks for the suggestions, I appreciate you taking the time to respond. I’m on Intel, so switching to MKL sped things up!

I’m still a bit confused about why running with 1 BLAS thread is faster than running with 12 when using Flux (whereas PyTorch seems to maximize thread usage with slightly better results; as far as I know, PyTorch uses MKL for its CPU matrix operations as well).

EDIT: To explain further, the most expensive operation in the program is computing a Cholesky decomposition, which scales as ~O(n^3) where n is the dataset size. This is something I would expect to benefit from multiple BLAS threads.

There are a lot of constraints on the order of operations, making it much harder to parallelize than gemm.
To elaborate, we can imagine solving for an upper triangular factor blockwise, by dividing it into M block rows/columns.
Before solving for anything else, we need to factor the first diagonal block U_{1,1} = cholesky(S_{1,1}) – serially.
Now we can solve U_{1,m} = U_{1,1}' \ S_{1,m} in parallel for m = 2,...,M.
Then we have to calculate U_{2,2} = cholesky(S_{2,2} - U_{1,2}'*U_{1,2}).*
Which we can then use to solve U_{2,m} = U_{2,2}' \ (S_{2,m} - U_{1,2}'*U_{1,m}) in parallel for m = 3,...,M.
Etc. As we finish more and more diagonal blocks, the range m = i,...,M gets shorter, and there aren’t many blocks left to calculate in parallel.

In short, the problem is that if the blocks are too small, we spend all our time on communication, because every one of these diagonal blocks requires synchronization between cores.
If the blocks are too big, then we won’t be able to utilize all our cores on those triangular solves for a substantial chunk of, or even all of, the factorization. (I’ve included a rough sketch of this blocked scheme after the footnote below.)

  • Actually, I’d have each thread calculate S_{m,m} - U_{1,m}'*U_{1,m} in parallel, but the individual Cholesky factorizations of the diagonal blocks still have to happen serially.
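To make that concrete, here’s a rough, unthreaded sketch of a right-looking blocked Cholesky (my own illustration, not how OpenBLAS implements it); step (1) is the serial bottleneck, while the solves in (2) and the updates in (3) are the parts that could be farmed out to threads:

using LinearAlgebra

# Computes the upper factor U with U'U = S, overwriting the upper triangle of S.
function blocked_cholesky!(S::AbstractMatrix{<:Real}, bs::Int = 64)
    n = size(S, 1)
    for j in 1:bs:n
        J = j:min(j + bs - 1, n)
        # (1) Serial step: factor the diagonal block in place.
        cholesky!(Symmetric(view(S, J, J)))
        Ujj = UpperTriangular(view(S, J, J))
        # (2) Panel solve: the trailing block columns are independent of each other.
        for k in (last(J) + 1):bs:n
            K = k:min(k + bs - 1, n)
            ldiv!(Ujj', view(S, J, K))          # U[J,K] = U[J,J]' \ S[J,K]
        end
        # (3) Trailing update: S[K,L] -= U[J,K]' * U[J,L] (also parallelizable).
        for k in (last(J) + 1):bs:n
            K = k:min(k + bs - 1, n)
            for l in k:bs:n
                L = l:min(l + bs - 1, n)
                mul!(view(S, K, L), view(S, J, K)', view(S, J, L),
                     -one(eltype(S)), one(eltype(S)))
            end
        end
    end
    return UpperTriangular(S)
end

# Quick check on a positive definite A:  U = blocked_cholesky!(copy(A), 50);  U' * U ≈ A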

FWIW, I do see a fairly good performance improvement:

julia> using LinearAlgebra, BenchmarkTools

julia> S = randn(10_000, 12_000) |> x -> x * x'; # Wishart;

julia> S2 = similar(S);

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     3.145 s (0.00% GC)
  median time:      3.148 s (0.00% GC)
  mean time:        3.148 s (0.00% GC)
  maximum time:     3.151 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1

julia> BLAS.set_num_threads(18);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     377.574 ms (0.00% GC)
  median time:      379.544 ms (0.00% GC)
  mean time:        380.256 ms (0.00% GC)
  maximum time:     385.087 ms (0.00% GC)
  --------------
  samples:          14
  evals/sample:     1

julia> @benchmark copyto!($S2, $S) # copyto! is slow
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.912 ms (0.00% GC)
  median time:      80.027 ms (0.00% GC)
  mean time:        80.050 ms (0.00% GC)
  maximum time:     80.408 ms (0.00% GC)
  --------------
  samples:          63
  evals/sample:     1

However, these matrices are much larger than yours. If they’re 200x200:

julia> S = randn(200, 300) |> x -> x * x'; # Wishart;

julia> S2 = similar(S);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 18 threads
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     92.795 μs (0.00% GC)
  median time:      142.325 μs (0.00% GC)
  mean time:        142.279 μs (0.00% GC)
  maximum time:     223.126 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> BLAS.set_num_threads(1);

julia> @benchmark cholesky!(Symmetric(copyto!($S2, $S))) # 1 thread
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  2
  --------------
  minimum time:     76.469 μs (0.00% GC)
  median time:      77.018 μs (0.00% GC)
  mean time:        77.222 μs (0.00% GC)
  maximum time:     170.254 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

At that size I also see very little improvement from multithreading (if anything, the single-threaded version is faster).

You could call Symmetric or Hermitian on the matrices before calling cholesky to skip the symmetry check.
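For example, in the infer function from the original post that would look like the one-liner below (Symmetric uses the upper triangle and skips the ishermitian check that cholesky otherwise performs on a plain dense matrix):

C = cholesky(Symmetric(K11 + I * 1e-12))   # no symmetry check, reads only the upper triangle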


Thank you for your extremely thorough response, greatly appreciated! Take care