Slower @threads than serial for array computations

foobar_lv2 · April 27, 2019, 12:48am

Exactly.

As it appears, certain common julia workloads like filling of newly allocated large arrays, appear to profit a lot (~3x) from always. For whatever reason, you already had the “correct” config for julia, while I had it wrong, leading me down this rabbit hole. In a perfect world, madvise would be the optimal setting: Applications that profit from THP can request it, and applications like e.g. mongodb can simply not request THP, and everybody is happy.

Julia unfortunately fails to request THP via madvise, so it is better to override it system-wide. The mongodb people probably lack control over the madvise setting in their own application (likely because they use system malloc instead of mman.h, just like julia). Otherwise they would have recommended madvise as the kernel setting. What a mess!

dataDiver · April 29, 2019, 7:57am

@tkoolen BTW we don’t need StaticArrays or muladd, this fairly idiomatic Julia code runs at least as fast (and may even be faster by a small margin):

using Base.Threads
using LinearAlgebra

abstract type AbstractKernel end
struct DotProduct <: AbstractKernel end

function kernel(::Type{DotProduct}, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i = 1:length(x)
    @inbounds ret += x[i]*y[i]
  end
  return ret
end

function calculateKernelMatrix(::Type{K}, data) where {K <: AbstractKernel}
    n = size(data, 1)
    mat = zeros(Float64, n, n)
    @threads for j in 1 : n
      @inbounds for i in j : n
        mat[i, j] = kernel(K, data[i], data[j])
      end
    end
    return Symmetric(mat, :L)
end

# Alternative approach: Make the data a vector of vectors
function simulateData(n::Int64, p::Int64)::Array{Array{Float64, 1}}
  data::Array{Array{Float64, 1}} = Array{Array{Float64, 1}}(undef, 0)
  for i in 1:n
    push!(data, rand(Float64, p))
  end
  return data
end

n = 10_000; p = 32
x = simulateData(n, p);

using BenchmarkTools
@btime calculateKernelMatrix(DotProduct, $x)
``

tkoolen · April 29, 2019, 12:16pm

Hmm, interesting, I can reproduce. It looks like the regular-Vector version has a branch that unrolls the loop four times (in addition to using 4-wide SIMD instructions), while the StaticVector version doesn’t do that; that could explain the difference. Not sure why. By the way, in

  @inbounds @simd for i = 1:length(x)
    @inbounds ret += x[i]*y[i]
  end

the second @inbounds isn’t necessary (I left it in there by mistake in my version).

Also, for readability I’d use Vector{Float64} instead of Array{Float64, 1} (just an alias for the latter), and you can use the comprehension [rand(p) for _ = 1 : n] to replace simulateData.

dataDiver · April 29, 2019, 4:39pm

@tkoolen regarding the misplaced @inbounds I copied without checking, and thanks again for the advice!

dataDiver · May 6, 2020, 11:51pm

I guess I’m resurrecting this topic to see if the Julia community can squeeze more performance out of this Kernel algorithm. I’ve written the equivalent algorithm in D and I am getting stunning outputs in comparison to Julia’s.

To Recap, I’m doing kernel matrix computations, for example here are some kernel types:

using Base.Threads: @threads, @spawn
using Random: shuffle!

# Kernel Methods
#===============#
abstract type AbstractKernel end

struct DotProduct <: AbstractKernel end
function kernel(K::DotProduct, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i = 1:length(x)
    ret += x[i]*y[i]
  end
  return ret
end

struct Gaussian{T} <: AbstractKernel
  gamma::T
end
function kernel(K::Gaussian, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i in 1:length(x)
    temp = x[i] - y[i]
    ret -= temp * temp
  end
  return exp(K.gamma * ret)
end

Here’s is the kernel function I’m currently using:

function calculateKernelMatrix(K::AbstractKernel, data::Array{T}) where {T <: AbstractFloat}
  n = size(data)[2]
  mat::Array{T, 2} = zeros(T, n, n)
  @threads for j in 1:n
    @inbounds for i in j:n
      mat[i, j] = kernel(K, data[:, i], data[:, j])
      mat[j, i] = mat[i, j]
    end
  end
  return mat
end

The timings:

x = rand(Float32, (784, 10_000));
calculateKernelMatrix(DotProduct(), x[:, 1:10]); # for compilation
calculateKernelMatrix(Gaussian(Float32(1.0)), x[:, 1:10]); # for compilation
@time mat1 = calculateKernelMatrix(DotProduct(), x);
# 38.315531 seconds (100.01 M allocations: 304.387 GiB, 23.55% gc time)
@time mat2 = calculateKernelMatrix(Gaussian(Float32(1.0)), x);
# 38.029511 seconds (100.01 M allocations: 304.387 GiB, 24.31% gc time)

D does the same computations in about 1.5 - 1.6 seconds, I wonder if Julia can do any better. I also tried using LoopVectorization: @avx but it didn’t make much difference, maybe I wasn’t using it correctly. I also tried running Julia with julia -O3 --check-bounds=no but it made no difference.

Many Thanks

P.s. Chapel does it in ~ 9 seconds.

stillyslalom · May 7, 2020, 12:56am

The number of allocations (and the fraction of time spent on garbage collection) should throw up some big red flags. Unless you use @views, each data[:, i] array slice creates a copy, which generates a ton of memory pressure. You may also run into inlining problems if your inner loop is a separate function, although it didn’t seem to have much impact in this case.

using LinearAlgebra, LoopVectorization

@inline function dotkernel(x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N}
    ret = zero(T)
    m = length(x)
    @avx for k in 1:m
        ret += x[k] * y[k]
    end
    return ret
end

@inbounds function kernelfn(kf::F, data::Array{T,2}) where {F, T}
    m, n = size(data)
    mat = zeros(T, n, n)
    @threads for j in 1:n
        @views for i in j:n
            mat[i,j] = kf(data[:, i], data[:, j])
        end
    end
    return Symmetric(mat, :L)
end

@time mat1 = calculateKernelMatrix(DotProduct(), x); # 204.092815 seconds (100.01 M allocations: 304.387 GiB, 42.29% gc time)

@time kernelfn(dotkernel, x) # 7.082682 seconds (165 allocations: 381.486 MiB, 1.69% gc time)

dataDiver · May 7, 2020, 1:29am

Many thanks! Using your new approach on my machine I’m now getting ~ 2.4 seconds for the calculations. As you said improvements are due to inlining and using views. At the moment, I’m not seeing a difference when substituting @inbounds @simd for @avx. Also passing function kf::F instead of K::AbstractKernel doesn’t change the performance - even though at the moment your version has hard coded dotkernel in your kernelfn function - it doesn’t see to change the performance - probably be cause the compiler takes care of it.

Topic		Replies	Views
Poor performance while multithreading (Julia 1.0) Performance multithreading	28	4036	February 11, 2019
Data structures for threaded computing Performance	23	2922	October 23, 2019
What's the problem with this simple multi-thread code? General Usage question	17	1190	March 11, 2022
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	3014	July 16, 2018
Why doesn't multithreading help here? Performance	12	1461	August 22, 2020

Slower @threads than serial for array computations

Related topics