Slower @threads than serial for array computations

Exactly.

As it turns out, certain common Julia workloads, like filling newly allocated large arrays, profit a lot (~3x) from the always setting. For whatever reason, you already had the “correct” config for Julia, while I had it wrong, which led me down this rabbit hole. In a perfect world, madvise would be the optimal setting: applications that profit from THP can request it, applications like e.g. MongoDB can simply not request it, and everybody is happy.

Julia unfortunately fails to request THP via madvise, so it is better to override the setting system-wide. The MongoDB people probably lack control over the madvise setting in their own application (likely because they use the system malloc instead of mman.h, just like Julia); otherwise they would have recommended madvise as the kernel setting. What a mess!
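For reference, the active system-wide THP mode can be checked from Julia on Linux by reading the sysfs file directly (the bracketed entry is the current setting); this is just a plain file read, not a Julia API:

println(read("/sys/kernel/mm/transparent_hugepage/enabled", String))
# prints e.g.: always [madvise] never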


@tkoolen BTW, we don’t need StaticArrays or muladd; this fairly idiomatic Julia code runs at least as fast (and may even be faster by a small margin):

using Base.Threads
using LinearAlgebra

abstract type AbstractKernel end
struct DotProduct <: AbstractKernel end

function kernel(::Type{DotProduct}, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i = 1:length(x)
    @inbounds ret += x[i]*y[i]
  end
  return ret
end

function calculateKernelMatrix(::Type{K}, data) where {K <: AbstractKernel}
    n = size(data, 1)
    mat = zeros(Float64, n, n)
    @threads for j in 1 : n
      @inbounds for i in j : n
        mat[i, j] = kernel(K, data[i], data[j])
      end
    end
    return Symmetric(mat, :L)
end

# Alternative approach: Make the data a vector of vectors
function simulateData(n::Int64, p::Int64)::Array{Array{Float64, 1}}
  data::Array{Array{Float64, 1}} = Array{Array{Float64, 1}}(undef, 0)
  for i in 1:n
    push!(data, rand(Float64, p))
  end
  return data
end

n = 10_000; p = 32
x = simulateData(n, p);

using BenchmarkTools
@btime calculateKernelMatrix(DotProduct, $x)

Hmm, interesting, I can reproduce. It looks like the regular-Vector version has a branch that unrolls the loop four times (in addition to using 4-wide SIMD instructions), while the StaticVector version doesn’t do that; that could explain the difference. Not sure why. By the way, in

  @inbounds @simd for i = 1:length(x)
    @inbounds ret += x[i]*y[i]
  end

the second @inbounds isn’t necessary (I left it in there by mistake in my version).
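In case anyone wants to verify the unrolling difference mentioned above, the generated code can be inspected directly; a minimal sketch, assuming the kernel definition and vector length from the post above:

x, y = rand(32), rand(32)
@code_native kernel(DotProduct, x, y)  # look for the 4x-unrolled SIMD loop body
# compare against the output for the StaticVector variant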

Also, for readability I’d use Vector{Float64} instead of Array{Float64, 1} (the former is just an alias for the latter), and you can use the comprehension [rand(p) for _ = 1 : n] to replace simulateData, as sketched below.
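A minimal sketch of that simplification (it produces the same layout as simulateData):

n = 10_000; p = 32
x = [rand(p) for _ in 1:n]  # Vector{Vector{Float64}}: n vectors of length p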

@tkoolen regarding the misplaced @inbounds: I copied it without checking. Thanks again for the advice!


I guess I’m resurrecting this topic to see if the Julia community can squeeze more performance out of this kernel algorithm. I’ve written the equivalent algorithm in D and am getting stunning timings in comparison to Julia’s.

To recap, I’m doing kernel matrix computations; for example, here are some kernel types:

using Base.Threads: @threads, @spawn
using Random: shuffle!

# Kernel Methods
#===============#
abstract type AbstractKernel end

struct DotProduct <: AbstractKernel end
function kernel(K::DotProduct, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i = 1:length(x)
    ret += x[i]*y[i]
  end
  return ret
end

struct Gaussian{T} <: AbstractKernel
  gamma::T
end
function kernel(K::Gaussian, x, y)
  ret::eltype(x) = 0
  @inbounds @simd for i in 1:length(x)
    temp = x[i] - y[i]
    ret -= temp * temp
  end
  return exp(K.gamma * ret)
end

Here is the kernel matrix function I’m currently using:

function calculateKernelMatrix(K::AbstractKernel, data::Array{T}) where {T <: AbstractFloat}
  n = size(data)[2]
  mat::Array{T, 2} = zeros(T, n, n)
  @threads for j in 1:n
    @inbounds for i in j:n
      mat[i, j] = kernel(K, data[:, i], data[:, j])
      mat[j, i] = mat[i, j]
    end
  end
  return mat
end

The timings:

x = rand(Float32, (784, 10_000));
calculateKernelMatrix(DotProduct(), x[:, 1:10]); # for compilation
calculateKernelMatrix(Gaussian(Float32(1.0)), x[:, 1:10]); # for compilation
@time mat1 = calculateKernelMatrix(DotProduct(), x);
# 38.315531 seconds (100.01 M allocations: 304.387 GiB, 23.55% gc time)
@time mat2 = calculateKernelMatrix(Gaussian(Float32(1.0)), x);
# 38.029511 seconds (100.01 M allocations: 304.387 GiB, 24.31% gc time)

D does the same computations in about 1.5 - 1.6 seconds; I wonder if Julia can do any better. I also tried using LoopVectorization: @avx, but it didn’t make much difference - maybe I wasn’t using it correctly. I also tried running Julia with julia -O3 --check-bounds=no, but it made no difference.

Many Thanks

P.S. Chapel does it in ~9 seconds.

The number of allocations (and the fraction of time spent on garbage collection) should throw up some big red flags. Unless you use @views, each data[:, i] array slice creates a copy, which generates a ton of memory pressure. You may also run into inlining problems if your inner loop is a separate function, although it didn’t seem to have much impact in this case.
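To make the copy-vs-view point concrete, here is a small illustrative benchmark (sizes chosen to match the example above):

using BenchmarkTools
data = rand(Float32, 784, 10_000);
@btime sum($data[:, 1]);        # the slice copies, allocating a fresh 784-element vector
@btime sum(view($data, :, 1));  # the view avoids the copy (typically no allocation)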

using Base.Threads: @threads
using LinearAlgebra, LoopVectorization

@inline function dotkernel(x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N}
    ret = zero(T)
    m = length(x)
    @avx for k in 1:m
        ret += x[k] * y[k]
    end
    return ret
end

@inbounds function kernelfn(kf::F, data::Array{T,2}) where {F, T}
    m, n = size(data)
    mat = zeros(T, n, n)
    @threads for j in 1:n
        @views for i in j:n
            mat[i,j] = kf(data[:, i], data[:, j])
        end
    end
    return Symmetric(mat, :L)
end

@time mat1 = calculateKernelMatrix(DotProduct(), x); # 204.092815 seconds (100.01 M allocations: 304.387 GiB, 42.29% gc time)

@time kernelfn(dotkernel, x) # 7.082682 seconds (165 allocations: 381.486 MiB, 1.69% gc time)

Many thanks! Using your new approach on my machine I’m now getting ~2.4 seconds for the calculations. As you said, the improvements are due to inlining and using views. At the moment, I’m not seeing a difference when substituting @inbounds @simd for @avx. Also, passing the function as kf::F instead of K::AbstractKernel doesn’t seem to change the performance - even with dotkernel passed directly in your kernelfn call - probably because the compiler takes care of it.