Slower @threads than serial for array computations

I am doing computations on an array using @threads in the Base.Threads package. I am getting VERY slow performance for the threaded version and when the matrix gets very large the threaded version is killed. Here is an example:

using Base.Threads

const printlock = SpinLock();
n = 30_000; x = zeros(Float64, (n, n));

# Serial finishes in 121.503948 sec
@time for i in 1:n
  if i % 100 == 0
    println("Row: $i")
  end
  for j in 1:n
    x[i, j] = i*j
  end
end

# Parallel finishes in 441.639847 sec
@time @threads for i in 1:n
  if i % 100 == 0
    lock(printlock) do
      Core.println("Row: $i")
    end
  end
  for j in 1:n
    x[i, j] = i*j
  end
end

threaded version is killed when n = 100_000, serial still runs. Solutions to this would be much appreciated.

Thank you

Don’t work in global scope: https://docs.julialang.org/en/v1/manual/performance-tips/index.html

Good point, when I wrap them into functions and get rid of the printing, I get faster threading performance:

using Base.Threads

# Serial
function testSerial(n::Int64)
  x = zeros(Float64, (n, n))
  @time for i in 1:n
    # if i % 100 == 0
    #   println("Row: $i")
    # end
    for j in 1:n
      x[i, j] = i*j
    end
  end
end
testSerial(100) # warmup
testSerial(60_000) # 18.629159 sec

# Parallel
# const printlock = SpinLock();
function testThreaded(n::Int64)
  x = zeros(Float64, (n, n))
  @time @threads for i in 1:n
    # if i % 100 == 0
    #   lock(printlock) do
    #     Core.println("Row: $i")
    #   end
    # end
    for j in 1:n
      x[i, j] = i*j
    end
  end
end
testThreaded(100) # warmup
testThreaded(60_000)  # 7.794507

I guess I was attempting to simplify the question with my original query. The actual problem I’m having is over calculating kernel matrices. For some reason, the threaded version it taking WAY longer, and as I said before, it crashes completely when the matrix becomes very large:

using Base.Threads

abstract type AbstractKernel end

# Kernel Dispatch
struct DotProduct <: AbstractKernel end

# DotProduct Version
function kernel(::Type{DotProduct}, x::Array{Float64, 1}, y::Array{Float64, 1})
  return x' * y
end

# Threaded Version
function calculateKernelMatrixThreaded(::Type{K}, data::Array{Float64, 2})::Array{Float64, 2} where {K <: AbstractKernel}
  n::Int64 = size(data)[1]
  kmat::Array{Float64, 2} = zeros(Float64, (n, n))
  @threads for i in 1:n
    for j in i:n
      kij::Float64 = kernel(K, data[i, :], data[j, :])
      kmat[i, j] = kij
      kmat[j, i] = kij
    end
  end
  return kmat
end

# Serial Version
function calculateKernelMatrixSerial(::Type{K}, data::Array{Float64, 2})::Array{Float64, 2} where {K <: AbstractKernel}
  n::Int64 = size(data)[1]
  kmat::Array{Float64, 2} = zeros(Float64, (n, n))
  for i in 1:n
    for j in i:n
      kij::Float64 = kernel(K, data[i, :], data[j, :])
      kmat[i, j] = kij
      kmat[j, i] = kij
    end
  end
  return kmat
end

# Warmup
n = 200; p = 10
x = rand(Float64, n, p);
calculateKernelMatrixSerial(DotProduct, x);
calculateKernelMatrixThreaded(DotProduct, x);

n = 10_000; p = 32
x = rand(Float64, n, p);
# Serial: 11.145330 sec
@time calculateKernelMatrixSerial(DotProduct, x);
# Threaded 90.470192 secs
@time calculateKernelMatrixThreaded(DotProduct, x);

Are you sure that you don’t rather want this?

@noinline function test_foo(n)
x=zeros(Float64, (n, n))
@threads for j=1:n 
for i=1:n 
x[i,j] = i*j
end
end
return x
end

Your inner loop should go over the first index whenever possible. These are adjacent in memory, which is good (you are memory bandwidth constrained, and hence should a large speedup, even single-threaded).

Second, it really sucks for your poor CPU if the same cache-line is accessed by different threads (at least one writing). In the good iteration order, each thread gets a contiguous piece of memory and you have at most nthreads() cache-lines that are in conflict (at the boundary between these memory pieces). In your original iteration order, each thread has n*nthreads() many contiguous regions, and you therefore get n*nthreads() many boundaries where conflicts can occur.

Thanks @foobar_lv2 that’s a great suggestion for the first bit of code. Actually looks like I need to re-read the performance tips, it’s grown a lot since I last saw it. Using the trick with the for loop following the column major definitely helps - in fact the serial is 3.9 sec and threaded is 1.1 sec. Unfortunately, it doesn’t apply to the kernel matrix calculation, the kernel matrix is symmetric so I store in k[i,j] and k[j,i].

You could maybe use the Symmetric type, as in

using LinearAlgebra
kmat = Symmetric(zeros(n, n))

and then fill only the upper triangle of kmat.data. Whether this is a good choice depends on what you’re planning to do with kmat later. A lot of BLAS operations are specialized for symmetric matrices, so if you’re planning to do e.g. matrix multiplication or an Eigenvalue decomposition later, it might be faster. If you’re planning on linearly iterating over the elements, Symmetric may be slower.

1 Like

@tkoolen I guess I was trying to stay away from Symmetric, it was slow when I used it some time ago, perhaps they’ve done work on it, because the serial version looses a couple of seconds from ~11 s to ~9 s with using Symmetric, but it doesn’t do much for the parallel version. I also noticed that you can’t assign to a non-diagonal element in a Symmetric matrix.

function calculateKernelMatrixSerial(::Type{K}, data::Array{Float64, 2})::Symmetric{Float64, Array{Float64, 2}} where {K <: AbstractKernel}
  n::Int64 = size(data)[1]
  mat::Array{Float64, 2} = zeros(Float64, (n, n))
  for j in 1:n
    for i in j:n
      kij::Float64 = kernel(K, data[i, :], data[j, :])
      mat[i, j] = kij
    end
  end
  kmat::Symmetric{Float64, Array{Float64, 2}} = Symmetric(mat, :L)
  return kmat
end
n = 10_000; p = 32
x = rand(Float64, n, p);
# Serial: 9.173580 sec
@time calculateKernelMatrixSerial(DotProduct, x);

It seems that the main issue with threaded version in the kernel calculation still exists - it is still very slow compared with the serial version.

Just FYI, all of those type annotations everywhere aren’t buying you any performance.

1 Like

For what it’s worth, I think you are actually not bandwidth constrained. It is worse:

julia> using BenchmarkTools
julia> n=5000;
julia> @btime fill(-1, n, n);
  89.504 ms (2 allocations: 190.73 MiB)
julia> z=fill(-1, n, n);
julia> @btime fill!(z, -2);
  22.303 ms (0 allocations: 0 bytes)
julia> n*n*8 / 22e-3
9.090909090909092e9

The second number gives you roughly memory bandwidth: about 10 GB/s, which is realistic. So how is it so much more expensive to allocate the array on my linux system?

What I think is happening is that julia sees a large array allocation, punts to glibc / system malloc, and glibc punts to the kernel by calling mmap. This is super fast.

However, I suspect that the new memory map is initially requested with a small / default pagesize (instead of using huge pages), and during the fill we need to fault in a lot of tiny pages.

On the other hand, in the fill! part, the pages have already been faulted in and we simply need to write (not counting TLB misses). If this theory is correct then we are spending most time as overhead in the kernel pagefault handler.

The kernel needs to zero the memory before we can fill it again, but this could at most explain a factor 2 (or 3, depending on linux internals i don’t know: pagefault could also cost a read and a write, from the copy-on-write, plus the write from fill. However, that should be mostly covered by some cache).

3 Likes

@tkoolen yeah, I use explicit ones them because I’m paranoid and the others because it’s part of a larger library. But the explicit ones e.g. Float64 will probably later be parameterized away to allow for any float type, e.g. Float32 etc.

Using @views on the kij line (and widening the kernel signature) is a pretty big win regardless of parallelization by the way.

@foobar_lv2, maybe I’m a bit slow, but how does your analysis relate to this problem? It’s an interesting observation, but I’d think that at least ideally, there shouldn’t be any memory allocation after the initial allocation of mat (or kmat), and runtime is not dominated by allocation. You showed that allocation and filling (in either benchmark) takes on the order of 100 ms, and we’re currently looking at runtimes on the order of seconds.

To me it would seem that the memory contention over data due to the fact that both the (threaded) outer loop and the inner loop access all rows of the matrix is the main impediment to performance improvements due to parallelization. A speedup can be gotten by using a transposed representation of x, which results in better cache locality in the single-threaded case, and additionally less memory contention in the multi-threaded case. Full code, including suggestions made earlier:

using Base.Threads
using LinearAlgebra

abstract type AbstractKernel end
struct DotProduct <: AbstractKernel end

function kernel(::Type{DotProduct}, x::AbstractVector, y::AbstractVector)
    return x' * y
  end

function calculateKernelMatrix(::Type{K}, data::AbstractMatrix) where {K <: AbstractKernel}
    n = size(data, 1)
    mat = zeros(Float64, n, n)
    @threads for j in 1 : n
      for i in j : n
        @views kij = kernel(K, data[i, :], data[j, :])
        @inbounds mat[i, j] = kij
      end
    end
    return Symmetric(mat, :L)
end

n = 10_000; p = 32
x = transpose(rand(Float64, p, n))

using BenchmarkTools
@btime calculateKernelMatrix(DotProduct, $x)

which results in 1.415 s (4 allocations: 762.94 MiB) when using a single thread and 664.076 ms (4 allocations: 762.94 MiB) when using 6 threads on my 6-core laptop.

1 Like

Or a bit faster still by representing x as a vector of SVector{p}'s and optimizing the kernel:

using Base.Threads
using LinearAlgebra
using StaticArrays

abstract type AbstractKernel end
struct DotProduct <: AbstractKernel end

function kernel(::Type{DotProduct}, x::StaticVector{N, T1}, y::StaticVector{N, T2}) where {N, T1, T2}
    # faster than dot(x, y)
    T = StaticArrays.arithmetic_closure(promote_type(T1, T2))
    ret = zero(T)
    @inbounds @simd for i = 1 : N
        @inbounds ret = muladd(x[i], y[i], ret)
    end
    return ret
end

function calculateKernelMatrix(::Type{K}, data) where {K <: AbstractKernel}
    n = size(data, 1)
    mat = zeros(Float64, n, n)
    @threads for j in 1 : n
      @inbounds for i in j : n
        mat[i, j] = kernel(K, data[i], data[j])
      end
    end
    return Symmetric(mat, :L)
end

n = 10_000; p = 32
x = [rand(SVector{p}) for _ = 1 : n]

using BenchmarkTools
@btime calculateKernelMatrix(DotProduct, $x)

492.337 ms (4 allocations: 762.94 MiB).

4 Likes

@tkoolen awesome, thanks a lot. I used a nested array rather than a 2D array for data last night as you have done above, this made a large difference, but the code here looks fully optimised, and also runs even faster.

However I wonder if there is a built in function to convert a Symmetric matrix back to regular matrix. It looks like the Symmetric matrix object contains .data = the original matrix used, and it would be easy to write such a function, but if there is a built-in, that would be preferable.

You can just use an ordinary conversion, either Matrix or Array:

julia> S = Symmetric(rand(5,5))
5×5 Symmetric{Float64,Array{Float64,2}}:
 0.233438  0.93464   0.337675  0.682922    0.289509  
 0.93464   0.452324  0.124545  0.895295    0.235742  
 0.337675  0.124545  0.939066  0.354193    0.678909  
 0.682922  0.895295  0.354193  0.652678    0.00675196
 0.289509  0.235742  0.678909  0.00675196  0.770918  

julia> M = Matrix(S)
5×5 Array{Float64,2}:
 0.233438  0.93464   0.337675  0.682922    0.289509  
 0.93464   0.452324  0.124545  0.895295    0.235742  
 0.337675  0.124545  0.939066  0.354193    0.678909  
 0.682922  0.895295  0.354193  0.652678    0.00675196
 0.289509  0.235742  0.678909  0.00675196  0.770918  

julia> M == S
true
1 Like

It is important to understand what the bottleneck is, and solve that root cause of slowness. Let’s take the example of filling a matrix with some computed values:

julia> @noinline function fillmat_good!(M)
       for j=1:size(M,2)
       for i=1:size(M,1)
       M[i,j] = i*j
       end
       end
       end;

julia> @noinline function fillmat_bad!(M)
       for i=1:size(M,1)
       for j=1:size(M,2)
       M[i,j] = i*j
       end
       end
       end;

julia> @noinline alloc_mat(n)=Matrix{Int}(undef, n, n);

julia> @noinline function alloc_fill_good(n)
       M=alloc_mat(n)
       fillmat_good!(M)
       M
       end;
julia> @noinline function alloc_fill_bad(n)
       M=alloc_mat(n)
       fillmat_bad!(M)
       M
       end

In order to figure out what is how expensive, we test these things out separately. First, array allocation (involving gc pressure):

julia> using BenchmarkTools
julia> n=5000;
julia> @btime alloc_mat(n);
  124.841 μs (2 allocations: 190.73 MiB)

That’s cheap. Next, filling the matrix:

julia> @btime fillmat_good!(M);
  22.680 ms (0 allocations: 0 bytes)
julia> @btime fillmat_bad!(M);
  227.534 ms (0 allocations: 0 bytes)

We observe that the right iteration order is 10x faster. If memory bandwidth was the only concern, we would expect 8x faster, because the bad iteration order wastes 7/8 integer slots in every memory transfer. 8x, 10x, close enough for me, compatible with memory bandwidth limit.

Now let’s observe in combination:

julia> @btime alloc_fill_good(n);
  108.312 ms (2 allocations: 190.73 MiB)
julia> @btime alloc_fill_bad(n);
  323.044 ms (2 allocations: 190.73 MiB)

Ok. The naive viewpoint (alloc_fill costs alloc + fill) is obviously wrong. Something else is going on, and that something is the dominant cost when using the right iteration order.

Note that data cache (M still hot from last iteration in benchmark loop) cannot be the thing, because M is too large for that to matter.

The essential and somewhat unobvious part is that we can get a speed-up of 5x by re-using the matrix, and this is entirely unrelated to garbage collection or allocator overhead (that we measured separately at 0.1 ms). I stated my theory about what the issue is (kernel overhead on too many pagefaults due to too small pagesize chosen by glibc; hot TLB in the benchmark loop is a second possibility), but the proper thing to do would be to perf, strace, and look at the memory map of the julia process.

The entire above post was triggered by me getting confused by seeing only a 3x speedup from the iteration order in alloc_fill, which looked entirely wrong to me (because the mental model of this being bandwidth constrained predicts 8x speedup).

4 Likes

Thanks for the explanation!

Edit: and yeah, I guess we’re now at the point where allocation and associated costs are a significant part of the time spent in the function.

For what it’s worth, I tried the obvious fix for the slowness of filling a large newly allocated array.

julia> function hugetlb_fill(v::T, n) where T
       isbitstype(T) || error("only bitstypes")
       sz = sizeof(T)*n
       ps = 0x0000000000200000
       asz = ( sz & ~(ps-1)) + ps * !iszero(sz & (ps-1))
       ptr = ccall(:mmap, Ptr{Nothing}, (Ptr{Nothing}, Csize_t, Cint, Cint, Cint, Cint), C_NULL, asz, 1|2, 2|0x20|0x40000|(21<<26) , -1, 0)
       res = unsafe_wrap(Array, convert(Ptr{T}, ptr), n)
       finalizer(res) do x
          _sz = sizeof(x)
          _ps = 0x0000000000200000
          _asz = ( _sz & ~(_ps-1)) + _ps * !iszero(_sz & (_ps-1))
          ccall(:munmap, Cint, (Ptr{Nothing}, Csize_t), pointer(x), _asz) end
       fill!(res, v)
       res
       end;
julia> n=25_000_000;
julia> sp=fill(1, n);
julia> hp = hugetlb_fill(1,n);
julia> GC.gc(); GC.enable(false);

julia> @time fill(1, n);
  0.095342 seconds (6 allocations: 190.735 MiB)
julia> @time fill(1, n);
  0.092669 seconds (6 allocations: 190.735 MiB)

julia> @time hugetlb_fill(1, n);
  0.036241 seconds (5 allocations: 240 bytes)
julia> @time hugetlb_fill(1, n);
  0.036939 seconds (5 allocations: 240 bytes)

julia> GC.enable(true); GC.gc();  GC.gc(); 
julia> using BenchmarkTools
julia> @btime fill(1, n);
  102.046 ms (2 allocations: 190.73 MiB)
julia> @btime hugetlb_fill(1, n);
ERROR: ArgumentError: unsafe_wrap: pointer 0xffffffffffffffff is not properly aligned to 8 bytes

We see that the pagesize is indeed a large part of the issue (22 ms memset on resident memory vs 37 ms for memset + pagefaults looks ok to me, and much better than the 90-100 ms using tiny pages).

Unfortunately, julia does not use the mmapped memory in its used space accounting, which leads to the garbage collector not running often enough, hence the finalizer / munmap not getting triggered often enough and hence the error (kernel ran out of huge pages and the mmap failed, returning -1).

edit: Ok, maybe my fault. # echo always >/sys/kernel/mm/transparent_hugepage/enabled fixed the performance problem for me (I had it on madvise, which julia doesn’t use).

2 Likes

EDIT: Okay, to my embarrassment, I didn’t know how to read these config files. If I understand correctly, there are three options, with brackets surrounding the currently selected option.

$ cat /sys/kernel/mm/transparent_hugepage/enabled 
[always] madvise never