Having issues speeding up code with multithreading

RobertGregg · July 12, 2023, 5:28am

I’m trying to speed up some code that takes subsets of columns from a data matrix and calculates a maximum score. The orginal code is complicated, so I’ve tried to simplfy it was much as possible here:

using Combinatorics

#Helper function to loop through pairs of data columns
allpairs(v) = Iterators.filter(i -> isless(i...), Iterators.product(v,v))

function maxScore(data)

    bestScore = 0.0

    for (i,j) in allpairs(axes(data,2))
        
        if isodd(i+j)
            currentScore = findscore(data,i,j)

            if bestScore < currentScore
                bestScore = currentScore
            end
        end
    end

    return bestScore
end

function findscore(data, i, j)
    
    bestScoreSubset = 0.0

    #Create a somewhat small subset
    subsets = mod1(i+j,10)

    for subset in powerset(1:subsets)

        currentScoreSubset = score(data,subset,j)

        if bestScoreSubset < currentScoreSubset
            bestScoreSubset = currentScoreSubset
        end

    end

    return bestScoreSubset
end

function score(data,subset,j)

    if isempty(subset)
        return 0.0
    end
    
    X = view(data,:,subset)
    y = view(data,:,j)

    b = X \ y

    ŷ = X*b

    return sum((yᵢ - ŷᵢ)^2 for (yᵢ, ŷᵢ) in zip(y,ŷ))
end

I tried using ThreadsX to parallelize the maxScore() function, but it actually made things slower.

ThreadsX version of maxScore

using ThreadsX

function maxScore_parallel(data)

    bestScore = ThreadsX.mapreduce(max, allpairs(axes(data,2))) do (i,j)
        if isodd(i+j)
            findscore(data,i,j)
        else
            0.0
        end
    end

    return bestScore
end

Here are some benchmarks as well

Benchmarks

julia> using Random, BenchmarkTools

julia> Threads.nthreads()
14

julia> Random.seed!(1);

julia> data = rand(100,20);

julia> @btime maxScore($data)
  174.879 ms (612340 allocations: 990.52 MiB)
19.720119456710123

julia> @btime maxScore_parallel($data)
  906.113 ms (622461 allocations: 990.90 MiB)
19.720119456710123

Any recommendations for improving parallelization would be much appreciated

LaurentPlagne · July 12, 2023, 6:39am

My 2 cents : you should try to remove all allocations from your computation before its parallelization.
For example, I guess that this operation

allocates.

gdalle · July 12, 2023, 7:26am

I wonder OP might be better off doing X \ Y, solving for all the columns at once.

LaurentPlagne · July 12, 2023, 8:00am

Actually I did not try to understand the code.

My experience is that it is really difficult to obtain interesting Speed-Ups with allocating tasks.

RobertGregg · July 12, 2023, 2:40pm

I definitely agree that avoiding allocations would helpful. In this case however I think that would be difficult because X changes size with every iteration, and both X and y change values. In the original code this size change is unpredictable.

Maybe there’s a good way allocate space for something variable in size but with a known maximum size?

I think I would also have to worry about race conditions with the allocated space.

LaurentPlagne · July 12, 2023, 2:51pm

If you have an idea of the max size you may allocate for the max size. This workspace should be private for each task… Not trivial.

If the max matrix size is small (less than 100x100) then StaticArrays could be an interesting option.

gdalle · July 12, 2023, 2:57pm

RobertGregg · July 12, 2023, 3:21pm

I think understand what you’re proposing but I’m not 100% sure. Do you mean solving X \ y for every column subset and saving the score? That would create a 2^N by N matrix which is probably only feasible for small problems.

Do you think the linear solve is causing ThreadsX.mapreduce to be slower?

LaurentPlagne · July 12, 2023, 3:37pm

FWIW your code exhibits quite different perf on my laptop (M1 Max). You could try to reduce the number of threads on your machine.

julia> include("src/titi.jl")
Threads.nthreads() = 8
  101.412 ms (639420 allocations: 1.02 GiB)  
  45.493 ms (642679 allocations: 1.02 GiB)    #ThreadsX
19.72011945671012

LaurentPlagne · July 12, 2023, 3:44pm

And all the time is spent in allocation required by the linear solve:

RobertGregg · July 12, 2023, 5:14pm

Interesting! I lowered the number of threads to 8 and got a very similar result. I even tried incrementing the number of threads and it looks like going from 1 to 2 threads about halves the time, but after that you get about the same result. It looks like there’s a point where too many threads just tanks performance.

gdalle · July 12, 2023, 6:24pm

Actually I was wrong, I thought only the y changed but since X changes too it’s harder

Definitely!
As the profiling by @LaurentPlagne shows you, most of the time is spent allocating memory (the yellow tiles). And multithreading works badly with memory-intensive codes.
On the bright side, it means that you have a wide margin of improvement accessible by simply optimizing your sequential code, without even worrying about parallelism.
And this improvement will in turn make multithreading more efficient.

I think your best bet for non-allocating linear solvers is this place:

You will probably need to allocate once per each size of X though

Benny · July 12, 2023, 8:47pm

Why is this exactly? Is heap allocation even multithreadable, I’d expect it to happen one at a time to prevent writing to overlapped memory, but I don’t know the actual implementation of Julia’s allocator.

Oscar_Smith · July 12, 2023, 8:52pm

There are two reasons. The first is that before Julia 1.10 (currently an alpha), GC is single threaded, so as you add more threads, you allocate faster and all the time ends up in the GC. The second problem is that for larger objects (bigger than a few kb), there is a lock around allocating since you end up with essentially calls to malloc for the allocation.

cgeoga · July 12, 2023, 9:43pm

It would presumably help to squeeze out all the allocations from your score function. I’ve been looking for an excuse to play with @Mason’s neat Bumper.jl as an alternative to writing technical multithreaded code that manually manages local buffers correctly, and this seems like a good opportunity.

Anyways, here is a version of your score function that is almost non-allocating:

function score(data,subset,j)
  isempty(subset) && return 0.0
  X = view(data,:,subset)
  y = view(data,:,j)
  @no_escape begin
    Xv  = alloc(Float64, size(X,1), size(X,2))
    b   = alloc(Float64, size(X,2))
    ŷ   = alloc(Float64, length(y))
    copyto!(Xv, X)
    Xvf = qr!(Xv)    # one alloc
    ldiv!(b, Xvf, y) # three allocs
    mul!(ŷ, X, b)
    sum((yᵢ - ŷᵢ)^2 for (yᵢ, ŷᵢ) in zip(y,ŷ))
  end
end

Emphasis on almost though. I played a little with slightly more thoughtful methods for the ldiv! because it doesn’t really seem like the best way to solve that least squares problem. But methods like ldiv!(Xvf.R, [...]) don’t exist because Xvf is a PtrArray. And that single alloc in the qr! call may also be hard to get rid of, although I bet somebody else knows how to do it. Maybe GenericLinearAlgebra.jl could help?

With that said, getting almost all the allocations out didn’t actually make the call to score itself any faster, and I didn’t bother trying with the threading since the function call still touches the heap a few times. But just passing this along in case somebody else knows how to get those out or has additional thoughts. And because it is pretty neat to seemingly get rid of all those allocations completely for free with Bumper.jl. I should think that getting all those heap allocations out will make your code scale better with threads.

bertschi · July 12, 2023, 9:49pm

Oh no, the GAL (global allocation lock) has infected Julia

nsajko · July 12, 2023, 9:59pm

I think that @Oscar_Smith is just referring to the malloc-internal synchronization.

RobertGregg · July 13, 2023, 1:31am

Wow Bumper.jl is pretty awesome! One thought I had was maybe combining it with QRupdate.jl which allows you to take a qr decomposition of the full dataset and then remove columns without recomputing the entire factorization.

Interestingly, using the score method with bumper did significantly improve performance when multi-threading, despite the similar results you mentioned:

using Random, BenchmarkTools

Threads.nthreads() # 7
Random.seed!(1);
data = rand(100,20);

@btime maxScore($data) #129.537 ms (612340 allocations: 990.52 MiB)
@btime maxScore_parallel($data) #53.769 ms (617649 allocations: 990.71 MiB)
@btime maxScore_parallel_bump($data) #14.049 ms (209736 allocations: 64.15 MiB)

I’ll call a ~10x improvement good for now

Mason · July 13, 2023, 2:19am

I’m really glad Bumper.jl was useful for you! Please do file issues if you run into bugs or have any feature requests.

RobertGregg · July 16, 2023, 6:14am

Update. After learning way to much about QR decomposition, I wrote a replacement for b = X \ y that has zero allocations

using LinearAlgebra

#Loop through lower triangular matrix
#the 1st axis is reversed because we're moving non-zeros *up* above diagonal
lowTriIter(ax1, ax2) = Iterators.filter(i -> first(i)>last(i), Iterators.product(Iterators.reverse(ax1),ax2))
lowTriIter(A::AbstractMatrix) = lowTriIter(axes(A)...)

function qless!(b, X, y, R, Rsub, subset)

    numCol = length(subset)

    #Copy columns over to avoid destroying R
    for (k,j) in enumerate(subset)
        @views Rsub[:,k] = R[:,j]
    end

    #Use givens rotations to zero out values below diagonal
    #Loop through lower triangular portion of Rsub
    @inbounds for (i,j) in lowTriIter(Rsub)        
        if Rsub[i,j] ≠ 0.0
            G,r = givens(Rsub,i-1,i,j)
            Rsub[i-1,j] = r
            Rsub[i,j] = 0.0

            for k in j+1:numCol
                (r1, r2) = Rsub[i-1,k], Rsub[i,k]
                Rsub[i-1,k] = G.c*r1 + G.s*r2
                Rsub[i,k] = -G.s*r1 + G.c*r2
            end
        end
    end

    #Needed to call correct ldiv! method
    Rv = UpperTriangular(view(Rsub,1:numCol,1:numCol))

    #Solve R'R*b = X'y
    mul!(b,X',y)
    ldiv!(Rv',b)
    ldiv!(Rv,b)

    return nothing
end

b, X, y, and subset are the same as before. R is upper-triangular matrix resulting from Q,R=qr(data). Rsub is a matrix where we can copy columns from R based on the subset.

julia> @btime $X\$y
  6.350 μs (41 allocations: 73.56 KiB)
4-element Vector{Float64}:
 0.11971500332311431
 0.22251862677083714
 0.3953695790986452
 0.12839479252581568

julia> @btime qless!($b,$X,$y,$R,$Rsub,$subset)
  2.133 μs (0 allocations: 0 bytes)

julia> b
4-element Vector{Float64}:
 0.11971500332311383
 0.22251862677083722
 0.3953695790986456
 0.12839479252581615

Topic		Replies	Views
Speeding up maximization Performance memory-allocation	4	674	January 24, 2022
Scaling of @threads for "embarrassingly parallel" problem Performance threads	29	1958	January 20, 2023
Data structures for threaded computing Performance	23	2843	October 23, 2019
Poor performance multiplying many (large) matrices multithreaded Performance question , linearalgebra	11	2477	July 13, 2020
Julia multithreading is running slower than serial, can someone please explain why…? Performance multithreading , floops	14	705	March 24, 2023

Having issues speeding up code with multithreading

Related topics