Speeding up a function

Hi all,

I have a function that needs some speeding up. I have tried many things; here is the current state together with a minimal working example:

using BenchmarkTools, Random, Distributions, Base.Threads


Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    # numerator exp(nu + delta), written into the preallocated mat_1
    @views num = (mat_1 .= exp.(randvar_nu .+ delta))
    # normalize each market block by its column sums (+1 in the denominator)
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    vec_1 .= mean(num, dims = 2)
    # same normalization for the inattention draws
    num .= exp.(randvar_nu_inattention .+ delta)
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    # mix the two predictions with the logistic weight exp(nu_bern)/(1+exp(nu_bern))
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@btime predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

I get this output:

[screenshot of the @btime output omitted]

Some notes:

  • replace 35,000 with 10,000,000 if you like; that is the size I eventually have to handle.
  • market_ids: right now it is just 1:1000 repeated, but that is not what it will be. The markets will have different sizes, so a hard-coded loop over 1000 equal-sized markets will not work.
  • I have nthreads() = 8 right now; I expect the code to eventually run on a computer with 260+ cores.
  • mat_1 and vec_1 have no purpose other than being constructed once: this function is run many times, so preallocating them reduces the overall allocations.

Let me know if you have any ideas. Using DataFrames to subset the data seems to be slower, and I could not get improvements with Tullio or LoopVectorization, but I suspect that is on me.

Many Thanks!

1 Like

Just one little thought, as I am currently in a hurry: you can probably speed it up by computing market_ids .== i only once in each of the threaded loops, like

    @threads for i in 1:length(market_ids)
        mask = market_ids .== i
        @views num[mask,:] .= num[mask, :] ./ (sum(num[mask,:],dims = 1) .+1)
    end
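
A possible further refinement of this idea (my own sketch, not part of the reply above): since the function is run many times with the same market_ids, the rows belonging to each market can be precomputed once, outside the function, and reused on every call. This would also handle markets of unequal sizes:

    # Hypothetical: build the row-index groups once, up front
    groups = [findall(==(m), market_ids) for m in unique(market_ids)]

    # the threaded loop inside the function then becomes
    @threads for rows in groups
        @views num[rows, :] .= num[rows, :] ./ (sum(num[rows, :], dims = 1) .+ 1)
    end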
2 Likes

[screenshot of the new @btime output omitted]
This already gave a 2x speedup for the 35,000-row case.

Do we have other ideas?

You may:

  • put all the computation inside a function (no global variables)
  • optimize a sequential version first (get rid of @threads) until you have eliminated all allocations
  • profile the code with @profview (and profile allocations too)
  • once you have eliminated all allocations (preallocation, loops, …), reintroduce multi-threading
  • use $ to interpolate the @benchmark arguments (see the snippet below)
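
On the last point: without $, BenchmarkTools benchmarks the arguments as untyped globals, which inflates both the time and the allocation counts. Interpolating splices the values in as locals:

    # benchmarked as globals (pessimistic):
    @btime predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

    # benchmarked as locals (what you want):
    @btime predict_shares_bern($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)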
4 Likes

Does this bis version perform the same computation?

using BenchmarkTools, Random, Distributions, Base.Threads


Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

function predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))

    for id in market_ids
        @views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
    end

    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)

    for id in market_ids
        @views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

s1=predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s2=predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@assert s1 ≈ s2

@btime predict_shares_bern($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
@btime predict_shares_bern_bis($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
  464.903 ms (1262128 allocations: 1.88 GiB)  # orig
  41.977 ms (280030 allocations: 15.22 MiB)   # modified bis version
2 Likes

Hi Laurent,

Yes it does, the outputs are exactly the same. Thank you!

[screenshot of benchmark output omitted]

Which of your suggestions from the previous post are still relevant then? All of them, I think, right?

Regarding putting everything in a function: do I also put the loops in a function, or do I put the calculation inside the loops in a function?

You may check that these versions still compute the correct result, and play with the argument sizes and thread counts to see what is optimal for your computation:

using BenchmarkTools, Random, Distributions, Base.Threads


Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)
    @threads for i in 1:length(market_ids)
        @views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

function predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))

    for id in market_ids
        @views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
    end

    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)

    for id in market_ids
        @views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

function predict_shares_bern_ter(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))

    nj = size(num,2)
    for id in market_ids
        @views sumj = sum(num[id,:])
        for j ∈ 1:nj
            num[id,j] = num[id , j] / (sumj+1)
        end
    end

    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)

    for id in market_ids
        @views sumj = sum(num[id,:])
        for j ∈ 1:nj
            num[id,j] = num[id , j] / (sumj+1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

function predict_shares_bern_4(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    @views num = (mat_1 .= exp.(randvar_nu .+ delta ))

    nj = size(num,2)
    @threads for id in market_ids
        @views sumj = sum(num[id,:])
        for j ∈ 1:nj
            num[id,j] = num[id , j] / (sumj+1)
        end
    end

    vec_1 .= mean(num, dims = 2)
    num .= exp.(randvar_nu_inattention .+ delta)

    @threads for id in market_ids
        @views sumj = sum(num[id,:])
        for j ∈ 1:nj
            num[id,j] = num[id , j] / (sumj+1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end


s1=predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s2=predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s3=predict_shares_bern_ter(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s4=predict_shares_bern_4(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@assert s1 ≈ s2
@assert s1 ≈ s3
@assert s1 ≈ s4

@btime predict_shares_bern($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
@btime predict_shares_bern_bis($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
@btime predict_shares_bern_ter($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
@btime predict_shares_bern_4($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern)
  460.482 ms (1262128 allocations: 1.88 GiB)  # orig
  42.382 ms (280030 allocations: 15.22 MiB)   # bis
  34.087 ms (30 allocations: 1.34 MiB)        # ter
  26.706 ms (129 allocations: 1.35 MiB)       # 4 (with 8 threads)

1 Like

As you can see in the profile screenshot (not reproduced here), the time is now consumed by the two broadcast lines that compute the exponentials.
You may try to use @threads on these broadcasts now.
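
One hedged way to do that (my sketch, not from the reply above): since delta broadcasts along the rows, each column of num can be filled independently, so the broadcast can be split over columns:

    # Sketch: thread the `num .= exp.(randvar .+ delta)` broadcast column by column
    function threaded_exp!(num, randvar, delta)
        @threads for j in axes(num, 2)
            @views num[:, j] .= exp.(randvar[:, j] .+ delta)
        end
        return num
    end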

2 Likes

Thanks a ton. Do you think LoopVectorization could help with these lines as well? Or another function?

You often iterate over the columns of num in a given row (like sum(num[id, :])). Julia uses column-major ordering, so it might help to transpose the matrix and iterate over the rows in a given column.
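
For example (a schematic demo, not the OP's function), a strided row sum versus a contiguous column sum after transposing once:

    using BenchmarkTools

    A  = rand(35_000, 100)
    At = permutedims(A)             # transpose once, up front

    @btime sum(view($A, 17, :))     # strided: consecutive elements are 35_000 floats apart
    @btime sum(view($At, :, 17))    # contiguous: cache- and SIMD-friendly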

3 Likes

Pretty sure that LoopVectorization can help a lot. You can see in the flame graph that the exponentials consume all the time. This function has high arithmetic intensity, which allows for good speedups. You may use the @turbo and then the @tturbo macros.

1 Like

Better to just write a loop here rather than allocating multiple temporary arrays (for the index mask and the sum)? Loops are fast in Julia.

1 Like

Yes, that is what I proposed.

Here is an attempt to accelerate the exponential parts. It seems that @tturbo does not compose well with @threads.

I would recommend following the transposition proposed by @barucden.

using LoopVectorization  # needed for @tturbo below

function foo!(num, market_ids)
    @inbounds for id ∈ market_ids
        @views s = sum(num[id,:])
        @views num[id,:] ./= s+1
    end
end

function predict_shares_bern_5(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
    num = copy(randvar_nu)
    @tturbo num .= exp.(num .+ delta)
    foo!(num,market_ids)
    vec_1 .= mean(num, dims = 2)
    @tturbo num .= exp.(randvar_nu_inattention .+ delta)
    foo!(num,market_ids)
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end
  460.482 ms (1262128 allocations: 1.88 GiB) #orig
  42.382 ms (280030 allocations: 15.22 MiB)  #bis
  34.087 ms (30 allocations: 1.34 MiB)       #ter
  26.706 ms (129 allocations: 1.35 MiB)      #4 (with 8 threads)
  14.657 ms (32 allocations: 28.04 MiB)      #5 (@tturbo)
3 Likes

@tturbo composes better with Polyester.@batch.

Or you could use @threads/@spawn with regular @turbo.

Also, ThreadingUtilities.sleep_all_tasks() will put LoopVectorization/Polyester’s tasks to sleep immediately, which can help if you need to follow up with base threading.

1 Like

Hi Chris,

I did try Polyester.jl, but it is not faster in this case.

using Polyester  # for Polyester.@batch

@inbounds function foo!(num,market_ids)
    Polyester.@batch per=thread for id ∈ market_ids
        @views s = sum(num[id,:])
        @views num[id,:] ./= s+1
    end
end

I think one should try @threads combined with @turbo, as you proposed, and try to fuse the parallel tasks (after the transposition).

Anyway, the speedup already proposed to the OP is substantial (460 ms down to 14 ms on my machine), so we should probably let him play with the result :wink:

1 Like

Hi guys,

Thanks a ton for all the help. The speedup has been very interesting to watch. I even wrote another function that exploits column-major ordering with transposed matrices: with 200,000 observations, this brings the time from 150 ms down to 77 ms. Your last proposition, Laurent, runs in 160 ms for me right now.

This is then the fastest candidate right now:

function predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    ni = size(mat_1_t,1)
    nj = size(mat_1_t,2)
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i,j] =  exp(randvar_nu_t[i,j] + delta[j] )
    end

    @threads for id in market_ids
        @views sumj = sum(mat_1_t[:,id])
        for i ∈ 1:ni
            mat_1_t[i,id] = mat_1_t[i , id] / (sumj+1)
        end
    end

    vec_1 .= vec(mean(mat_1_t, dims = 1))

    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i,j] =  exp(randvar_nu_inattention_t[i,j] + delta[j] )
    end
    
    @threads for id in market_ids
        @views sumj = sum(mat_1_t[:,id])
        for i ∈ 1:ni
            mat_1_t[i,id] = mat_1_t[i,id] / (sumj+1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) +vec(mean(mat_1_t, dims = 1)) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

I have tried using @turbo in a part of it, but it does not run.

@threads for id in market_ids
    @views sumj = sum(mat_1_t[:,id])
    @turbo for i ∈ 1:ni
        mat_1_t[i,id] = mat_1_t[i,id] / (sumj+1)
    end
end

Thanks a lot already, this is now quite quick. Any last ideas for the last function I just posted? Incredible help!!

1 Like

I get better performance from

function foo_polyester!(num, market_ids)
    @batch per=core for id ∈ market_ids
        @views s = sum(num[id,:])
        @views num[id,:] ./= s+1
    end
end

which, used in place of foo!, gives predict_shares_bern_6.
Using 4 threads:

julia> @btime predict_shares_bern($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  547.625 ms (1262078 allocations: 1.88 GiB)

julia> @btime predict_shares_bern_bis($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  67.685 ms (280030 allocations: 15.22 MiB)

julia> @btime predict_shares_bern_ter($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  46.552 ms (30 allocations: 1.34 MiB)

julia> @btime predict_shares_bern_4($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  39.009 ms (78 allocations: 1.34 MiB)

julia> @btime predict_shares_bern_5($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  24.059 ms (35 allocations: 28.04 MiB)

julia> @btime predict_shares_bern_6($delta, $randvar_nu, $randvar_nu_inattention, $mat_1, $vec_1, $market_ids, $nu_bern);
  15.691 ms (32 allocations: 28.04 MiB)

julia> Threads.nthreads()
4

My performance doesn’t really improve with more threads; about 12 ms is the best it can do on my computer (which has 18 cores).

3 Likes

Cool, I think that the computation becomes memory bound with more than 4 threads.

I guess that the last (transposed) version from @ditfurth has a better data layout and could allow for some more acceleration.

I will try to play with it a little later (I should stop procrastinating on my real job now :wink: )
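
A rough back-of-the-envelope check of the memory-bound claim (my numbers; the assumed bandwidth is a guess and varies by machine): each call streams the 35,000 x 100 Float64 matrix several times, so there is a hard floor on the runtime regardless of thread count:

    bytes_per_pass = 35_000 * 100 * 8     # one full read or write of the matrix ≈ 28 MB
    passes         = 6                    # order-of-magnitude guess for this function
    bandwidth      = 20e9                 # assumed ~20 GB/s effective DRAM bandwidth
    floor_seconds  = bytes_per_pass * passes / bandwidth   # ≈ 8 ms lower bound

That is consistent with the ~12-15 ms plateau observed above.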

2 Likes

How come it doesn’t improve with more threads? I don’t really understand how to pick the number of threads. Say I have 128 cores on one computer, how would I go about choosing? Benchmarking the same function with every thread count from 1 to 256 is quite exhausting, because I would have to restart Julia in VS Code each time, which I find annoying :smiley:
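
One way around the restarting (a sketch; bench.jl is a placeholder name for a script containing your setup and the @btime calls): the thread count is fixed at Julia startup, but you can sweep it from a single session by launching the benchmark script as subprocesses with the -t flag:

    # each child process gets its own thread count via `julia -t N`
    for n in (1, 2, 4, 8, 16, 32, 64, 128)
        run(`julia -t $n bench.jl`)
    end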