# Speeding up a function

Hi all,

I have a function that needs some speeding up. I tried many things, here is the current state and a minimum working example

``````#

Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@btime predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
``````

I get this output:

Some notes:

• replace 35,000 with 10,000,000 if you want, that’s where I have to go.
• market_ids: right now it is just 1:1000 repeated, that’s not what it will be. The markets will have different sizes, so looping with 1000s will not work.
• I have nthreads() = 8 right now, I think the code will run on a computer with 260 cores+.
• mat_1 and vec_1 have no purpose other than being constructed once, as this function is run many times and thus I have less allocations overall.

Let me know if you have any ideas. Using DataFrames to subset the data etc. seems to be slower, I cannot get improvements with Tullio and LoopVectorization but I think it is me.

Many Thanks!

1 Like

Just one little thought as I am currently in hurry, you can probably speed it up by calculating `marked_ids .== i` only once in both of the threaded loops, like

``````   @threads for i in 1:length(market_ids)
end
``````
2 Likes

This already helped by x2 for 35,000 length of vector/matrix.

Do we have other ideas?

You may:

• put all the computation inside a function (no global variables)
• optimize a sequential version first (get rid of the `@threads`) until you eliminate all allocations
• profile the code `@profview` (and profile allocations)
• Once you have eliminated all allocations (pre_allocation, loops,…) then introduce multi-threading
• use \$ to interpolate `@benchmark` arguments
4 Likes

Does the bis version performs the same computation ?

``````using BenchmarkTools, Random, Distributions, Base.Threads

Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

function predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))

for id in market_ids
@views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
end

vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)

for id in market_ids
@views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

s1=predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s2=predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@assert s1 ≈ s2

@btime predict_shares_bern(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
@btime predict_shares_bern_bis(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
``````
``````  464.903 ms (1262128 allocations: 1.88 GiB)  #orig
41.977 ms (280030 allocations: 15.22 MiB)   #modified bis version
``````
2 Likes

Hi laurent,

Yes it does, the outputs are exactly the same. Thank you!

Which of your suggestions of the previous post are then still relevant? All, I think, right?

As regards putting all in a function: Do I also put the loops in a function, OR do I put the calculation inside the loops in a function?

You may check that this ones still compute the correct result and play with the arg size and thread numbers to see what is optimal for your computation:

``````using BenchmarkTools, Random, Distributions, Base.Threads

Random.seed!(123)
market_ids = repeat(1:1000,35)
delta = rand(Normal(1,5), 35000)
nu_bern = -1
randvar_nu = randn(35000,100)*5 .+ 6
randvar_nu_inattention= randn(35000,100)*5 .+ 6.5
mat_1 = similar(randvar_nu)
vec_1 = similar(delta)

function predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)
@views num[market_ids .== i,:] .= num[market_ids .== i , :] ./ (sum(num[market_ids .== i,:],dims = 1) .+1)
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

function predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))

for id in market_ids
@views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
end

vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)

for id in market_ids
@views num[id,:] .= num[id , :] ./ (sum(num[id,:],dims = 1) .+1)
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

function predict_shares_bern_ter(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))

nj = size(num,2)
for id in market_ids
@views sumj = sum(num[id,:])
for j ∈ 1:nj
num[id,j] = num[id , j] / (sumj+1)
end
end

vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)

for id in market_ids
@views sumj = sum(num[id,:])
for j ∈ 1:nj
num[id,j] = num[id , j] / (sumj+1)
end
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

function predict_shares_bern_4(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
@views num = (mat_1 .= exp.(randvar_nu .+ delta ))

nj = size(num,2)
@views sumj = sum(num[id,:])
for j ∈ 1:nj
num[id,j] = num[id , j] / (sumj+1)
end
end

vec_1 .= mean(num, dims = 2)
num .= exp.(randvar_nu_inattention .+ delta)

@views sumj = sum(num[id,:])
for j ∈ 1:nj
num[id,j] = num[id , j] / (sumj+1)
end
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end

s1=predict_shares_bern(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s2=predict_shares_bern_bis(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s3=predict_shares_bern_ter(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
s4=predict_shares_bern_ter(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)

@assert s1 ≈ s2
@assert s1 ≈ s3
@assert s1 ≈ s4

@btime predict_shares_bern(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
@btime predict_shares_bern_bis(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
@btime predict_shares_bern_ter(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
@btime predict_shares_bern_4(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern)
``````
``````  460.482 ms (1262128 allocations: 1.88 GiB) #orig
42.382 ms (280030 allocations: 15.22 MiB)  #bis
34.087 ms (30 allocations: 1.34 MiB)           #ter
26.706 ms (129 allocations: 1.35 MiB)        #4 (with 8 threads)
``````
1 Like

As you can see in the profile screen shot, the time is now consumed at lines 70 and 81.

2 Likes

Thanks a ton. Do you think that LoopVectorization could help in these lines as well? Or another function?

You often iterate over the columns of `num` in a given row (like `sum(num[id, :])`). Julia uses column-major ordering, so it might help to transpose the matrix and iterate over the rows in a given column.

3 Likes

Pretty sure that Loop vectorization can help a lot. You see in the flame graph that exponentials consume all the time. This function has a great arithmetic intensity hence allowing for good speedups. You may use turbo and then tturbo macros

1 Like

Better to just write a loop here rather than allocating multiple temporary arrays (for the index mask and the sum)? Loops are fast in Julia.

1 Like

Yes, this is what I have proposed.

Here is an attempt to accelerate the exponential parts. It seems that `@tturbo` does not compose well with `@threads`.

I would recommend to follow the transposition proposed by @barucden.

``````function foo!(num,market_ids)
@inbounds for id ∈ market_ids
@views s = sum(num[id,:])
@views num[id,:] ./= s+1
end
end

function predict_shares_bern_5(delta, randvar_nu, randvar_nu_inattention, mat_1, vec_1, market_ids, nu_bern)
num = copy(randvar_nu)
@tturbo num .= exp.(num .+ delta)
foo!(num,market_ids)
vec_1 .= mean(num, dims = 2)
@tturbo num .= exp.(randvar_nu_inattention .+ delta)
foo!(num,market_ids)
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + mean(num, dims = 2) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end
``````
``````  460.482 ms (1262128 allocations: 1.88 GiB) #orig
42.382 ms (280030 allocations: 15.22 MiB)  #bis
34.087 ms (30 allocations: 1.34 MiB)       #ter
26.706 ms (129 allocations: 1.35 MiB)      #4 (with 8 threads)
14.657 ms (32 allocations: 28.04 MiB)      #5 (@tturbo)
``````
3 Likes

`@tturbo` composes better with `Polyester.@batch`.

Or you could use `@threads`/`@spawn` with regular `@turbo`.

Also, `ThreadingUtilities.sleep_all_tasks()` will put LoopVectorization/Polyester’s tasks to sleep immediately, which can help if you need to follow up with base threading.

1 Like

Hi Chris,

I did try to use Polyster.jl but it is not faster in this case.

``````@inbounds function foo!(num,market_ids)
Polyester.@batch per=thread for id ∈ market_ids
@views s = sum(num[id,:])
@views num[id,:] ./= s+1
end
end
``````

I think one should try to use `@thread` conbined with `@turbo` as you proposed and try to fuse the parallel tasks (after the transposition).

Anyway, the SpUp proposed to the OP is already substantial (460 to 14 ms) on my machine so we should probably let him play with the result

1 Like

Hi guys,

Thanks a ton for all the help. The speedup has been very interesting to watch, I even wrote another function where I use the column major ordering and transposed matrices. With 200,000 observations, this increases the speed from 150ms to 77ms. Your last proposition, Laurent, runs for 160ms right now.

This is then the fastest candidate right now:

``````function predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
ni = size(mat_1_t,1)
nj = size(mat_1_t,2)
@tturbo for i in 1:ni, j in 1:nj
mat_1_t[i,j] =  exp(randvar_nu_t[i,j] + delta[j] )
end

@views sumj = sum(mat_1_t[:,id])
for i ∈ 1:ni
mat_1_t[i,id] = mat_1_t[i , id] / (sumj+1)
end
end

vec_1 .= vec(mean(mat_1_t, dims = 1))

@tturbo for i in 1:ni, j in 1:nj
mat_1_t[i,j] =  exp(randvar_nu_inattention_t[i,j] + delta[j] )
end

@views sumj = sum(mat_1_t[:,id])
for i ∈ 1:ni
mat_1_t[i,id] = mat_1_t[i,id] / (sumj+1)
end
end
share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) +vec(mean(mat_1_t, dims = 1)) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
return share
end
``````

I have tried using @turbo in a part of it, but it does not run.

``````@threads for id in market_ids
@views sumj = sum(mat_1_t[:,id])
@turbo for i ∈ 1:ni
mat_1_t[i,id] = mat_1_t[i,id] / (sumj+1)
end
end
``````

Thanks a lot already, this is now quite quick. Any last ideas for the last function i just posted? Incredible help!!

1 Like

I get better performance from

``````function foo_polyester!(num,market_ids)           @batch per=core for id ∈ market_ids
@views s = sum(num[id,:])
@views num[id,:] ./= s+1
end
end
``````

which is `predict_shares_bern_6`.

``````julia> @btime predict_shares_bern(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
547.625 ms (1262078 allocations: 1.88 GiB)

julia> @btime predict_shares_bern_bis(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
67.685 ms (280030 allocations: 15.22 MiB)

julia> @btime predict_shares_bern_ter(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
46.552 ms (30 allocations: 1.34 MiB)

julia> @btime predict_shares_bern_4(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
39.009 ms (78 allocations: 1.34 MiB)

julia> @btime predict_shares_bern_5(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
24.059 ms (35 allocations: 28.04 MiB)

julia> @btime predict_shares_bern_6(\$delta, \$randvar_nu, \$randvar_nu_inattention, \$mat_1, \$vec_1, \$market_ids, \$nu_bern);
15.691 ms (32 allocations: 28.04 MiB)

4
``````

My performance doesn’t really improve with more threads; 12ms is about the best it can do on my computer (with 18 cores).

3 Likes

Cool, I think that the computation becomes memory bound with more than 4 threads.

I guess that the last version of @ditfurth (transposed) has a better data layout and could allow for some more acceleration.

I should try to play with it a little later (I should stop now to procrastinate on my real job )

2 Likes

How come it doesn’t improve with more threads? I don’t really understand how to pick number of threads. Say I have 128 cores on one computer, how would I go about choosing them? Running the same function with x = 1:256 number of threads is quite exhausting, because I would have to restart julia each time in vscode, which I found to be annoying