Speeding up a function

Hi Elrod,

Thanks, yes, I did move the inbounds inside now. Thank you!

You are right; I had actually overlooked this, and I have been getting wrong results ever since I abandoned the code that used a mask like this

@threads for i in 1:length(market_ids)
    mask = market_ids .== i
    @views num[mask, :] .= num[mask, :] ./ (sum(num[mask, :], dims = 1) .+ 1)
end

in the function foo.

So maybe I will have to do something like this. I now apply foo_t! to both matrices at once, because the subsetting is what eats up most of the computing time.

using LoopVectorization, Statistics  # for @turbo and mean

# Row-normalize within each market. Threads handle disjoint markets,
# so no two threads write to the same columns.
function foo_t!(num1, num2, market_ids)
    ni = size(num1, 1)
    Threads.@threads for id in unique(market_ids)
        mask = market_ids .== id
        @inbounds for i in 1:ni
            @views s1 = sum(num1[i, mask]) + 1
            @views s2 = sum(num2[i, mask]) + 1
            @views num1[i, mask] .= num1[i, mask] ./ s1
            @views num2[i, mask] .= num2[i, mask] ./ s2
        end
    end
end

# Fill mat_1_t[i, j] with exp(randvar_nu_t[i, j] + delta[j]),
# threading over columns and vectorizing down each column.
function bar_t!(mat_1_t, randvar_nu_t, delta)
    ni, nj = size(mat_1_t)
    Threads.@threads for j in 1:nj
        @turbo for i in 1:ni
            mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
        end
    end
end

# Column means of mat_1_t, written into vec_1.
@views function mean_t!(vec_1, mat_1_t)
    Threads.@threads for j in eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:, j])
    end
end

function predict_shares_bern_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, mat_2_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t, randvar_nu_t, delta)
    bar_t!(mat_2_t, randvar_nu_inattention_t, delta)
    foo_t!(mat_1_t, mat_2_t, market_ids)
    mean_t!(vec_1, mat_1_t)
    mean_t!(vec_2, mat_2_t)

    ee = exp(nu_bern) / (1 + exp(nu_bern))     # logistic weight
    vec_1 .= vec_1 .* ee .+ vec_2 .* (1 - ee)  # .+ keeps the update allocation-free
    return vec_1
end
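Since the subsetting is the expensive part, another variant I may try (just a sketch; foo_idx! and groups are names I made up here) is to precompute each market's column indices once with findall instead of rebuilding a Bool mask for every market:

function foo_idx!(num1, num2, market_ids)
    # one findall per market, computed once up front
    groups = [findall(==(id), market_ids) for id in unique(market_ids)]
    Threads.@threads for cols in groups
        @inbounds for i in 1:size(num1, 1)
            s1 = sum(@view num1[i, cols]) + 1
            s2 = sum(@view num2[i, cols]) + 1
            @views num1[i, cols] ./= s1
            @views num2[i, cols] ./= s2
        end
    end
end

Indexing with an integer vector avoids scanning the whole market_ids vector once per market on every call.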


The code above gives me the following performance with 8 threads:
[benchmark screenshot]

I was thinking about supplying a single preallocated bit vector and overwriting it each iteration, but that does not work with Threads.@threads on the outer loop in foo_t! (all threads would write into the same buffer concurrently), which makes sense.
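What should work instead (a sketch with a made-up name, foo_buf!, taking the same arguments as foo_t!) is one mask buffer per task, so the buffer is reused across markets but never shared between threads:

using Base.Threads

function foo_buf!(num1, num2, market_ids)
    ids = unique(market_ids)
    chunks = Iterators.partition(ids, cld(length(ids), nthreads()))
    @sync for chunk in chunks
        Threads.@spawn begin
            mask = falses(length(market_ids))   # task-local buffer
            for id in chunk
                mask .= market_ids .== id       # overwritten in place, no new allocation
                @inbounds for i in 1:size(num1, 1)
                    s1 = sum(@view num1[i, mask]) + 1
                    s2 = sum(@view num2[i, mask]) + 1
                    @views num1[i, mask] ./= s1
                    @views num2[i, mask] ./= s2
                end
            end
        end
    end
end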

→ Fundamental rule of optimization: have a solid way to assess result correctness before entering the optimization process.
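Concretely, something like this (a sketch; foo_ref! stands for a plain sequential reference implementation you would write first, it is not code from this thread):

# random inputs, one optimized run, one reference run, approximate compare
function check_foo(n = 50, m = 200)
    num1, num2 = rand(n, m), rand(n, m)
    ids = rand(1:10, m)
    a1, a2 = copy(num1), copy(num2)
    b1, b2 = copy(num1), copy(num2)
    foo_t!(a1, a2, ids)      # threaded version under test
    foo_ref!(b1, b2, ids)    # trusted sequential reference
    @assert a1 ≈ b1 && a2 ≈ b2 "threaded kernel diverged from the reference"
end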

BTW, I did try to make this assessment (@assert s1 ≈ sxxx).

I do not have time to understand what your code is doing, but I guess you can take away some hints (allocations, profiling, LoopVectorization, …) to continue your optimization journey.
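For the allocation hint, for example (a sketch assuming the BenchmarkTools package and arbitrary sizes):

using BenchmarkTools

num1, num2 = rand(500, 2000), rand(500, 2000)
market_ids = rand(1:20, 2000)
# fresh copies per evaluation since foo_t! mutates; @btime also reports
# how much is allocated, which is the first number to drive down
@btime foo_t!(a, b, $market_ids) setup = (a = copy($num1); b = copy($num2)) evals = 1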


Yes, I will use what I have learnt. I used @assert too, but somehow it always passed; I don't know why :confused:. Next time I will use results1 == results2 to be safe, which is what I did at the end of the day.

I learnt a lot from you, thanks!

The problem with strict equality is that it forbids the use of parallel reductions. Floating-point arithmetic is not associative, and a sum over a vector of floats will produce different results depending on the algorithm (parallel or not). This does not mean that the parallel result is less accurate than a naive sequential loop (it is usually more accurate), but it emphasizes the finite accuracy of FP operations.
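For example:

x = rand(10_000)
s1 = sum(x)         # pairwise summation (Julia's default algorithm)
s2 = foldl(+, x)    # strict left-to-right accumulation
s1 == s2            # typically false: a different association order
s1 ≈ s2             # true: both agree up to FP rounding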

P.S. Just to avoid confusion: this does not mean that the previous parallel code is correct. If threads concurrently write to the same memory location, then you (may) obtain totally wrong and non-deterministic results (data races).
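A minimal illustration, together with the atomic fix:

using Base.Threads

acc = Ref(0)
@threads for k in 1:1_000_000
    acc[] += 1            # unsynchronized read-modify-write: data race
end
acc[]                     # usually well below 1_000_000 with > 1 thread

safe = Atomic{Int}(0)
@threads for k in 1:1_000_000
    atomic_add!(safe, 1)  # atomic increment, no race
end
safe[]                    # exactly 1_000_000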
