Thanks, I am going to watch some videos about it as well.
I do not know if you care, but Laurent’s function below is more accurate, in the sense that the result is always the same. If I replace Laurent’s foo function with Elrod’s foo_polyester! version, my function produces slightly different results every time I run it.
That being said, Elrod’s function improves speed from 75 ms to 60 ms, but I need the accuracy.
Here is an analogy that I use in my lectures: let’s imagine a tailoring workshop where n_c seamstresses make clothes and n_t carriers bring the fabric and deliver the clothes.
Two questions :
If the seamstresses are often out of work because the fabric does not arrive or the finished clothes are slow to be delivered, should we add seamstresses? (memory bound)
If a given type of clothing requires a lot of work time and little fabric, and all the seamstresses are used to the maximum, should we add seamstresses? (compute bound)
Conclusion #1 :
Parallelizable algorithms are characterized by their arithmetic intensity, which measures the ratio of the amount of work to the amount of data read from and written to RAM.
Conclusion #2: the exponential is the wedding dress of parallelism
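The two conclusions can be sketched with a toy pair of kernels (hypothetical names, not from this thread): both stream the same number of bytes per element, but they differ in the work done per byte.

```julia
# Two kernels with identical RAM traffic (read x, write y: 16 bytes per
# element) but very different arithmetic intensity.
low_intensity!(y, x)  = (y .= 2.0 .* x .+ 1.0)  # ~2 flops/element: memory bound,
                                                # extra threads are idle seamstresses
high_intensity!(y, x) = (y .= exp.(x))          # tens of flops/element: compute
                                                # bound, extra threads pay off

x = rand(10^6)
y = similar(x)
low_intensity!(y, x)
high_intensity!(y, x)
```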
Hahaha, I love that analogy; it makes a lot of sense to me. So arithmetic intensity depends on the function and on the number of threads/cores? Then I will try to measure the intensity and maybe I can find a way to increase it.
The A.I. depends only on the function (not on the threads/cores, which are the seamstresses).
For a given algorithm/function one must minimize the RAM traffic and reuse the data that fit in cache (spatial and temporal data locality).
One should also try to group the computation in a minimum set of parallel sections (kernel fusion).
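Kernel fusion as described above can be sketched like this (a minimal, hypothetical example; `unfused!`/`fused!` are my names, not the thread’s functions):

```julia
# Two parallel sections: x is swept once, then y is swept again,
# so the intermediate array round-trips through RAM.
function unfused!(y, z, x)
    Threads.@threads for i in eachindex(x)
        y[i] = exp(x[i])
    end
    Threads.@threads for i in eachindex(x)
        z[i] = y[i] / (1 + y[i])
    end
end

# One fused parallel section: x is read once and the intermediate
# value stays in a register.
function fused!(y, z, x)
    Threads.@threads for i in eachindex(x)
        e = exp(x[i])
        y[i] = e
        z[i] = e / (1 + e)
    end
end
```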
I don’t know if it will help speed, but as a style note (my style): @view is typically used line by line to indicate a single view, while @views (plural) transforms all slices in a code block into views. So you could remove all the @views inside the function and instead declare that the whole thing uses views with @views function predict_shares_bern(..., unless there is a slice that you do want to make a copy of.
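A minimal sketch of the two styles (hypothetical `colsum_*` helpers, not the thread’s function):

```julia
# Per-slice style: mark each individual slice with @view.
colsum_a(A, j) = sum(@view A[:, j])

# Block style: @views on the definition turns every slice in the
# body into a view, so no slice allocates a copy.
@views function colsum_b(A, j)
    return sum(A[:, j])   # A[:, j] is a view here
end
```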
Try eliminating the common subexpression (e = exp(nu_bern)) manually; I don’t think that Julia is able to do that yet.
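The manual elimination looks like this (a sketch with made-up names; the weights mirror the thread’s blending line):

```julia
# exp(nu) appears three times and is recomputed each time.
blend_naive(v1, v2, nu) =
    v1 * (exp(nu) / (1 + exp(nu))) + v2 * (1 - exp(nu) / (1 + exp(nu)))

# Hoist the common subexpression: one exp call, one weight.
function blend_cse(v1, v2, nu)
    e = exp(nu)
    w = e / (1 + e)
    return v1 * w + v2 * (1 - w)
end
```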
I just submitted an issue on GitHub, btw:
Note that 1 - exp(x)/(1 + exp(x)) is equivalent to simply 1/(1 + exp(x)).
The latter is also more accurate (especially for large positive x, where the subtraction form cancels to zero).
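A small demonstration of the rounding difference (the names are mine):

```julia
f_sub(x)   = 1 - exp(x) / (1 + exp(x))
f_recip(x) = 1 / (1 + exp(x))

# At x = 40, exp(x)/(1 + exp(x)) rounds to exactly 1.0, so the subtraction
# form cancels to 0.0, while the reciprocal form keeps the tiny true value.
f_sub(40.0)    # → 0.0
f_recip(40.0)  # ≈ 4.25e-18
```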
Ok, got it, thanks!
Ok, thanks both!
Thanks, will look into it!
I know this isn’t an educational forum per se, but would you be willing to elaborate on “use $ to interpolate @benchmark arguments”, by any chance?
Your question is fine.
First see the Avoid untyped global variables section in the performance tips for context.
Regarding @benchmark specifically, see: Interpolating values into benchmark expressions and Quick Start.
Expression interpolation itself is documented in Julia’s Manual in the page on Metaprogramming: Interpolation.
But note that interpolation is only necessary if your benchmark depends on global variables: depending on, e.g., constants and functions is fine.
Also interpolating variables into the benchmark typically only matters if your function takes <100ns or so.
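For concreteness, a sketch of the `$` splicing (only the variable name `xs` is made up):

```julia
using BenchmarkTools

xs = rand(1000)   # an untyped global in the REPL / in a script

# Without $, the benchmarked expression looks up the untyped global `xs`
# on every evaluation, so the timing includes that overhead. With $, the
# value is spliced in and benchmarked as a typed local, so only the cost
# of sum itself is measured.
@btime sum(xs)
@btime sum($xs)
```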
On my laptop (Apple M1 Max) this version is more than 3× faster for n = 200000 and looks accurate:
julia> main()
93.055 ms (110 allocations: 7.64 MiB) #predict_shares_bern_5_t
22.784 ms (306 allocations: 4.60 MiB) #predict_shares_bern_7_t
@views @inbounds function foo_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end

@inbounds function bar_t!(mat_1_t, randvar_nu_t, delta)
    ni, nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
        end
    end
end

@views @inbounds function mean_t!(vec_1, mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:, j])
    end
end

function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t, randvar_nu_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1, mat_1_t)
    bar_t!(mat_1_t, randvar_nu_inattention_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2, mat_1_t)
    ee = exp(nu_bern) / (1 + exp(nu_bern))
    vec_1 .= vec_1 .* ee + vec_2 .* (1 - ee)
    return vec_1
end
Complete code:
using BenchmarkTools, Random, Distributions, ProfileView, LoopVectorization
function predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    ni = size(mat_1_t, 1)
    nj = size(mat_1_t, 2)
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
    end
    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:, id])
        for i ∈ 1:ni
            mat_1_t[i, id] = mat_1_t[i, id] / (sumj + 1)
        end
    end
    vec_1 .= vec(mean(mat_1_t, dims = 1))
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i, j] = exp(randvar_nu_inattention_t[i, j] + delta[j])
    end
    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:, id])
        for i ∈ 1:ni
            mat_1_t[i, id] = mat_1_t[i, id] / (sumj + 1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + vec(mean(mat_1_t, dims = 1)) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

@views @inbounds function foo_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end

@inbounds function bar_t!(mat_1_t, randvar_nu_t, delta)
    ni, nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
        end
    end
end

@views @inbounds function mean_t!(vec_1, mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:, j])
    end
end

function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t, randvar_nu_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1, mat_1_t)
    bar_t!(mat_1_t, randvar_nu_inattention_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2, mat_1_t)
    ee = exp(nu_bern) / (1 + exp(nu_bern))
    vec_1 .= vec_1 .* ee + vec_2 .* (1 - ee)
    return vec_1
end

function main()
    n = 200000
    markets::Int64 = 200
    Random.seed!(123)
    market_ids = repeat(1:1000, markets)
    market_ids_t = transpose(market_ids)
    delta = rand(Normal(1, 5), n)
    log_share = rand(Normal(1, 5), n)
    log_share_outside = rand(Normal(1, 5), n)
    nu_bern = -1
    randvar_nu = randn(n, 100) * 5 .+ 6
    randvar_nu_t = transpose(randvar_nu)
    randvar_nu_inattention = randn(n, 100) * 5 .+ 6.5
    randvar_nu_inattention_t = transpose(randvar_nu_inattention)
    mat_1 = similar(randvar_nu)
    mat_1_t = similar(randvar_nu_t)
    vec_1 = similar(delta)
    vec_2 = similar(delta)
    s5t = predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    s7t = predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    @assert s5t ≈ s7t
    @btime predict_shares_bern_5_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $market_ids, $nu_bern)
    @btime predict_shares_bern_7_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $vec_2, $market_ids, $nu_bern)
    nothing
end
hope it helps
Those @inbounds outside the function definition do not work:
julia> @inbounds get(x,i) = x[i]
get (generic function with 1 method)
julia> get2(x,i) = @inbounds x[i]
get2 (generic function with 1 method)
julia> x = rand(3);
julia> @code_llvm debuginfo=:none get(x, 1)
; Function Signature: get(Array{Float64, 1}, Int64)
define double @julia_get_2224({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
%0 = add i64 %"i::Int64", -1
%1 = bitcast {}* %"x::Array" to { i8*, i64, i16, i16, i32 }*
%.length_ptr = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%.length = load i64, i64* %.length_ptr, align 8
%inbounds = icmp ult i64 %0, %.length
br i1 %inbounds, label %idxend, label %oob
oob: ; preds = %top
%errorbox = alloca i64, align 8
store i64 %"i::Int64", i64* %errorbox, align 8
call void @ijl_bounds_error_ints({}* %"x::Array", i64* nonnull %errorbox, i64 1)
unreachable
idxend: ; preds = %top
%2 = bitcast {}* %"x::Array" to double**
%.data1 = load double*, double** %2, align 8
%3 = getelementptr inbounds double, double* %.data1, i64 %0
%.ref = load double, double* %3, align 8
ret double %.ref
}
julia> @code_llvm debuginfo=:none get2(x, 1)
; Function Signature: get2(Array{Float64, 1}, Int64)
define double @julia_get2_2315({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
%0 = add i64 %"i::Int64", -1
%1 = bitcast {}* %"x::Array" to double**
%.data1 = load double*, double** %1, align 8
%2 = getelementptr inbounds double, double* %.data1, i64 %0
%.ref = load double, double* %2, align 8
ret double %.ref
}
Thanks a ton. I only have a problem with
@views @inbounds function foo2_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end
This introduces deviations. Do you know why?
i.e., compare this
function foo1_t!(num, market_ids)
    @inbounds for id ∈ market_ids
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end

@views @inbounds function foo2_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end
mat_1_t = similar(randvar_nu_t)
mat_2_t = similar(randvar_nu_t)
bar_t!(mat_1_t,randvar_nu_t,delta)
bar_t!(mat_2_t,randvar_nu_t,delta)
mat_1_t == mat_2_t
foo1_t!(mat_1_t, market_ids)
foo2_t!(mat_2_t, market_ids)
mat_1_t == mat_2_t
You can see that the two results are not the same at the end, but they are equal after the bar_t!() call.
It does not matter too much; you helped me a ton! I just thought it was weird that Threads.@threads somehow changes the result.
Are all market_ids unique? If not, you can’t Threads.@threads over them without special care.
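In the posted main(), market_ids = repeat(1:1000, markets), so every id occurs many times; two threads can then normalize the same column concurrently, which is a data race. A race-free sketch (assuming each column is meant to be normalized exactly once; `foo_safe!` is my name) threads over the unique ids instead:

```julia
function foo_safe!(num, market_ids)
    # unique() removes the duplicates, so no two iterations
    # (and hence no two threads) touch the same column.
    Threads.@threads for id in unique(market_ids)
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end
```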
Also, to repeat my earlier comment, which went ignored: those @inbounds are not doing anything, because @inbounds does not penetrate function boundaries. If you want @inbounds, move it inside.
Note also that Threads.@threads introduces a function boundary (a closure), so move it inside the for loop (if you don’t simply delete it altogether).
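A sketch of the working placement (a toy function, not the thread’s code): because Threads.@threads turns the loop body into a closure, only an @inbounds inside that body reaches the indexing.

```julia
function scale_cols!(A)
    Threads.@threads for j in axes(A, 2)   # body becomes a closure
        @inbounds for i in axes(A, 1)      # so @inbounds must live inside it
            A[i, j] *= 2
        end
    end
end
```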