Speeding up a function


Thanks, I am going to watch some videos about it as well.

I do not know if you care, but Laurent’s function below is more reliable, in the sense that the result is always the same. If I replace Laurent’s foo function with Elrod’s foo_polyester! version, my function produces slightly different results every time I run it.
That being said, Elrod’s function improves speed from 75 ms to 60 ms, but I need the reproducibility.

Here is an analogy that I use for my lecture on parallelism:

Let’s imagine a tailoring workshop where

  • nc seamstresses make clothes and
  • nt carriers bring the fabric and deliver the clothes.

Two questions :

If the seamstresses are often idle because the fabric does not arrive or the finished clothes are slow to be delivered, should we add seamstresses? (memory bound)

If a given type of clothing requires a lot of work time and little fabric, and all the seamstresses are already fully busy, should we add seamstresses? (compute bound)

Conclusion #1 :

Parallelizable algorithms are characterized by their arithmetic intensity, which measures the ratio of the amount of work to the amount of data read from and written to RAM.
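To make that ratio concrete, here is a back-of-the-envelope sketch in Julia (the flop counts are hypothetical; in particular, counting exp as ~20 flops is an assumed cost, not a measured one):

```julia
# Arithmetic intensity (AI) = flops performed / bytes moved to and from RAM.

# Kernel 1: y .= a .* x .+ y over Float64 vectors (one mul + one add per element)
flops_axpy = 2
bytes_axpy = 3 * sizeof(Float64)    # read x, read y, write y
ai_axpy = flops_axpy / bytes_axpy   # ≈ 0.083 flop/byte → memory bound

# Kernel 2: y .= exp.(x), counting exp as roughly 20 flops (assumption)
flops_exp = 20
bytes_exp = 2 * sizeof(Float64)     # read x, write y
ai_exp = flops_exp / bytes_exp      # 1.25 flop/byte → much more work per byte moved
```

With ~15x more work per byte moved, the exp-heavy kernel keeps the "seamstresses" busy, which is why it parallelizes so well.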

Conclusion #2: the exponential is the wedding dress of parallelism :wink:


hahaha, I love that analogy; it makes a lot of sense to me. And arithmetic intensity depends on the function and on the ratio of threads to cores? Then I will try to measure the intensity and maybe I can find a way to optimize it.

The arithmetic intensity depends only on the function (not on the threads/cores, which are the seamstresses).

For a given algorithm/function one must minimize the RAM traffic and reuse the data that fit in cache (spatial and temporal data locality).
One should also try to group the computation in a minimum set of parallel sections (kernel fusion).
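As an illustration of both points in Julia: broadcast fusion merges element-wise operations into a single loop, so each element is read and written once instead of making a round trip through memory between passes (the rewrite of exp(x)/(1+exp(x)) as 1/(1+exp(-x)) is mathematically equivalent):

```julia
x = randn(1000)
y1 = similar(x); y2 = similar(x)

# Two passes over y1: the intermediate result travels through memory twice,
# and with threading it would need two parallel sections.
y1 .= exp.(x)
y1 ./= 1 .+ y1

# One fused broadcast: a single loop, each element read once and written once.
y2 .= 1 ./ (1 .+ exp.(-x))
```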


I don’t know if it will help speed, but as a style note (my style), @view is typically used line by line to indicate a single view, while @views (plural) will transform all slices in a code block into views. So you could remove all the @views inside the function and instead declare that the whole thing uses views with @views function predict_shares_bern(..., unless there is a slice that you do want to copy.
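A small sketch of the difference (the colsum functions here are hypothetical, not from the thread):

```julia
# Line-by-line: each slice is marked as a view individually.
function colsum_view(A)
    s = 0.0
    for j in axes(A, 2)
        s += sum(@view A[:, j])   # @view (singular) applies only to this slice
    end
    return s
end

# Block form: @views (plural) turns every slice in the body into a view.
@views function colsum_views(A)
    s = 0.0
    for j in axes(A, 2)
        s += sum(A[:, j])         # no copy here, thanks to the function-level @views
    end
    return s
end
```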


Try eliminating the common subexpression (e = exp(nu_bern)) manually; I don’t think that Julia is able to do that yet.
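For example, applied to the weight used when combining the shares (a sketch with toy data; the names follow the thread):

```julia
nu_bern = -1.0
vec_1 = [0.2, 0.4]   # toy data, stand-ins for the real share vectors
vec_2 = [0.1, 0.3]

# Before: exp(nu_bern) is evaluated four times in the one expression.
share0 = vec_1 .* (exp(nu_bern) / (1 + exp(nu_bern))) .+
         vec_2 .* (1 - exp(nu_bern) / (1 + exp(nu_bern)))

# After: hoist the common subexpression once.
e  = exp(nu_bern)
ee = e / (1 + e)
share = vec_1 .* ee .+ vec_2 .* (1 - ee)
```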

I just submitted an issue on GitHub, btw:

Note that $1 - \frac{\exp(x)}{1 + \exp(x)}$ is equivalent to simply $\frac{1}{1 + \exp(x)}$.


Also, it will be more accurate (especially for large positive x, where subtracting the ratio from 1 cancels all significant digits).
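A quick demonstration of where the two forms diverge in Float64: the cancellation bites for large positive x, where exp(x)/(1+exp(x)) rounds to exactly 1.0:

```julia
x = 40.0
a = 1 - exp(x) / (1 + exp(x))   # 1 + exp(40) rounds to exp(40), so the ratio is 1.0 and a == 0.0
b = 1 / (1 + exp(x))            # ≈ 4.25e-18, correct to full precision
```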


Ok, got it, thanks!

Ok, thanks both!

Thanks, will look into it!

I know this isn’t an educational forum per se, but would you be willing to elaborate on, ‘use $ to interpolate @benchmark arguments’, by any chance?

Your question is fine.

First see the Avoid untyped global variables section in the performance tips for context.

Regarding @benchmark, specifically, see: Interpolating values into benchmark expressions and Quick Start.

Expression interpolation itself is documented in Julia’s Manual in the page on Metaprogramming: Interpolation.

But note that interpolation is only necessary if your benchmark depends on global variables: depending on, e.g., constants and functions is fine.
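The mechanism is ordinary expression interpolation; a minimal sketch (the @benchmark line in the comment shows the analogous usage, not executed here):

```julia
x = 3
by_name  = :(x + 1)    # the quoted expression refers to the global binding x
by_value = :($x + 1)   # $ splices in the current value: the expression is 3 + 1

# @benchmark $x + 1 works the same way: the benchmarked code receives the
# concrete value instead of looking up an untyped global on every evaluation.
```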


Also, interpolating variables into the benchmark typically only matters if your function takes less than ~100 ns.


On my laptop (Apple M1 Max), this version is more than 3x faster for n = 200000 and looks accurate:

julia> main()
  93.055 ms (110 allocations: 7.64 MiB) #predict_shares_bern_5_t
  22.784 ms (306 allocations: 4.60 MiB) #predict_shares_bern_7_t
@views @inbounds function foo_t!(num,market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:,id])
        num[:,id] ./= s+1
    end
end

@inbounds function bar_t!(mat_1_t,randvar_nu_t,delta)
    ni,nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i,j] =  exp(randvar_nu_t[i,j] + delta[j] )
        end
    end
end

@views @inbounds function mean_t!(vec_1,mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:,j]) 
    end
end


function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t,randvar_nu_t,delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1,mat_1_t)
    bar_t!(mat_1_t,randvar_nu_inattention_t,delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2,mat_1_t)

    ee = (exp(nu_bern)/(1+exp(nu_bern)))
    vec_1 .= vec_1 .* ee .+ vec_2 .* (1 - ee)
    return vec_1
end
Complete code:
using BenchmarkTools, Random, Distributions, ProfileView, LoopVectorization

function predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    ni = size(mat_1_t,1)
    nj = size(mat_1_t,2)
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i,j] =  exp(randvar_nu_t[i,j] + delta[j] )
    end

    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:,id])
        for i ∈ 1:ni
            mat_1_t[i,id] = mat_1_t[i , id] / (sumj+1)
        end
    end

    vec_1 .= vec(mean(mat_1_t, dims = 1))

    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i,j] =  exp(randvar_nu_inattention_t[i,j] + delta[j] )
    end
    
    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:,id])
        for i ∈ 1:ni
            mat_1_t[i,id] = mat_1_t[i,id] / (sumj+1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) +vec(mean(mat_1_t, dims = 1)) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end



@views @inbounds  function foo_t!(num,market_ids)
    Threads.@threads for id ∈ market_ids
        @views s = sum(num[:,id])
        @views num[:,id] ./= s+1
    end
end

@inbounds function bar_t!(mat_1_t,randvar_nu_t,delta)
    ni,nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i,j] =  exp(randvar_nu_t[i,j] + delta[j] )
        end
    end
end

@views @inbounds function mean_t!(vec_1,mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:,j]) 
    end
end


function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t,randvar_nu_t,delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1,mat_1_t)
    bar_t!(mat_1_t,randvar_nu_inattention_t,delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2,mat_1_t)

    ee = (exp(nu_bern)/(1+exp(nu_bern)))
    vec_1 .= vec_1 .* ee .+ vec_2 .* (1 - ee)
    return vec_1
end

function main()
    n = 200000
    markets::Int64 = 200
    Random.seed!(123)
    market_ids = repeat(1:1000,markets)
    market_ids_t = transpose(market_ids)
    delta = rand(Normal(1,5), n)
    log_share = rand(Normal(1,5), n)
    log_share_outside = rand(Normal(1,5), n)

    nu_bern = -1
    randvar_nu = randn(n,100)*5 .+ 6
    randvar_nu_t = transpose(randvar_nu)
    randvar_nu_inattention= randn(n,100)*5 .+ 6.5
    randvar_nu_inattention_t = transpose(randvar_nu_inattention)
    mat_1 = similar(randvar_nu)
    mat_1_t = similar(randvar_nu_t)
    vec_1 = similar(delta)
    vec_2 = similar(delta)

    s5t = predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    s7t = predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    @assert s5t≈s7t

    @btime predict_shares_bern_5_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $market_ids, $nu_bern)
    @btime predict_shares_bern_7_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $vec_2, $market_ids, $nu_bern)

    nothing
end

Hope it helps!


Those @inbounds outside the function definition do not work.

julia> @inbounds get(x,i) = x[i]
get (generic function with 1 method)

julia> get2(x,i) = @inbounds x[i]
get2 (generic function with 1 method)

julia> x = rand(3);

julia> @code_llvm debuginfo=:none get(x, 1)
; Function Signature: get(Array{Float64, 1}, Int64)
define double @julia_get_2224({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
  %0 = add i64 %"i::Int64", -1
  %1 = bitcast {}* %"x::Array" to { i8*, i64, i16, i16, i32 }*
  %.length_ptr = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
  %.length = load i64, i64* %.length_ptr, align 8
  %inbounds = icmp ult i64 %0, %.length
  br i1 %inbounds, label %idxend, label %oob

oob:                                              ; preds = %top
  %errorbox = alloca i64, align 8
  store i64 %"i::Int64", i64* %errorbox, align 8
  call void @ijl_bounds_error_ints({}* %"x::Array", i64* nonnull %errorbox, i64 1)
  unreachable

idxend:                                           ; preds = %top
  %2 = bitcast {}* %"x::Array" to double**
  %.data1 = load double*, double** %2, align 8
  %3 = getelementptr inbounds double, double* %.data1, i64 %0
  %.ref = load double, double* %3, align 8
  ret double %.ref
}

julia> @code_llvm debuginfo=:none get2(x, 1)
; Function Signature: get2(Array{Float64, 1}, Int64)
define double @julia_get2_2315({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
  %0 = add i64 %"i::Int64", -1
  %1 = bitcast {}* %"x::Array" to double**
  %.data1 = load double*, double** %1, align 8
  %2 = getelementptr inbounds double, double* %.data1, i64 %0
  %.ref = load double, double* %2, align 8
  ret double %.ref
}

Thanks a ton. I only have a problem with

@views @inbounds function foo2_t!(num,market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:,id])
        num[:,id] ./= s+1
    end
end

This introduces deviations. Do you know why?

i.e., compare this

function foo1_t!(num,market_ids)
    @inbounds for id ∈ market_ids
        @views s = sum(num[:,id])
        @views num[:,id] ./= s+1
    end
end
@views @inbounds function foo2_t!(num,market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:,id])
        num[:,id] ./= s+1
    end
end
mat_1_t = similar(randvar_nu_t)
mat_2_t = similar(randvar_nu_t)
bar_t!(mat_1_t,randvar_nu_t,delta)
bar_t!(mat_2_t,randvar_nu_t,delta)
mat_1_t == mat_2_t
foo1_t!(mat_1_t, market_ids)
foo2_t!(mat_2_t, market_ids)
mat_1_t == mat_2_t

You can see that the two results are not the same at the end, although they are still equal right after the bar_t!() calls.

It does not matter too much; you helped me a ton! I just thought it was weird that Threads.@threads somehow changes the result.

Are all market_ids unique? If not, you can’t Threads.@threads them without special care.
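In main(), market_ids = repeat(1:1000, markets), so every id appears 200 times: several tasks then read and write the same column concurrently (and even the serial loop normalizes each column repeatedly). Deduplicating first makes each column the responsibility of exactly one task; a sketch (normalize_cols! is a hypothetical stand-in for foo_t!):

```julia
function normalize_cols!(num, market_ids)
    Threads.@threads for id in unique(market_ids)   # each column touched exactly once
        s = sum(@view num[:, id])
        @views num[:, id] ./= s + 1
    end
    return num
end

num = ones(2, 3)
normalize_cols!(num, [1, 2, 3, 1, 2])   # duplicate ids collapse to 1:3
```

Because every column is handled by exactly one task, the result no longer depends on the number of threads.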

Also, to repeat my earlier comment that was ignored: those @inbounds are not doing anything, because @inbounds does not penetrate function boundaries.
If you want @inbounds, move it inside the function body.

Note also that Threads.@threads introduces a function boundary (a closure).
So move it inside the for loop (if you don’t simply delete it altogether).
