Thanks, I am going to watch some videos about it as well.
I do not know if you care, but Laurent’s function below is more accurate, in the sense that the result is always the same. If I replace Laurent’s foo function with Elrod’s foo_polyester! version, my function produces slightly different results every time I run it.
That being said, Elrod’s function improves speed from 75 ms to 60 ms, but I need the accuracy.
Here is an analogy that I use in my lectures: let’s imagine a tailoring workshop where n_c seamstresses make clothes and n_t carriers bring the fabric and deliver the clothes.
Two questions :
If the seamstresses are often out of work because the fabric does not arrive or the finished clothes are slow to be delivered, should we add seamstresses? (memory bound)
If a given type of clothing requires a lot of work time and little fabric, and all the seamstresses are used to the maximum, should we add seamstresses? (compute bound)
Conclusion #1 :
Parallelizable algorithms are characterized by their arithmetic intensity, which measures the ratio of the amount of work to the amount of data read from and written to RAM.
Conclusion #2: the exponential is the wedding dress of parallelism
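The two conclusions can be sketched with a toy pair of kernels (hypothetical names, not from this thread): both stream the same number of bytes per element, but they differ in the work done per byte.

```julia
# Two kernels with identical RAM traffic (read x, write y: 16 bytes per
# element) but very different arithmetic intensity.
low_intensity!(y, x)  = (y .= 2.0 .* x .+ 1.0)  # ~2 flops/element: memory bound,
                                                # extra threads are idle seamstresses
high_intensity!(y, x) = (y .= exp.(x))          # tens of flops/element: compute
                                                # bound, extra threads pay off

x = rand(10^6)
y = similar(x)
low_intensity!(y, x)
high_intensity!(y, x)
```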
Hahaha, I love that analogy; it makes a lot of sense to me. So arithmetic intensity depends on the function and on the number of threads/cores? Then I will try to measure the intensity and maybe I can find a way to increase it.
The A.I. depends only on the function (not on the threads/cores, which are the seamstresses).
For a given algorithm/function one must minimize the RAM traffic and reuse the data that fit in cache (spatial and temporal data locality).
One should also try to group the computation in a minimum set of parallel sections (kernel fusion).
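Kernel fusion as described above can be sketched like this (a minimal, hypothetical example; `unfused!`/`fused!` are my names, not the thread’s functions):

```julia
# Two parallel sections: x is swept once, then y is swept again,
# so the intermediate array round-trips through RAM.
function unfused!(y, z, x)
    Threads.@threads for i in eachindex(x)
        y[i] = exp(x[i])
    end
    Threads.@threads for i in eachindex(x)
        z[i] = y[i] / (1 + y[i])
    end
end

# One fused parallel section: x is read once and the intermediate
# value stays in a register.
function fused!(y, z, x)
    Threads.@threads for i in eachindex(x)
        e = exp(x[i])
        y[i] = e
        z[i] = e / (1 + e)
    end
end
```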
I don’t know if it will help speed, but as a style note (my style): @view is typically used line by line to indicate a single view, while @views (plural) transforms all slices in a code block into views. So you could remove all the @views inside the function and instead declare that the whole thing uses views with @views function predict_shares_bern(..., unless there is a slice that you do want to make a copy of.
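A minimal sketch of the two styles (hypothetical `colsum_*` helpers, not the thread’s function):

```julia
# Per-slice style: mark each individual slice with @view.
colsum_a(A, j) = sum(@view A[:, j])

# Block style: @views on the definition turns every slice in the
# body into a view, so no slice allocates a copy.
@views function colsum_b(A, j)
    return sum(A[:, j])   # A[:, j] is a view here
end
```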
Try eliminating the common subexpression (e = exp(nu_bern)) manually; I don’t think that Julia is able to do that yet.
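The manual elimination looks like this (a sketch with made-up names; the weights mirror the thread’s blending line):

```julia
# exp(nu) appears three times and is recomputed each time.
blend_naive(v1, v2, nu) =
    v1 * (exp(nu) / (1 + exp(nu))) + v2 * (1 - exp(nu) / (1 + exp(nu)))

# Hoist the common subexpression: one exp call, one weight.
function blend_cse(v1, v2, nu)
    e = exp(nu)
    w = e / (1 + e)
    return v1 * w + v2 * (1 - w)
end
```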
I just submitted an issue on GitHub, btw:
Note that 1 - exp(x)/(1 + exp(x)) is equivalent to simply 1/(1 + exp(x)).
The latter is also more accurate (especially for large positive x, where the subtraction form cancels to zero).
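A small demonstration of the rounding difference (the names are mine):

```julia
f_sub(x)   = 1 - exp(x) / (1 + exp(x))
f_recip(x) = 1 / (1 + exp(x))

# At x = 40, exp(x)/(1 + exp(x)) rounds to exactly 1.0, so the subtraction
# form cancels to 0.0, while the reciprocal form keeps the tiny true value.
f_sub(40.0)    # → 0.0
f_recip(40.0)  # ≈ 4.25e-18
```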
Ok, got it, thanks!
Ok, thanks both!
Thanks, will look into it!
I know this isn’t an educational forum per se, but would you be willing to elaborate on “use $ to interpolate @benchmark arguments”, by any chance?
Your question is fine.
First see the Avoid untyped global variables section in the performance tips for context.
Regarding @benchmark specifically, see: Interpolating values into benchmark expressions and Quick Start.
Expression interpolation itself is documented in Julia’s Manual in the page on Metaprogramming: Interpolation.
But note that interpolation is only necessary if your benchmark depends on global variables: depending on, e.g., constants and functions is fine.
Also interpolating variables into the benchmark typically only matters if your function takes <100ns or so.
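For concreteness, a sketch of the `$` splicing (only the variable name `xs` is made up):

```julia
using BenchmarkTools

xs = rand(1000)   # an untyped global in the REPL / in a script

# Without $, the benchmarked expression looks up the untyped global `xs`
# on every evaluation, so the timing includes that overhead. With $, the
# value is spliced in and benchmarked as a typed local, so only the cost
# of sum itself is measured.
@btime sum(xs)
@btime sum($xs)
```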
On my laptop (Apple M1 Max) this version is more than 3× faster for n = 200000 and looks accurate:
julia> main()
93.055 ms (110 allocations: 7.64 MiB) #predict_shares_bern_5_t
22.784 ms (306 allocations: 4.60 MiB) #predict_shares_bern_7_t
@views @inbounds function foo_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end

@inbounds function bar_t!(mat_1_t, randvar_nu_t, delta)
    ni, nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
        end
    end
end

@views @inbounds function mean_t!(vec_1, mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:, j])
    end
end

function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t, randvar_nu_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1, mat_1_t)
    bar_t!(mat_1_t, randvar_nu_inattention_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2, mat_1_t)
    ee = exp(nu_bern) / (1 + exp(nu_bern))
    vec_1 .= vec_1 .* ee + vec_2 .* (1 - ee)
    return vec_1
end
Complete code:
using BenchmarkTools, Random, Distributions, ProfileView, LoopVectorization
function predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    ni = size(mat_1_t, 1)
    nj = size(mat_1_t, 2)
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
    end
    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:, id])
        for i ∈ 1:ni
            mat_1_t[i, id] = mat_1_t[i, id] / (sumj + 1)
        end
    end
    vec_1 .= vec(mean(mat_1_t, dims = 1))
    @tturbo for i in 1:ni, j in 1:nj
        mat_1_t[i, j] = exp(randvar_nu_inattention_t[i, j] + delta[j])
    end
    Threads.@threads for id in market_ids
        @views sumj = sum(mat_1_t[:, id])
        for i ∈ 1:ni
            mat_1_t[i, id] = mat_1_t[i, id] / (sumj + 1)
        end
    end
    share = (vec_1 .= vec_1 .* (exp(nu_bern)/(1+exp(nu_bern))) + vec(mean(mat_1_t, dims = 1)) .* (1 - (exp(nu_bern)/(1+exp(nu_bern)))))
    return share
end

@views @inbounds function foo_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end

@inbounds function bar_t!(mat_1_t, randvar_nu_t, delta)
    ni, nj = size(mat_1_t)
    Threads.@threads for j ∈ 1:nj
        @turbo for i ∈ 1:ni
            mat_1_t[i, j] = exp(randvar_nu_t[i, j] + delta[j])
        end
    end
end

@views @inbounds function mean_t!(vec_1, mat_1_t)
    Threads.@threads for j ∈ eachindex(vec_1)
        vec_1[j] = mean(mat_1_t[:, j])
    end
end

function predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    bar_t!(mat_1_t, randvar_nu_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_1, mat_1_t)
    bar_t!(mat_1_t, randvar_nu_inattention_t, delta)
    foo_t!(mat_1_t, market_ids)
    mean_t!(vec_2, mat_1_t)
    ee = exp(nu_bern) / (1 + exp(nu_bern))
    vec_1 .= vec_1 .* ee + vec_2 .* (1 - ee)
    return vec_1
end

function main()
    n = 200000
    markets::Int64 = 200
    Random.seed!(123)
    market_ids = repeat(1:1000, markets)
    market_ids_t = transpose(market_ids)
    delta = rand(Normal(1, 5), n)
    log_share = rand(Normal(1, 5), n)
    log_share_outside = rand(Normal(1, 5), n)
    nu_bern = -1
    randvar_nu = randn(n, 100) * 5 .+ 6
    randvar_nu_t = transpose(randvar_nu)
    randvar_nu_inattention = randn(n, 100) * 5 .+ 6.5
    randvar_nu_inattention_t = transpose(randvar_nu_inattention)
    mat_1 = similar(randvar_nu)
    mat_1_t = similar(randvar_nu_t)
    vec_1 = similar(delta)
    vec_2 = similar(delta)
    s5t = predict_shares_bern_5_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, market_ids, nu_bern)
    s7t = predict_shares_bern_7_t(delta, randvar_nu_t, randvar_nu_inattention_t, mat_1_t, vec_1, vec_2, market_ids, nu_bern)
    @assert s5t ≈ s7t
    @btime predict_shares_bern_5_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $market_ids, $nu_bern)
    @btime predict_shares_bern_7_t($delta, $randvar_nu_t, $randvar_nu_inattention_t, $mat_1_t, $vec_1, $vec_2, $market_ids, $nu_bern)
    nothing
end
hope it helps
Those @inbounds outside the function definition do not work:
julia> @inbounds get(x,i) = x[i]
get (generic function with 1 method)
julia> get2(x,i) = @inbounds x[i]
get2 (generic function with 1 method)
julia> x = rand(3);
julia> @code_llvm debuginfo=:none get(x, 1)
; Function Signature: get(Array{Float64, 1}, Int64)
define double @julia_get_2224({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
%0 = add i64 %"i::Int64", -1
%1 = bitcast {}* %"x::Array" to { i8*, i64, i16, i16, i32 }*
%.length_ptr = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1
%.length = load i64, i64* %.length_ptr, align 8
%inbounds = icmp ult i64 %0, %.length
br i1 %inbounds, label %idxend, label %oob
oob: ; preds = %top
%errorbox = alloca i64, align 8
store i64 %"i::Int64", i64* %errorbox, align 8
call void @ijl_bounds_error_ints({}* %"x::Array", i64* nonnull %errorbox, i64 1)
unreachable
idxend: ; preds = %top
%2 = bitcast {}* %"x::Array" to double**
%.data1 = load double*, double** %2, align 8
%3 = getelementptr inbounds double, double* %.data1, i64 %0
%.ref = load double, double* %3, align 8
ret double %.ref
}
julia> @code_llvm debuginfo=:none get2(x, 1)
; Function Signature: get2(Array{Float64, 1}, Int64)
define double @julia_get2_2315({}* noundef nonnull align 16 dereferenceable(40) %"x::Array", i64 signext %"i::Int64") #0 {
top:
%0 = add i64 %"i::Int64", -1
%1 = bitcast {}* %"x::Array" to double**
%.data1 = load double*, double** %1, align 8
%2 = getelementptr inbounds double, double* %.data1, i64 %0
%.ref = load double, double* %2, align 8
ret double %.ref
}
Thanks a ton. I only have a problem with
@views @inbounds function foo2_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end
This introduces deviations. Do you know why?
i.e., compare this
function foo1_t!(num, market_ids)
    @inbounds for id ∈ market_ids
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end

@views @inbounds function foo2_t!(num, market_ids)
    Threads.@threads for id ∈ market_ids
        s = sum(num[:, id])
        num[:, id] ./= s + 1
    end
end
mat_1_t = similar(randvar_nu_t)
mat_2_t = similar(randvar_nu_t)
bar_t!(mat_1_t,randvar_nu_t,delta)
bar_t!(mat_2_t,randvar_nu_t,delta)
mat_1_t == mat_2_t
foo1_t!(mat_1_t, market_ids)
foo2_t!(mat_2_t, market_ids)
mat_1_t == mat_2_t
You can see that the two results are not the same at the end, but they are equal after the bar_t!() call.
It does not matter too much; you helped me a ton! I just thought it was weird that Threads.@threads somehow changes the result.
Are all market_ids unique? If not, you can’t Threads.@threads over them without special care.
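In the posted main(), market_ids = repeat(1:1000, markets), so every id occurs many times; two threads can then normalize the same column concurrently, which is a data race. A race-free sketch (assuming each column is meant to be normalized exactly once; `foo_safe!` is my name) threads over the unique ids instead:

```julia
function foo_safe!(num, market_ids)
    # unique() removes the duplicates, so no two iterations
    # (and hence no two threads) touch the same column.
    Threads.@threads for id in unique(market_ids)
        @views s = sum(num[:, id])
        @views num[:, id] ./= s + 1
    end
end
```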
Also, to repeat my earlier comment, which went ignored: those @inbounds are not doing anything, because @inbounds does not penetrate function boundaries. If you want @inbounds, move it inside.
Note also that Threads.@threads introduces a function boundary (a closure), so move it inside the for loop (if you don’t simply delete it altogether).
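A sketch of the working placement (a toy function, not the thread’s code): because Threads.@threads turns the loop body into a closure, only an @inbounds inside that body reaches the indexing.

```julia
function scale_cols!(A)
    Threads.@threads for j in axes(A, 2)   # body becomes a closure
        @inbounds for i in axes(A, 1)      # so @inbounds must live inside it
            A[i, j] *= 2
        end
    end
end
```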