I have a function createLLHandGRAD(data, kw...) which returns several callbacks.
One of them is the LLH function:
function LLH(pars)
    # get a vector of parameters
    @inbounds res = sum(log(COHSQ(EXTND(data[e,:]).*pars)) for e in 1:Nd)
    bilin = Nd * (pars ⋅ (BM * pars)) # BM is a matrix
    -res + bilin # just a number
end
By profiling, I realized that, in addition to the calculation time (first tower in the profile) and _getindex (last tower),
40% of the time is spent in materialize and copy.
I would guess data[e,:] is costly. Not only are you allocating the slice data[e,:], you are also accessing the data in a suboptimal pattern. Accessing it as data[:,e] would be more efficient, since Julia uses column-major storage. Would it be possible to transpose the data array?
To avoid allocation once the access pattern is changed, try @view(data[:,e])
Edit: The problem with the access pattern is probably only significant if data is larger than your smallest cache size. The allocation is probably always an issue. It’s hard to say anything more without knowing what EXTND and COHSQ are doing.
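A minimal sketch of the transposed, column-wise access (tdata and Npar are illustrative names, not from the original code):

tdata = collect(transpose(data))   # one-time transpose: Npar × Nd, one column per event
for e in 1:Nd
    col = @view tdata[:, e]        # contiguous in memory, no copy
    # ... use col wherever data[e, :] was used ...
end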
This is hard to fix without more information about the variables involved.
data, Nd, and BM are not input arguments. Are they globals? That could be bad. In general, it’s advantageous to pass everything as input arguments to your function.
You are creating a lot of temporary arrays. Try to use fused broadcasting, where appropriate:
sum(log.(COHSQ.(EXTND.(data[e,:]).*pars)) for e in 1:Nd)
and maybe views instead of slices.
It is possible to get good performance with vectorized expressions, but it is also easy to fail. Try to rewrite the res expression using loops, and you might get dramatic speed-ups.
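For instance, a loop-based version could look roughly like this (a sketch only: it assumes EXTND returns a fresh vector and COHSQ returns a scalar, and it passes everything in as arguments):

using LinearAlgebra

function LLH_loop(pars, data, Nd, BM)
    res = 0.0
    @inbounds for e in 1:Nd
        v = EXTND(data[e, :])   # still one allocation per event; a view or an in-place EXTND would help further
        v .*= pars
        res += log(COHSQ(v))
    end
    bilin = Nd * dot(pars, BM * pars)
    return -res + bilin
end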
pars ⋅ (BM * pars) might be inefficient. I would guess there’s a BLAS call that efficiently calculates x'*A*x type expressions, though I’m not too well versed in BLAS gymnastics.
Edit: Not sure about the BLAS thing now, but, at least on my computer, it’s faster to do pars' * BM * pars.
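A quick benchmark sketch for comparing the variants (BenchmarkTools assumed; on newer Julia versions LinearAlgebra also has a three-argument dot(x, A, y) that skips the intermediate vector):

using LinearAlgebra, BenchmarkTools

BM = rand(200, 200); pars = rand(200)   # made-up sizes, just for the comparison
@btime $pars ⋅ ($BM * $pars)            # allocates the intermediate BM * pars
@btime $pars' * $BM * $pars             # adjoint form
@btime dot($pars, $BM, $pars)           # 3-argument dot, Julia ≥ 1.4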
Oh, and I would strongly advise you to use explicit return statements:
return -res + bilin
I read your code several times, thinking that the last line was a continuation of the expression on the second-to-last line. Implicit return statements can be very confusing.
Thank you for the replies, and sorry for not providing enough details.
EXTND(v) creates a new vector by picking components of v.
COHSQ(v) calculates a bilinear form of v.
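For concreteness, hypothetical stand-ins with those signatures might look like this (illustrative only, not the real functions):

using LinearAlgebra

const picks = [1, 3, 3, 2]        # EXTND picks components of v (indices made up)
EXTND(v) = v[picks]
const Q = Symmetric(rand(4, 4))   # COHSQ evaluates a bilinear form of v (matrix made up)
COHSQ(v) = v ⋅ (Q * v)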
Regarding the comment on fused broadcasting:
As far as I understand, fused broadcasting is for element-wise operations. I do not see how to translate my operations EXTND and COHSQ into ones that could be broadcast.
The suggestions about memory access and @view helped.
I went from 300 ms to 260 ms for the likelihood function
and from 1.12 s to 0.959 s for the gradient, which is something already.
Copying and materializing still make up quite some part of the total time.
For some reason, the Juno profiler does not tell me where exactly the time is spent in my code; it points directly at broadcast.jl. I should probably investigate other profiling methods.
I think you’re getting distracted by the names here. materialize and copy are not just re-arranging memory; they are actually executing the broadcast, so they really are a part of your calculation. The part that is not is similar: that just allocates the space for the (likely intermediate) result. I’m a little surprised it’s so big compared to the execution of the broadcast, but it looks like your broadcast is just a single .*, and I’m guessing there’s also lots of GC time collecting temporaries that ends up getting attributed to similar. But hey, I suppose I shouldn’t be surprised, since I still don’t know what code you’re running.
How to fix it? Do the broadcast in-place into a pre-allocated scratch buffer (with .=). Or work to make things fuse further. Or use more laziness, doing the multiplication “on the fly” as you need it instead of allocating a temporary.
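A minimal sketch of the scratch-buffer idea (buffer name and loop shape are illustrative; EXTND and the data layout are whatever your real code uses):

buf = similar(pars)                           # allocated once, reused for every event
for e in 1:Nd
    buf .= EXTND(@view(data[:, e])) .* pars   # .= writes the product straight into buf, no per-event product array
    # an in-place EXTND!(buf, ...) would also remove the allocation inside EXTND itself
    # ... log(COHSQ(buf)) ...
end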
Wow! That gained a factor of 2 in speed and produces 5-6 times less garbage.
Thank you, Master!
memory estimate: 32.54 MiB
allocs estimate: 1401242
--------------
minimum time: 135.111 ms (2.09% GC)
median time: 147.920 ms (3.48% GC)
mean time: 150.981 ms (3.39% GC)
maximum time: 171.962 ms (6.02% GC)
--------------
samples: 34
evals/sample: 1
New LLH summation:
function LLH(pars)
    v = EXTND(@view(transposedPsiDT[:,1]))   # pre-allocate the scratch buffer once
    res = sum(log,
        let
            EXTND!(v, @view(transposedPsiDT[:,e]))   # refill the buffer in place
            v .*= pars
            COHSQ(v)
        end for e in 1:Nd)
    bilin = Nd * (pars' * BM * pars)
    return -res + bilin
end
For some reason, the Juno profiler does not tell me where exactly the time is spent in my code; it points directly at broadcast.jl. I should probably investigate other profiling methods.
It’s available through add StatProfilerHTML. If you’re on Julia 1.0 and if you use macros, it’s a bit hampered by a line number bug in Julia itself, but a fix should soon be available in the next point release.
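If I remember its API correctly, usage is roughly:

using Profile, StatProfilerHTML
@profile LLH(pars)   # collect samples with the built-in sampling profiler
statprofilehtml()    # write an HTML report of the collected samples, viewable in a browser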
Now you still allocate and clear 1.4 million objects with a mean size of 24 bytes. So, these must be mostly tiny wrapper objects.
I assume pars is your shorthand for “various parameters I don’t want to post on Discourse”? Or are Nd, transposedPsiDT and BM globals? In that case, consider passing them as parameters. Type instability in LLH is relatively benign, since you only execute very few expensive operations, but this would clear up the profiles a lot.
If they are captured in a closure (LLH is an inner function), then the same applies: closures without boxes are tricky to get right, so it is often better to just forget that inner functions are a feature that Julia supports.
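To illustrate the box problem with a toy example (nothing to do with your actual functions): if a captured variable is assigned more than once, it gets wrapped in a Core.Box and the closure becomes type-unstable.

function make_llh_boxed(flag)
    Nd = 10
    if flag
        Nd = 20                    # second assignment: Nd is captured in a Core.Box
    end
    llh(pars) = Nd * sum(pars)     # Nd is type-unstable inside llh
    return llh
end

function make_llh(flag)
    Nd = flag ? 20 : 10            # assigned exactly once: captured without a box
    llh(pars) = Nd * sum(pars)
    return llh
end

@code_warntype make_llh_boxed(true)(rand(3)) shows the Core.Box, while the second version is clean; passing Nd and friends as explicit function arguments sidesteps the issue entirely.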
With the last improvements (I got rid of EXTND), the LLH call is down to 80 ms:
function LLH(pars)
    res = sum(log,
        sum(abs2,
            @view(transposedExtendedPsiDT[bl,e]) ⋅ @view(pars[bl])
            for bl in pblocks)
        for e in 1:Nd)
    bilin = Nd * (pars' * BM * pars)
    return -res + bilin
end
However, there are still 1.5M allocations. I guess it is something internal to sum…(?)
pars is a vector of Float64. Nd, transposedPsiDT and BM are local variables of the function that returns the closure LLH.
What do you mean by boxes?
It is so convenient to create a likelihood function for a given probability density and data set. How else could I achieve that?
You can just click on one of the boxes in the Profiler Pane and that’ll take you to the source code, which will allow you to figure out what exactly is taking a long time.
That is right, I use this nice tool every day.
I just sometimes have the impression that it skips some intermediate blocks and points me directly to internal Base functions and modules which I do not usually touch, and then it is not straightforward to figure out what is going wrong.