Is there a way to avoid `materialize` and `copy`?


#1

I have a function createLLHandGRAD(data, kw...) which returns several callbacks.
One of them is the LLH function

function LLH(pars)
    # get a vector of parameters
    @inbounds res = sum(log(COHSQ(EXTND(data[e,:]).*pars)) for e in 1:Nd)
    bilin = Nd * (pars ⋅ (BM * pars)) # BM is a matrix
    - res + bilin # just a number
end

By profiling, I realized that in addition to the calculation time (first tower) and `_getindex` (last tower),
40% of the time is spent in `materialize` and `copy`.

Is it something one can expect? Do you have any performance tips for me?

Thank you in advance.


#2

I would guess data[e,:] is costly. Not only are you allocating the slice data[e,:], you are also accessing the data in a suboptimal pattern. Accessing it as data[:,e] would be more efficient, since Julia uses column-major storage. Would it be possible to transpose the data array?

To avoid allocation once the access pattern is changed, try @view(data[:,e])

Edit: The problem with the access pattern is probably only significant if data is larger than your smallest cache size. The allocation is probably always an issue. It’s hard to say anything more without knowing what EXTND and COHSQ are doing.
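A minimal sketch of the access-pattern point, assuming `data` is a plain `Matrix{Float64}` (the names and sizes below are invented for illustration):

```julia
data = rand(4, 3)             # hypothetical matrix: 4 events, 3 components each
tdata = collect(data')        # transpose once; column e now holds event e

row_copy = data[2, :]         # allocates a copy, with strided memory access
col_view = @view tdata[:, 2]  # no allocation, contiguous column access

@assert row_copy == col_view  # same values, but the view is cheaper to produce
```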


#3

Thank you for the reply. Cool, I believe it will help.
The data is a huge array, ~100×1e6.
I will report as soon as I test it.


#4

This is hard to fix without more information about the variables involved.

data, Nd, and BM are not input arguments. Are they globals? That could be bad. In general, it’s advantageous to pass everything as input arguments to your function.

You are creating a lot of temporary arrays. Try to use fused broadcasting, where appropriate:

sum(log.(COHSQ.(EXTND.(data[e,:]).*pars)) for e in 1:Nd)

and maybe views instead of slices.

It is possible to get good performance with vectorized expressions, but it is also easy to fail. Try to rewrite the res expression using loops, and you might get dramatic speed-ups.
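As a generic illustration of that point (the `f`, `x`, `y` below are made up, not the thread's COHSQ/EXTND): a fused broadcast reduction still allocates the broadcast result, while an explicit loop allocates nothing:

```julia
f(t) = t^2   # stand-in for whatever per-element work is needed

function loop_sum(x, y)
    acc = 0.0
    @inbounds for i in eachindex(x, y)
        acc += f(x[i] * y[i])   # no temporary arrays at all
    end
    return acc
end

x, y = rand(100), rand(100)
@assert loop_sum(x, y) ≈ sum(f.(x .* y))  # same value, zero allocations in the loop
```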

pars ⋅ (BM * pars) might be inefficient. I would guess there’s a BLAS call that efficiently calculates x'*A*x type expressions, though I’m not too well versed in BLAS gymnastics.

Edit: Not sure about the BLAS thing now, but, at least on my computer, it’s faster to do

pars' * BM * pars
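For what it’s worth, on Julia ≥ 1.4 there is also a three-argument `dot` in LinearAlgebra that evaluates x'*A*y without forming the intermediate vector; a small sketch with made-up values:

```julia
using LinearAlgebra

pars = [1.0, 2.0]
BM   = [3.0 1.0; 1.0 2.0]

a = pars ⋅ (BM * pars)    # allocates the temporary vector BM * pars
b = pars' * BM * pars     # same value; often faster
c = dot(pars, BM, pars)   # 3-arg dot (Julia >= 1.4): no temporary at all

@assert a ≈ b ≈ c
```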

#5

Oh, and I would strongly advise you to use explicit return statements:

return -res + bilin

I read your code several times, thinking that the last line was a continuation of the expression on the second-to-last line. Implicit return statements can be very confusing.


#6

Thank you for the replies, and sorry for not providing enough details.

  • EXTND(v) creates a new vector by picking components of v
  • COHSQ(v) calculates a bilinear form of v

To the comment

As far as I understand, fused broadcasting applies to element-wise operations. I do not see how to translate my operations EXTND and COHSQ into ones that could be broadcast.


#7

It is really very hard to say anything at all without knowing the specifics of COHSQ and EXTND.

In general, you will get much more accurate help if it is actually possible to run your code.


#8

Well, I just had to take a guess.


#9

Thank you for the advice. The code is not that simple; I was afraid that nobody would bother to look.
Anyway, the code is on GitHub, indeed.

I will finish my tests and try to summarize with a simple example.


#10

The suggestions about memory access and @view helped.
I have gone from 300 ms to 260 ms for the likelihood function
and from 1.12 s to 0.959 s for the gradient, which is something already.

Copying and materializing still make up quite a large part of the total time.

For some reason, the Juno profiler does not tell me where exactly the time is spent in my code, pointing directly to broadcast.jl instead. I should probably investigate other profiling methods.


#11

I think you’re getting distracted by the names here. They’re not just re-arranging memory; they’re actually doing the execution of the broadcast. They really are a part of your calculation. The part there that’s not is `similar`: that’s just allocating the space for the (likely intermediate) result. I’m a little surprised it’s so big compared to the execution of the broadcast, but it looks like your broadcast is just a single `.*`, and I’m guessing that there’s also lots of GC time collecting temporaries that ends up getting attributed to `similar`. But hey, I suppose I shouldn’t be surprised, since I still don’t know what code you’re running.

How to fix it? Do the broadcast in-place into a pre-allocated scratch buffer (with `.=`). Or work to make things fuse further. Or use more laziness to do the multiplication “on the fly” as you need it, instead of allocating a temporary.
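A minimal sketch of the pre-allocated-buffer idea (`work` is an invented name for the scratch vector, and the loop stands in for the loop over events):

```julia
pars = rand(10)
v    = rand(10)

work = similar(pars)   # allocate the scratch space once, outside the hot loop
for _ in 1:3           # stand-in for the loop over events
    work .= v .* pars  # .= writes into `work`: no new array per iteration
    # ... consume `work` here, e.g. feed it to the next step ...
end
```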

Have you seen:

https://docs.julialang.org/en/v1/manual/performance-tips/


#12

Significant time spent allocating large temporaries is something I have also observed in other contexts.
If you @time (or @btime) your operation, how much time is spent in GC?
There are also one or more issues about this: https://github.com/JuliaLang/julia/pull/29290


#13

Thank you for your replies. I think I got an idea in which direction to research.
I should also come up with a simple example.

Benchmarking:

julia> @benchmark LLH($(pars0))
  memory estimate:  385.79 MiB
  allocs estimate:  1705852
  --------------
  minimum time:     212.383 ms (14.96% GC)
  median time:      225.188 ms (15.65% GC)
  mean time:        225.274 ms (15.46% GC)
  maximum time:     237.897 ms (14.86% GC)
  --------------
  samples:          23
  evals/sample:     1
julia> LLH(pars0)
  0.309661 seconds (1.71 M allocations: 385.791 MiB, 18.37% gc time)
elapsed time (ns): 309661086
gc time (ns):      56893873
bytes allocated:   404530672
pool allocs:       1584010
non-pool GC allocs:121846
GC pauses:         17

#14

As a start I would suggest creating a version of EXTND! that takes as input a temp variable where the output should be stored.

Changing cohsq to pass abs2 as the function argument of sum should also help, preventing it from creating another temporary to store the abs2 values:

function cohsq(X, pblocks)
    sum(abs2, (sum(@view(X[bl]))) for bl in pblocks)
end

Again in LLH, use either `sum(log.(` or `sum(log,`, although if Nd isn’t big, it might not make much of a difference.
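The difference between the two forms, in a toy example (values invented):

```julia
x = [1.0, 2.0, 4.0]

a = sum(log.(x))   # materializes the vector log.(x) before summing
b = sum(log, x)    # applies log inside the reduction: no temporary vector

@assert a ≈ b
```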


#15

Wow! It gained a factor of 2 in speed and produces 5-6 times less garbage.
Thank you, Master!

  memory estimate:  32.54 MiB
  allocs estimate:  1401242
  --------------
  minimum time:     135.111 ms (2.09% GC)
  median time:      147.920 ms (3.48% GC)
  mean time:        150.981 ms (3.39% GC)
  maximum time:     171.962 ms (6.02% GC)
  --------------
  samples:          34
  evals/sample:     1

New LLH summation:

    function LLH(pars)
        v = EXTND(@view(transposedPsiDT[:,1]))
        res = sum(log,
            let
                EXTND!(v,@view(transposedPsiDT[:,e]))
                v .*= pars
                COHSQ(v)
            end for e in 1:Nd)
        bilin = Nd * (pars' * BM * pars)
        return -res + bilin
    end

#16

For some reason, the Juno profiler does not tell me where exactly the time is spent in my code, pointing directly to broadcast.jl instead. I should probably investigate other profiling methods.

You may be interested in trying StatProfilerHTML, which gives line-by-line explorable code. (Disclosure: I wrote the package.)

It’s available through add StatProfilerHTML. If you’re on Julia 1.0 and if you use macros, it’s a bit hampered by a line number bug in Julia itself, but a fix should soon be available in the next point release.


#17

Now you still allocate and clear 1.4 million objects with a mean size of 24 bytes. So, these must be mostly tiny wrapper objects.

I assume pars is your shorthand for “various parameters I don’t want to post on discourse”? Or are Nd, transposedPsiDT and BM globals? In that case, consider passing them as parameters. Type-instability in LLH is relatively benign, since you only execute very few expensive operations, but this would clear up the profiles a lot.

If they are closures (LLH is an inner function), then the same applies: closures without boxes are tricky, so it is often better to just forget that inner functions are a feature that Julia supports.
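A hedged sketch of the box problem, with invented names (`make_boxed`/`make_clean` are not from this thread): a captured variable that is reassigned anywhere in the outer function gets wrapped in a `Core.Box`, making its accesses type-unstable inside the closure.

```julia
function make_boxed(data)
    Nd = size(data, 2)
    Nd = Nd                      # a second assignment: Nd is now boxed below
    LLH(pars) = Nd * sum(pars)
    return LLH
end

function make_clean(data)
    Nd = size(data, 2)           # assigned exactly once: captured without a box
    LLH(pars) = Nd * sum(pars)
    return LLH
end

data = rand(2, 3)
@assert make_boxed(data)([1.0, 1.0]) == make_clean(data)([1.0, 1.0]) == 6.0
```

Both versions compute the same number, but `@code_warntype` on the boxed one shows `Core.Box` annotations, which is exactly the kind of thing that clutters profiles.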


#18

Thank you for the comment.

With the last improvements (got rid of EXTND), the LLH call is 80 ms:

    function LLH(pars)
        res = sum(log,
                sum(abs2,
                    @view(transposedExtendedPsiDT[bl,e]) ⋅ @view(pars[bl])
                for bl in pblocks)
                    for e in 1:Nd)
        bilin = Nd * (pars' * BM * pars)
        return -res + bilin
    end

however, there are still 1.5M allocations. I guess it is the internals of sum…(?)

pars is a vector of Float64. Nd, transposedPsiDT and BM are local variables of the function which returns the closure LLH?

What do you mean by boxes?
It is so convenient to create a likelihood function for a given probability density and data set. How else could I achieve that?


#19

You can just click on one of the boxes in the Profiler Pane and that’ll take you to the source code, which will allow you to figure out what exactly is taking a long time:


#20

That is right. I use this nice tool every day.
I just sometimes have the impression that it skips some intermediate blocks and points me directly to internal Base modules, which I do not usually touch, and then it is not straightforward to figure out what goes wrong.