Is there a way to avoid `materialize` and `copy`?


#1

I have a function createLLHandGRAD(data, kw...) which returns several callbacks.
One of them is the LLH function

function LLH(pars)
    # get a vector of parameters
    @inbounds res = sum(log(COHSQ(EXTND(data[e,:]).*pars)) for e in 1:Nd)
    bilin = Nd * (pars ⋅ (BM * pars)) # BM is a matrix
    - res + bilin # just a number
end

By profiling, I realized that in addition to the calculation time (first tower) and `_getindex` (last tower),
40% of the time is spent in `materialize` and `copy`.

Is it something one can expect? Do you have any performance tips for me?

Thank you in advance.


#2

I would guess data[e,:] is costly. Not only are you allocating the slice data[e,:], you are also accessing the data in a suboptimal pattern. Accessing it as data[:,e] would be more efficient, since Julia uses column-major storage. Would it be possible to transpose the data array?

To avoid allocation once the access pattern is changed, try @view(data[:,e])

Edit: The problem with the access pattern is probably only significant if data is larger than your smallest cache size. The allocation is probably always an issue. It’s hard to say anything more without knowing what EXTND and COHSQ are doing.
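A minimal sketch of the access-pattern point, assuming `data` is a plain `Matrix{Float64}` (the names and sizes below are invented for illustration):

```julia
data = rand(4, 3)             # hypothetical matrix: 4 events, 3 components each
tdata = collect(data')        # transpose once; column e now holds event e

row_copy = data[2, :]         # allocates a copy, with strided memory access
col_view = @view tdata[:, 2]  # no allocation, contiguous column access

@assert row_copy == col_view  # same values, but the view is cheaper to produce
```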


#3

Thank you for the reply. Cool, I believe it will help.
The data is a huge array, ~100×1e6.
I will report as soon as I test it.


#4

This is hard to fix without more information about the variables involved.

data, Nd, and BM are not input arguments. Are they globals? That could be bad. In general, it’s advantageous to pass everything as input arguments to your function.

You are creating a lot of temporary arrays. Try to use fused broadcasting, where appropriate:

sum(log.(COHSQ.(EXTND.(data[e,:]).*pars)) for e in 1:Nd)

and maybe views instead of slices.

It is possible to get good performance with vectorized expressions, but it is also easy to fail. Try to rewrite the res expression using loops, and you might get dramatic speed-ups.
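As a generic illustration of that point (the `f`, `x`, `y` below are made up, not the thread's COHSQ/EXTND): a fused broadcast reduction still allocates the broadcast result, while an explicit loop allocates nothing:

```julia
f(t) = t^2   # stand-in for whatever per-element work is needed

function loop_sum(x, y)
    acc = 0.0
    @inbounds for i in eachindex(x, y)
        acc += f(x[i] * y[i])   # no temporary arrays at all
    end
    return acc
end

x, y = rand(100), rand(100)
@assert loop_sum(x, y) ≈ sum(f.(x .* y))  # same value, zero allocations in the loop
```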

pars ⋅ (BM * pars) might be inefficient. I would guess there’s a BLAS call that efficiently calculates x'*A*x type expressions, though I’m not too well versed in BLAS gymnastics.

Edit: Not sure about the BLAS thing now, but, at least on my computer, it’s faster to do

pars' * BM * pars
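For what it’s worth, on Julia ≥ 1.4 there is also a three-argument `dot` in LinearAlgebra that evaluates x'*A*y without forming the intermediate vector; a small sketch with made-up values:

```julia
using LinearAlgebra

pars = [1.0, 2.0]
BM   = [3.0 1.0; 1.0 2.0]

a = pars ⋅ (BM * pars)    # allocates the temporary vector BM * pars
b = pars' * BM * pars     # same value; often faster
c = dot(pars, BM, pars)   # 3-arg dot (Julia >= 1.4): no temporary at all

@assert a ≈ b ≈ c
```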

#5

Oh, and I would strongly advise you to use explicit return statements:

return -res + bilin

I read your code several times, thinking that the last line was a continuation of the expression on the second-to-last line. Implicit return statements can be very confusing.


#6

Thank you for the replies, and sorry for not providing enough details.

  • EXTND(v) creates a new vector by picking components of v
  • COHSQ(v) calculates a bilinear form of v

To the comment

As far as I understand, fused broadcasting applies to element-wise operations. I do not see how to translate my operations EXTND and COHSQ into ones that could be broadcast.


#7

It is really very hard to say anything at all without knowing the specifics of COHSQ and EXTND.

In general, you will get much more accurate help if it is actually possible to run your code.


#8

Well, I just had to take a guess.


#9

Thank you for the advice. The code is not that simple; I was afraid that nobody would bother to look.
Anyway, the code is on GitHub, indeed.

I will finish my tests and try to summarize with a simple example.


#10

The suggestions about memory access and @view helped.
I have gone from 300 ms to 260 ms for the likelihood function
and from 1.12 s to 0.959 s for the gradient, which is something already.

Copying and materializing still make up quite a large part of the total time.

For some reason, the Juno profiler does not tell me where exactly the time is spent in my code, pointing directly to broadcast.jl instead. I should probably investigate other profiling methods.


#11

I think you’re getting distracted by the names here. They’re not just re-arranging memory; they’re actually doing the execution of the broadcast. They really are a part of your calculation. The part there that’s not is `similar`: that’s just allocating the space for the (likely intermediate) result. I’m a little surprised it’s so big compared to the execution of the broadcast, but it looks like your broadcast is just a single `.*`, and I’m guessing that there’s also lots of GC time collecting temporaries that ends up getting attributed to `similar`. But hey, I suppose I shouldn’t be surprised, since I still don’t know what code you’re running.

How to fix it? Do the broadcast in-place into a pre-allocated scratch buffer (with `.=`). Or work to make things fuse further. Or use more laziness to do the multiplication “on the fly” as you need it, instead of allocating a temporary.
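A minimal sketch of the pre-allocated-buffer idea (`work` is an invented name for the scratch vector, and the loop stands in for the loop over events):

```julia
pars = rand(10)
v    = rand(10)

work = similar(pars)   # allocate the scratch space once, outside the hot loop
for _ in 1:3           # stand-in for the loop over events
    work .= v .* pars  # .= writes into `work`: no new array per iteration
    # ... consume `work` here, e.g. feed it to the next step ...
end
```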

Have you seen:

https://docs.julialang.org/en/v1/manual/performance-tips/


#12

Significant time spent allocating large temporaries is something I have also observed in other contexts.
If you @time (or @btime) your operation, how much time is spent in GC?
There are also one or more issues about this: https://github.com/JuliaLang/julia/pull/29290


#13

Thank you for your replies. I think I got an idea in which direction to research.
I should also come up with a simple example.

Benchmarking:

julia> @benchmark LLH($(pars0))
  memory estimate:  385.79 MiB
  allocs estimate:  1705852
  --------------
  minimum time:     212.383 ms (14.96% GC)
  median time:      225.188 ms (15.65% GC)
  mean time:        225.274 ms (15.46% GC)
  maximum time:     237.897 ms (14.86% GC)
  --------------
  samples:          23
  evals/sample:     1
julia> LLH(pars0)
  0.309661 seconds (1.71 M allocations: 385.791 MiB, 18.37% gc time)
elapsed time (ns): 309661086
gc time (ns):      56893873
bytes allocated:   404530672
pool allocs:       1584010
non-pool GC allocs:121846
GC pauses:         17

#14

As a start I would suggest creating a version of EXTND! that takes as input a temp variable where the output should be stored.

Changing cohsq to pass abs2 as the function argument of sum should also help, preventing it from creating another temporary to store the abs2 values:

function cohsq(X, pblocks)
    sum(abs2, (sum(@view(X[bl]))) for bl in pblocks)
end

Again in LLH, use either `sum(log.(` or `sum(log,`, although if Nd isn’t big, it might not make much of a difference.
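The difference between the two forms, in a toy example (values invented):

```julia
x = [1.0, 2.0, 4.0]

a = sum(log.(x))   # materializes the vector log.(x) before summing
b = sum(log, x)    # applies log inside the reduction: no temporary vector

@assert a ≈ b
```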


#15

Wow! It gained a factor of 2 in speed and produces 5-6 times less garbage.
Thank you, Master!

  memory estimate:  32.54 MiB
  allocs estimate:  1401242
  --------------
  minimum time:     135.111 ms (2.09% GC)
  median time:      147.920 ms (3.48% GC)
  mean time:        150.981 ms (3.39% GC)
  maximum time:     171.962 ms (6.02% GC)
  --------------
  samples:          34
  evals/sample:     1

New LLH summation:

    function LLH(pars)
        v = EXTND(@view(transposedPsiDT[:,1]))
        res = sum(log,
            let
                EXTND!(v,@view(transposedPsiDT[:,e]))
                v .*= pars
                COHSQ(v)
            end for e in 1:Nd)
        bilin = Nd * (pars' * BM * pars)
        return -res + bilin
    end

#16

For some reason, the Juno profiler does not tell me where exactly the time is spent in my code, pointing directly to broadcast.jl instead. I should probably investigate other profiling methods.

You may be interested in trying StatProfilerHTML, which gives line-by-line explorable code. (Disclosure: I wrote the package.)

It’s available through add StatProfilerHTML. If you’re on Julia 1.0 and if you use macros, it’s a bit hampered by a line number bug in Julia itself, but a fix should soon be available in the next point release.


#17

Now you still allocate and clear 1.4 million objects with a mean size of 24 bytes. So, these must be mostly tiny wrapper objects.

I assume pars is your shorthand for “various parameters I don’t want to post on discourse”? Or are Nd, transposedPsiDT and BM globals? In that case, consider passing them as parameters. Type-instability in LLH is relatively benign, since you only execute very few expensive operations, but this would clear up the profiles a lot.

If they are closures (LLH is an inner function), then the same applies: closures without boxes are tricky, so it is often better to just forget that inner functions are a feature that Julia supports.
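A hedged sketch of the box problem, with invented names (`make_boxed`/`make_clean` are not from this thread): a captured variable that is reassigned anywhere in the outer function gets wrapped in a `Core.Box`, making its accesses type-unstable inside the closure.

```julia
function make_boxed(data)
    Nd = size(data, 2)
    Nd = Nd                      # a second assignment: Nd is now boxed below
    LLH(pars) = Nd * sum(pars)
    return LLH
end

function make_clean(data)
    Nd = size(data, 2)           # assigned exactly once: captured without a box
    LLH(pars) = Nd * sum(pars)
    return LLH
end

data = rand(2, 3)
@assert make_boxed(data)([1.0, 1.0]) == make_clean(data)([1.0, 1.0]) == 6.0
```

Both versions compute the same number, but `@code_warntype` on the boxed one shows `Core.Box` annotations, which is exactly the kind of thing that clutters profiles.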


#18

Thank you for the comment.

With the last improvements (got rid of EXTND), the LLH call is 80 ms:

    function LLH(pars)
        res = sum(log,
                sum(abs2,
                    @view(transposedExtendedPsiDT[bl,e]) ⋅ @view(pars[bl])
                for bl in pblocks)
                    for e in 1:Nd)
        bilin = Nd * (pars' * BM * pars)
        return -res + bilin
    end

however, there are still 1.5M allocations. I guess it is the internals of sum…(?)

pars is a vector of Float64. Nd, transposedPsiDT and BM are local variables of the function which returns the closure LLH?

What do you mean by boxes?
It is so convenient to create a likelihood function for a given probability density and data set. How else could I achieve that?


#19

You can just click on one of the boxes in the Profiler Pane and that’ll take you to the source code, which will allow you to figure out what exactly is taking a long time:


#20

That is right. I use this nice tool every day.
I just sometimes have the impression that it skips some intermediate blocks and points me directly to internal Base modules, which I do not usually touch, and then it is not straightforward to figure out what goes wrong.