Optimizing performance (newcomer from Matlab)

Hi all,

I am encountering slower than hoped for performance in my usage of Julia (in comparison to Matlab). I am sure that my coding style is largely responsible for this, but would appreciate any pointers.

I am calculating the probability of observing a given set (conditional on some values and a cut-off). I have tried to abstract away from the problem as much as possible and give a MWE that’s as minimal as possible. I’ve annotated to provide some insights on what I doing. The Julia profiler and Matlab profiler differ a lot, up to the point that I find it very hard to navigate any output the profiler gives me (I’m not a programmer, just an applied user).


using Combinatorics

a                   = 5;
values              = randn(5, 15, 500)
sets                = collect(powerset(1:a))
deleteat!(sets,1)

inv_set             = collect(1:a)
p_set               = zeros(size(sets,1), 15, 500)

@time for Z in 1:size(sets,1)
    current_set             = vcat(hcat(sets[Z,:]...)...); # get the vector with the current set
    inv_current_set         = setdiff(inv_set,hcat(sets[Z,:]...)); # get the vector with inverse of current set
    first_part              = prod(1 .-exp.(-exp.(-1 .*(0 .-values[current_set,:,:]))), dims = 1); # calculate p current set
    second_part             = prod(exp.(-exp.(-1 .*(0 .-values[inv_current_set,:,:]))), dims = 1); # calculate p inv set
    third_part              = prod(exp.(-exp.(-1 .*(0 .-values[:,:,:]))), dims = 1);    # calculate p empty set
    p_set[Z,:,:]            = (first_part.*second_part) ./ (1 .- third_part); # wrap it all up
end

versioninfo() generates:
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.6.0)
CPU: Intel® Core™ i7-7920HQ CPU @ 3.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
JULIA_EDITOR = atom -a

PS: using threads speeds it up (~2x), but that’s as far as I get

Welcome!

Before going too deep into optimizing the problem head over to the performance tips which lists common pitfalls. Two things stand out: the use of non-constant globals (which are necessary to avoid for optimal performance) and using @time which can include compilation time in the results. To get more meaningful results usually wrapping things in a function and using @btime from BenchmarkTools is the preferred way to go.

1 Like

Welcome to Julia! Without having the time to look at your code, I’ll just point you to
https://docs.julialang.org/en/v1/manual/performance-tips/#man-performance-tips-1

I can definitely see that you didn’t put your code into a function.

Why write

vcat(hcat(sets[Z,:]...)...);

instead of

sets[Z]

for example?

It lives inside a function in my actual code. For the sake of the MWE I moved it out, now back in.

I’ve changed the code as suggested by kristoffer.carlsson, but it doesn’t shave off a lot of time.
See for an updated code below:


using Combinatorics

a                   = 5;
values              = randn(5,15,500)
sets                = collect(powerset(1:a))
deleteat!(sets,1)

inv_set             = collect(1:a)
p_set               = zeros(size(sets,1), 15, 500)

function MWE_function()
    for Z in 1:size(sets,1)
    current_set             = sets[Z];
    inv_current_set         = setdiff(inv_set,sets[Z]);
    first_part              = prod(1 .-exp.(-exp.(-1 .*(0 .-values[current_set,:,:]))), dims = 1);
    second_part             = prod(exp.(-exp.(-1 .*(0 .-values[inv_current_set,:,:]))), dims = 1);
    third_part              = prod(exp.(-exp.(-1 .*(0 .-values[:,:,:]))), dims = 1);
    p_set[Z,:,:]            = (first_part.*second_part) ./ (1 .- third_part);
    end
end

@time MWE_function()

This still leaves me with roughly 0.129681 seconds (3.01 k allocations: 78.161 MiB, 5.85% gc time)

I’ll have a look at the benchmarktools, but still happy with any other pointers.

You still have non-constant globals here (a, values, etc). Instead, make MWE_function take them as arguments. That way, Julia can compile a specialized version of MWE_function based on the types of those arguments, which is one way Julia gets good performance.

As we said, please look at the performance tips. For example, the first three topics are

  1. Avoid global variables (already mentioned)
  2. Measure performance with @time and pay attention to memory allocation (you got a lot of allocations there, probably some of them can be avoided by reusing memory)
  3. Tools (suggestions for a visual profiler output)

By the way, isn’t -1 .*(0 .-values[current_set,:,:]) == values[current_set,:,:]?
Also, you don’t need any of the semicolon in you code, statements don’t have to be terminated if you just have one per line.

What is the matlab time one the same computer?

Just to start a little bit without doing a deep dive. Two things, it is better to put the indices for current_set and inv_current_set as the last indices. So instead of values being 5 x 15 x 500 it is 15 x 500 x 5 and you index the last index and change to dims = 3.

Also, it seems you are computing exp on many values in values multiple times. Might as well pull that out. You can throw in some @views as well. So something like the code below should be a bit faster but it can very likely be made much faster. It depends on how much effort you are willing to put in and how ugly the code is allowed to be.

function runit()
    a                   = 5;
    values              = randn(15, 500, 5)
    sets                = collect(powerset(1:a))
    deleteat!(sets,1)

    inv_set             = collect(1:a)
    p_set               = zeros(15, 500, size(sets,1))
    exp_values = exp.(-exp.(values))

     @time for Z in 1:size(sets,1)
         current_set = sets[Z]
         inv_current_set = setdiff(inv_set, sets[Z]) # get the vector with inverse of current set

         # exp_values = exp.(-exp.(values)) # moved outside loop based on reply
        
         first_part = prod(1 .- @view exp_values[:, :, current_set]; dims=3)
         second_part = prod(@view exp_values[:, :, inv_current_set]; dims=3)
         third_part = prod(exp_values; dims=3)
         
         p_set[:,:,Z] = @. (first_part * second_part) / (1 - third_part); # wrap it all up
    end
    return p_set
end
4 Likes

Maybe move that line out of the loop, for good measure :wink:

Ah, of course. And that cuts the runtime something tremendous. The original code does a bit too much per line imo. Moving things out makes it more clear what is loop invariant and can be hoisted.

1 Like

Firstly: very impressed by the speed and the amount of pointers you have given me.

  • Moving stuff out of the loop helped a lot.
  • In my main code there are no global vars, I think that in creating the MWE I probably offended more of the performance tips than I did in my original code (not putting it in a function, using globals) - either way, lesson learnt!
  • Don’t really understand why using the last indices has better performance (I guess something with storing/accessing it
  • Don’t understand views, but will look into that
  • I was under the impression that more per line would avoid temporary storage and be faster, which makes my coding style a little bit dense
  • @Karajan, in my optimization, the "-1 .(0 .-values[current_set,:,:])" part contains a parameter rather than the 0. In the following manner: -1 .(beta .-values[current_set,:,:]). Otherwise I could indeed simplify that part.

For my use case this (for now) is already good enough of a speed-up, so thanks again to all, and in specific to @kristoffer.carlsson!

4 Likes

The reason the array index order matters is that in Julia, arrays are stored column major, meaning that looping over the last dimension keeps the indices close in memory, which makes caches efficient

1 Like

Don’t understand views, but will look into that

Every time you do something like matrix[1, :] a vector gets allocated so you can work with it. @view avoids that and points to the underlying matrix instead. For example:

julia> const a = rand(100);

julia> @btime sum(a[1:50])
  78.759 ns (1 allocation: 496 bytes)
25.72627355400477

julia> @btime sum(a)
  19.883 ns (0 allocations: 0 bytes)
50.702224176185574

a[1:50] allocates so in this example summing half the array is slower than doing the whole thing.

I was under the impression that more per line would avoid temporary storage and be faster, which makes my coding style a little bit dense

This can be true but it doesn’t have to be. For example using . to utilize loop fusion can make things faster and for this you have to write things into one line. But the above example wouldn’t be faster than

b = a[1:50]
sum(b)

Putting this all into action and trying to avoid the temporary arrays:

function runit()
    a                   = 5;
    values              = randn(15, 500, 5)
    sets                = collect(powerset(1:a))[2:end]

    inv_set             = collect(1:a)
    p_set               = zeros(15, 500, size(sets,1))

    exp_values = exp.(-exp.(values))
    third_part = prod(exp_values; dims=3)
    
    @time @inbounds for Z in 1:size(p_set, 3)
        current_set = sets[Z]
        inv_current_set = setdiff(inv_set, sets[Z]) # get the vector with inverse of current set

        for j in 1:size(p_set, 2)
            for i in 1:size(p_set, 1)
                first_part = 1.0
                for k in current_set
                    first_part *= 1 - exp_values[i, j, k]
                end
                second_part = 1.0
                for k in inv_current_set
                    second_part *= exp_values[i, j, k]
                end
                p_set[i,j,Z] = (first_part * second_part) / (1 - third_part[i, j]); # wrap it all up
            end
        end
    end
    return p_set
end

which is another ~4 times faster for me. Please make sure this still delivers the correct result, screwing up the code can happen quickly while playing around :smile:

As Oscar said, you want to loop over the first dimension in the tightest loop (which I didn’t do to preserve the dimensions Kristoffer used, but in a test it gave another 50%). You can see the difference when you swap the lines for j in 1:size(p_set, 2) and for i in 1:size(p_set, 1)

1 Like

I really like this thread. As someone moving from MATLAB to Julia, I write long scripts, dumping the entire workspace with lots of variables along the way. I was wondering, is there a book that teaches programming in a more julian way? (I have read how to think like a computer scientist, which now has a Julia version called “Think Julia” but was hoping to find something new)

The manual is actually very helpful that way. It is partially a reference, but partially a guidebook with really good pointers for beginners. For instance the sections on performance should be required reading. :slight_smile:

2 Likes

Thanks. Gosh, it’s like learning how to program again. I was very happy to see a style guide! https://docs.julialang.org/en/v1/manual/style-guide/

Could you recommend some other essential pages in the manual that could be useful for a beginner? I often find tutorials explaining the syntax but hardly anything teaching me to write good code.

While in the manual, notice the small links to source just below descriptions of various functions. Those take you straight to the source files on GitHub. Since most of Julia is written in Julia, that’s a great place to find examples.

1 Like

Not necessarily exactly what you had in mind, but if you come from a Matlab background, this chapter might be useful: Noteworthy differences from MATLAB.

Julia is quite a young language and best practices are still evolving, so books are scarce and become outdated quite quickly. I would recommend just reading code in Base, the standard libraries, and packages you use anyway, you can learn a lot from that. There are topics with similar discussions, eg

1 Like