Comparing Numba and Julia for a complex matrix computation

brenhinkeller · July 18, 2021, 8:45pm

This seems to be another case where the best optimization advice for such numeric code is just to write simple loops and arrays, indexing element-wise, and then add LoopVectorization

jroon · July 18, 2021, 8:51pm

brenhinkeller:

It looks like @jroon’s simplified and allocation-free approach also makes it easy to add both SIMD vectorization and multithreading with the @tturbo macro from @elrod’s always awesome LoopVectorization.jl:

using LoopVectorization
function computeSumHours5_lv(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height-1, x:x+width-1]
    result = 0.0

    @tturbo for x2 in 1:width
        for y2 in 1:height
            overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            # Note that we have to eliminate the branch to SIMD-vectorize:
            result += overshadowed * weights[ angle_mat[y2, x2] ]
        end
    end

    return result
end

function computeAllLines5_lv(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x in 1:size(org_sized, 2) - 1
        for y  in 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours5_lv(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

Which gives us more than a 16x speedup on an 8-core CPU!

Wow I gotta learn about @tturbo it seems

brenhinkeller · July 18, 2021, 8:55pm

Gotta go fast

brenhinkeller · July 18, 2021, 9:07pm

Basically, the magic starts with the @turbo macro (originally called @avx) which Chris Elrod wrote to take a for loop* or broadcast and emit its own custom llvm that makes good use of your CPU’s vector registers and instructions (e.g. AVX-512 and friends). More recently, they also added @tturbo, (with an extra t for “threaded”) (originally called @avxt) which uses very low overhead threading from Polyester.jl to also multithread at the same time.

*Terms and conditions: must be reorderable, must not have branches, etc.

Some recent discussion of the SIMD part on Zulip: Zulip

Ordens_Ritter · July 21, 2021, 7:21pm

Thanks @jroon for trying to simplify the code. Your made the code more simple and even faster. As for @brenhinkeller somehow I got the vague impression, that the @tturbo marco doesn’t scale so so good on some systems, but I am not completely sure yet.

I try to explain, what this example code should be an abstraction of. Also which mistake I made.

Given there is a matrix of of heights. The question is, low long is every place overshadowed. Seams like a limited submatrix of places is enough to calculate the overshadowing of a given point. I try to prepare/prepocess some matrices to simplify the actual preparation. Which is:

Given a maximum height difference the size of submatrix is static. I have prepared a matrix of height differences. If a point of the sub-matrix is higher than the height of the overshadowed place + the height difference, means it will actually overshadow the place.

Now, when I want to look at the time the place is overshadowed, this is depending on the angle segments. Like the shadowing time in the middle south is lower than the time in the west or east.

And now we come to the part of the code that is missing: Every shadow segment can throw that show only once. Example: A building throws a shadow on me, I don’t care about the shadow behind.

So what is still needed/missing is to split the old codes line:

shadowed_segments[angle] = weights[angle] * overshadowed

to have a set of angles and add the angle if overshadowed evaluates to true. Than only in the end for every angle in the set of angles the Time segment of the angle segments can be calculated.

My current implementation: is about six, seven times slower than, @jroon 's . Also @turbo does not not work on sets. So the current implementation is SUPER slow.

function computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    angle_set = BitSet(0)

    @inbounds for x2 ∈ 1:width
        for y2 ∈ 1:height
            angle = angle_mat[y2, x2]
            @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            union!(angle_set, angle*overshadowed)
        end
    end

    result = zero(eltype(weights))
    @inbounds for elem ∈ setdiff!(angle_set, 0)
        @fastmath result += weights[elem]
    end

    return result
end

function computeAllLines5!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x ∈ 1:size(org_sized, 2) - 1
        @threads for y ∈ 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

Ordens_Ritter · July 31, 2021, 12:10pm

I am still trieing the improve the code for the additional restriction. The current code uses the improvements @jroon and @brenhinkeller added a while ago.
And adds an additional loop afterwards. I tried different implementations so far, but everyone had a different setback I did not understand.

I wanted to ask you about the problems of the latest two versions.
The latest version (the one with the fewest setbacks) is 25 % faster with @turbo than with @tturbo and even this version is 17 % slower than when I use @bounds instead. (Therefore I updated: LoopVectorization, didn’t change anything)

function computeSumHours3!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, angle_set)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    angle_set .= zero(eltype(angle_set))

    for x2 ∈ 1:width
        @inbounds for y2 ∈ 1:height
            angle = angle_mat[y2, x2]
            @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            angle_set[angle] |= overshadowed
        end
    end

    result = zero(eltype(weights));
    @inbounds for elem ∈ 1:length(weights)
        @fastmath result += weights[elem] * angle_set[elem];
    end

    return result
end

I added the parameter:
angle_set = zeros(Bool, length(weights))

The version I tried before:

        0     height, width = size(height_mat)
        0     short_elevations = @view enlarged_size[y:y+height, x:x+width]
        0     angle_set .= zero(eltype(angle_set))
        - 
        0     for x2 ∈ 1:width
        0         @turbo for y2 ∈ 1:height
        -             angle = angle_mat[y2, x2]
        -             @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
        0             angle_set[angle] |= overshadowed
        -         end
        -     end
        - 
        -     result = zero(eltype(weights))
     7456     angle_list = findall(angle_set .== true)
        0     @inbounds for elem ∈ angle_list
        0         @fastmath result += weights[elem]
        -     end
        - 
        0     return result
        - end

It fails when the second loop uses @turbo because the iteration isn’t of the form: 1:length(…) and it uses tons of memory for `findall(angle_set .== true). Which I think is very strange, because the array has only 361 entries.

jling · July 31, 2021, 12:13pm

isn’t this just findall(angle_set)?.. .== true will allocate a new array for no reason.

Sukera · July 31, 2021, 1:39pm

Ordens_Ritter:

        -     result = zero(eltype(weights))
     7456     angle_list = findall(angle_set .== true)
        0     @inbounds for elem ∈ angle_list
        0         @fastmath result += weights[elem]
        -     end

Could also be written like this, which should be easier for the auto vectorizer to understand:

result = zero(eltype(weights))
for elem in eachindex(angle_set)
    result += weights[elem] * angle_set[elem] # true/false is a number in julia
end

Looks like a classic case for mapreduce though (or threaded equivalents).

Ordens_Ritter · July 31, 2021, 1:43pm

superbe, reduces the memory usage somewhat from 7,4 kb to 2.95 kb. Still I don’t understand why, there is so much memory usage left. Does’t it mean, that there is something in between allocated saved as Int64?

jling · July 31, 2021, 1:50pm

If these were 64-bit numbers, data alone takes

julia> 361 * 8
2888 #bytes

and there’s some tag data:

julia> Base.summarysize(rand(361))
2928

So I don’t think allocation on this order of magnitude is unacceptable.

Topic		Replies	Views
Julia Beginner (from Python): Numba outperforms Julia in rewrite. Any tips to improve performance? Performance benchmark , python , tullio , loopvectorization	56	6086	August 18, 2021
Python vs. Julia ODE Solver Performance python	30	4367	March 3, 2021
Speed up Julia code for simple Monte Carlo Pi estimation (compared to Numba) Performance performance	20	4009	August 22, 2021
Benchmarks: Julia vs Python+Numba Performance benchmark , python	4	5404	January 31, 2018
Julia is very slow compared to python/numpy. What am I doing wrong here? Performance benchmark , matrices	25	3008	April 11, 2022

Comparing Numba and Julia for a complex matrix computation

Related topics