Comparing Numba and Julia for a complex matrix computation

This seems to be another case where the best optimization advice for this kind of numeric code is just to write simple loops over arrays, indexing element-wise, and then add LoopVectorization.

2 Likes

Wow I gotta learn about @tturbo it seems :sweat_smile:

1 Like

Gotta go fast :grin:

1 Like

Basically, the magic starts with the @turbo macro (originally called @avx), which Chris Elrod wrote to take a for loop* or broadcast and emit its own custom LLVM IR that makes good use of your CPU’s vector registers and instructions (e.g. AVX-512 and friends). More recently, they also added @tturbo (with an extra t for “threaded”, originally called @avxt), which uses very low-overhead threading from Polyester.jl to also multithread at the same time.

*Terms and conditions: must be reorderable, must not have branches, etc.
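
For a taste of what that looks like, here’s a minimal sketch (assuming LoopVectorization.jl is installed; the function names are made up for illustration):

using LoopVectorization

# made-up helper, just to illustrate the macro on a simple reduction loop
function dot_turbo(a, b)
    s = zero(eltype(a))
    @turbo for i in 1:length(a)    # SIMD-vectorized, single-threaded
        s += a[i] * b[i]
    end
    return s
end

function dot_tturbo(a, b)
    s = zero(eltype(a))
    @tturbo for i in 1:length(a)   # same, plus Polyester-based threading
        s += a[i] * b[i]
    end
    return s
end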

Some recent discussion of the SIMD part on Zulip: Zulip

3 Likes

Thanks @jroon for trying to simplify the code. You made the code simpler and even faster. As for @brenhinkeller: somehow I got the vague impression that the @tturbo macro doesn’t scale so well on some systems, but I am not completely sure yet.

Let me try to explain what this example code is an abstraction of, and which mistake I made.

Given a matrix of heights, the question is: how long is every place overshadowed? It seems that a limited submatrix of places is enough to calculate the overshadowing of a given point. I try to prepare/preprocess some matrices to simplify the actual computation. Which is:

Given a maximum height difference, the size of the submatrix is static. I have prepared a matrix of height differences. If a point of the submatrix is higher than the height of the overshadowed place plus that height difference, it will actually overshadow the place.

Now, when I want to look at how long the place is overshadowed, this depends on the angle segments. For example, the shadowing time in the south is lower than the time in the west or east.

And now we come to the part of the code that is missing: every angle segment can cast its shadow only once. Example: a building throws a shadow on me; I don’t care about the shadow behind it.

So what is still needed/missing is to split the old code’s line:

shadowed_segments[angle] = weights[angle] * overshadowed

into a set of angles, adding the angle whenever overshadowed evaluates to true. Then, only at the end, the time segment can be accumulated for every angle in the set.
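
Roughly, the change I have in mind looks like this (just a sketch; angle_set here is illustrative):

# inside the inner loop: record each shadowing angle at most once
overshadowed && push!(angle_set, angle)

# afterwards: sum the time weights only over the angles that were hit
result = sum(weights[angle] for angle in angle_set; init = zero(eltype(weights)))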

My current implementation is about six to seven times slower than @jroon 's. Also, @turbo does not work on sets. So the current implementation is SUPER slow.

function computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    # start with 0 in the set, so the "not overshadowed" case (angle*false == 0) is a no-op below
    angle_set = BitSet(0)

    @inbounds for x2 ∈ 1:width
        for y2 ∈ 1:height
            angle = angle_mat[y2, x2]
            @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            union!(angle_set, angle*overshadowed)
        end
    end

    # sum the weights of all angles that actually shadow this point, dropping the dummy 0
    result = zero(eltype(weights))
    @inbounds for elem ∈ setdiff!(angle_set, 0)
        @fastmath result += weights[elem]
    end

    return result
end

using Base.Threads: @threads  # needed for the @threads loop below

function computeAllLines5!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
    for x ∈ 1:size(org_sized, 2) - 1
        @threads for y ∈ 1:size(org_sized, 1) - 1
            shadow_time_hrs[x, y] = computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
        end
    end
    return shadow_time_hrs
end

I am still trying to improve the code for the additional restriction. The current code uses the improvements @jroon and @brenhinkeller added a while ago, and adds an additional loop afterwards. I tried different implementations so far, but every one had a different drawback I did not understand.

I wanted to ask you about the problems with the latest two versions.
The latest version (the one with the fewest drawbacks) is 25 % faster with @turbo than with @tturbo, and even this version is 17 % slower than when I just use @inbounds instead. (I therefore updated LoopVectorization, but that didn’t change anything.)

function computeSumHours3!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, angle_set)
    height, width = size(height_mat)
    short_elevations = @view enlarged_size[y:y+height, x:x+width]
    angle_set .= zero(eltype(angle_set))   # reset the reusable per-angle buffer

    for x2 ∈ 1:width
        @inbounds for y2 ∈ 1:height
            angle = angle_mat[y2, x2]
            @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
            angle_set[angle] |= overshadowed   # mark each angle at most once
        end
    end

    result = zero(eltype(weights))
    @inbounds for elem ∈ 1:length(weights)
        @fastmath result += weights[elem] * angle_set[elem]
    end

    return result
end

I added the parameter:
angle_set = zeros(Bool, length(weights))
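
A call site then looks roughly like this (sketch only, reusing the buffer between calls):

angle_set = zeros(Bool, length(weights))   # one reusable Bool per angle segment
hours = computeSumHours3!(org_sized, enlarged_size, angle_mat, height_mat,
                          weights, y, x, angle_set)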

The version I tried before (the left column shows the per-line allocation in bytes):

        0     height, width = size(height_mat)
        0     short_elevations = @view enlarged_size[y:y+height, x:x+width]
        0     angle_set .= zero(eltype(angle_set))
        - 
        0     for x2 ∈ 1:width
        0         @turbo for y2 ∈ 1:height
        -             angle = angle_mat[y2, x2]
        -             @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
        0             angle_set[angle] |= overshadowed
        -         end
        -     end
        - 
        -     result = zero(eltype(weights))
     7456     angle_list = findall(angle_set .== true)
        0     @inbounds for elem ∈ angle_list
        0         @fastmath result += weights[elem]
        -     end
        - 
        0     return result
        - end

It fails when the second loop uses @turbo because the iteration isn’t of the form 1:length(…), and it uses tons of memory for `findall(angle_set .== true)`, which I think is very strange because the array has only 361 entries.

Isn’t this just findall(angle_set)? The .== true will allocate a new array for no reason.

2 Likes

Could also be written like this, which should be easier for the auto vectorizer to understand:

result = zero(eltype(weights))
for elem in eachindex(angle_set)
    result += weights[elem] * angle_set[elem] # true/false is a number in julia
end

Looks like a classic case for mapreduce though (or threaded equivalents).
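
For example, a sketch of the mapreduce form (untested here; a threaded variant could come from e.g. ThreadsX.mapreduce):

result = mapreduce(i -> weights[i] * angle_set[i], +, eachindex(angle_set);
                   init = zero(eltype(weights)))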

Superb, that reduces the memory usage somewhat, from 7.4 kB to 2.95 kB. Still, I don’t understand why there is so much memory usage left. Doesn’t it mean that something in between is allocated and stored as Int64?

If these were 64-bit numbers, data alone takes

julia> 361 * 8
2888 #bytes

and there’s some tag data:

julia> Base.summarysize(rand(361))
2928

So I don’t think an allocation of this order of magnitude is unreasonable.

1 Like