This seems to be another case where the best optimization advice for such numeric code is just to write simple loops and arrays, indexing element-wise, and then add LoopVectorization
Wow I gotta learn about
@tturbo it seems
Gotta go fast
Basically, the magic starts with the
@turbo macro (originally called
@avx) which Chris Elrod wrote to take a
for loop* or broadcast and emit its own custom llvm that makes good use of your CPU’s vector registers and instructions (e.g. AVX-512 and friends). More recently, they also added
@tturbo, (with an extra
t for “threaded”) (originally called
@avxt) which uses very low overhead threading from Polyester.jl to also multithread at the same time.
*Terms and conditions: must be reorderable, must not have branches, etc.
Some recent discussion of the SIMD part on Zulip: JuliaLang
Thanks @jroon for trying to simplify the code. Your made the code more simple and even faster. As for @brenhinkeller somehow I got the vague impression, that the
@tturbo marco doesn’t scale so so good on some systems, but I am not completely sure yet.
I try to explain, what this example code should be an abstraction of. Also which mistake I made.
Given there is a matrix of of heights. The question is, low long is every place overshadowed. Seams like a limited submatrix of places is enough to calculate the overshadowing of a given point. I try to prepare/prepocess some matrices to simplify the actual preparation. Which is:
Given a maximum height difference the size of submatrix is static. I have prepared a matrix of height differences. If a point of the sub-matrix is higher than the height of the overshadowed place + the height difference, means it will actually overshadow the place.
Now, when I want to look at the time the place is overshadowed, this is depending on the angle segments. Like the shadowing time in the middle south is lower than the time in the west or east.
And now we come to the part of the code that is missing: Every shadow segment can throw that show only once. Example: A building throws a shadow on me, I don’t care about the shadow behind.
So what is still needed/missing is to split the old codes line:
shadowed_segments[angle] = weights[angle] * overshadowed
to have a set of angles and add the angle if overshadowed evaluates to true. Than only in the end for every angle in the set of angles the Time segment of the angle segments can be calculated.
My current implementation: is about six, seven times slower than, @jroon 's . Also
@turbo does not not work on sets. So the current implementation is SUPER slow.
function computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x) height, width = size(height_mat) short_elevations = @view enlarged_size[y:y+height, x:x+width] angle_set = BitSet(0) @inbounds for x2 ∈ 1:width for y2 ∈ 1:height angle = angle_mat[y2, x2] @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2] union!(angle_set, angle*overshadowed) end end result = zero(eltype(weights)) @inbounds for elem ∈ setdiff!(angle_set, 0) @fastmath result += weights[elem] end return result end function computeAllLines5!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights) for x ∈ 1:size(org_sized, 2) - 1 @threads for y ∈ 1:size(org_sized, 1) - 1 shadow_time_hrs[x, y] = computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x) end end return shadow_time_hrs end