This seems to be another case where the best optimization advice for such numeric code is just to write simple loops and arrays, indexing element-wise, and then add LoopVectorization
Wow I gotta learn about @tturbo
it seems
Gotta go fast
Basically, the magic starts with the @turbo
macro (originally called @avx
) which Chris Elrod wrote to take a for
loop* or broadcast and emit its own custom llvm that makes good use of your CPU’s vector registers and instructions (e.g. AVX-512 and friends). More recently, they also added @tturbo
, (with an extra t
for “threaded”) (originally called @avxt
) which uses very low overhead threading from Polyester.jl to also multithread at the same time.
*Terms and conditions: must be reorderable, must not have branches, etc.
Some recent discussion of the SIMD part on Zulip: Zulip
Thanks @jroon for trying to simplify the code. Your made the code more simple and even faster. As for @brenhinkeller somehow I got the vague impression, that the @tturbo
marco doesn’t scale so so good on some systems, but I am not completely sure yet.
I try to explain, what this example code should be an abstraction of. Also which mistake I made.
Given there is a matrix of of heights. The question is, low long is every place overshadowed. Seams like a limited submatrix of places is enough to calculate the overshadowing of a given point. I try to prepare/prepocess some matrices to simplify the actual preparation. Which is:
Given a maximum height difference the size of submatrix is static. I have prepared a matrix of height differences. If a point of the sub-matrix is higher than the height of the overshadowed place + the height difference, means it will actually overshadow the place.
Now, when I want to look at the time the place is overshadowed, this is depending on the angle segments. Like the shadowing time in the middle south is lower than the time in the west or east.
And now we come to the part of the code that is missing: Every shadow segment can throw that show only once. Example: A building throws a shadow on me, I don’t care about the shadow behind.
So what is still needed/missing is to split the old codes line:
shadowed_segments[angle] = weights[angle] * overshadowed
to have a set of angles and add the angle if overshadowed evaluates to true. Than only in the end for every angle in the set of angles the Time segment of the angle segments can be calculated.
My current implementation: is about six, seven times slower than, @jroon 's . Also @turbo
does not not work on sets. So the current implementation is SUPER slow.
function computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
height, width = size(height_mat)
short_elevations = @view enlarged_size[y:y+height, x:x+width]
angle_set = BitSet(0)
@inbounds for x2 ∈ 1:width
for y2 ∈ 1:height
angle = angle_mat[y2, x2]
@fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
union!(angle_set, angle*overshadowed)
end
end
result = zero(eltype(weights))
@inbounds for elem ∈ setdiff!(angle_set, 0)
@fastmath result += weights[elem]
end
return result
end
function computeAllLines5!(org_sized, enlarged_size, angle_mat, height_mat, shadow_time_hrs, weights)
for x ∈ 1:size(org_sized, 2) - 1
@threads for y ∈ 1:size(org_sized, 1) - 1
shadow_time_hrs[x, y] = computeSumHours2!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x)
end
end
return shadow_time_hrs
end
I am still trieing the improve the code for the additional restriction. The current code uses the improvements @jroon and @brenhinkeller added a while ago.
And adds an additional loop afterwards. I tried different implementations so far, but everyone had a different setback I did not understand.
I wanted to ask you about the problems of the latest two versions.
The latest version (the one with the fewest setbacks) is 25 % faster with @turbo
than with @tturbo
and even this version is 17 % slower than when I use @bounds instead. (Therefore I updated: LoopVectorization, didn’t change anything)
function computeSumHours3!(org_sized, enlarged_size, angle_mat, height_mat, weights, y, x, angle_set)
height, width = size(height_mat)
short_elevations = @view enlarged_size[y:y+height, x:x+width]
angle_set .= zero(eltype(angle_set))
for x2 ∈ 1:width
@inbounds for y2 ∈ 1:height
angle = angle_mat[y2, x2]
@fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
angle_set[angle] |= overshadowed
end
end
result = zero(eltype(weights));
@inbounds for elem ∈ 1:length(weights)
@fastmath result += weights[elem] * angle_set[elem];
end
return result
end
I added the parameter:
angle_set = zeros(Bool, length(weights))
The version I tried before:
0 height, width = size(height_mat)
0 short_elevations = @view enlarged_size[y:y+height, x:x+width]
0 angle_set .= zero(eltype(angle_set))
-
0 for x2 ∈ 1:width
0 @turbo for y2 ∈ 1:height
- angle = angle_mat[y2, x2]
- @fastmath overshadowed = (short_elevations[y2, x2] - org_sized[y, x]) > height_mat[y2, x2]
0 angle_set[angle] |= overshadowed
- end
- end
-
- result = zero(eltype(weights))
7456 angle_list = findall(angle_set .== true)
0 @inbounds for elem ∈ angle_list
0 @fastmath result += weights[elem]
- end
-
0 return result
- end
It fails when the second loop uses @turbo
because the iteration isn’t of the form: 1:length(…) and it uses tons of memory for `findall(angle_set .== true). Which I think is very strange, because the array has only 361 entries.
isn’t this just findall(angle_set)
?.. .== true
will allocate a new array for no reason.
Could also be written like this, which should be easier for the auto vectorizer to understand:
result = zero(eltype(weights))
for elem in eachindex(angle_set)
result += weights[elem] * angle_set[elem] # true/false is a number in julia
end
Looks like a classic case for mapreduce
though (or threaded equivalents).
superbe, reduces the memory usage somewhat from 7,4 kb to 2.95 kb. Still I don’t understand why, there is so much memory usage left. Does’t it mean, that there is something in between allocated saved as Int64?
If these were 64-bit numbers, data alone takes
julia> 361 * 8
2888 #bytes
and there’s some tag data:
julia> Base.summarysize(rand(361))
2928
So I don’t think allocation on this order of magnitude is unacceptable.