Yes, hence also should work with 1.11
Tullio.jl would use LV if it was loaded, otherwise no.
Yes, hence also should work with 1.11
Tullio.jl would use LV if it was loaded, otherwise no.
But Loopvectorization.jl does work with 1.11. Or, at least I had that impression.
Oh yeah, seems like the 1.11 issues have been solved:
Itβs future is still unclear but at least there is some hope!
Yes, but the benchmarks for Tullio are not great without LV. If I look at my example, I see that
@bs (@tullio threads=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1
gives essentially the same results of f0!
, because Tullio is using LV to do the work. However, if I disable LV
@bs (@tullio threads=false avx=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1
then Tullio is almost ten times slower (therefore, worse than my naive implementation). Therefore, in my very specific use case, Tullio does not seem to provide any advantage. I will check, however, if it helps with GPU (although the GPU support in Tullio is experimental).
For my own reference, and with the hope that it can help other people, I am writing here what I just found from an unrelated question replied to by @Elrod.
With the help of @turbo_debug
one can know what LV is doing. Redefine the function as
function f0_debug!(out::AbstractArray{T,N}, A::AbstractArray{S,N}, kernel::AbstractArray{K,N}) where {T,S,K,N}
LoopVectorization.@turbo_debug for J β CartesianIndices(out)
tmp = zero(eltype(out))
for I β CartesianIndices(kernel)
tmp += A[J-I] * kernel[I]
end
out[J] = tmp
end
end
and then call
julia> ls = f0_debug!(out, A, kernel);
julia> LoopVectorization.choose_order(ls)
([Symbol("J#2#"), Symbol("J#1#"), Symbol("I#2#"), Symbol("I#1#")], Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#"), 1, 6)
This means that
J[2], J[1], I[2], I[1]
I[1]
and J[2]
, are the unrolled loops. The first is unrolled by 1, the second by 6 (the two integers values at the end of the list).J[1]
is the vectorized loopThere is also a related function LoopVectorization.choose_order_cost
, which also provides some βcostβ for the loop and a boolean indicating if LV will inline.