I suspect that the memory allocation comes from the fact that dot() creates a scalar variable to store the result, but maybe I’m wrong. I cannot find a dot! function in the Julia documentation…
In any case, how can I avoid allocating such memory?
Additional and “optional” questions:
Is there a faster way to do this contraction (maybe using BLAS, unless the latter is used by default)?
I think it was allocating memory for you because you didn’t interpolate A:
@btime funza(A)
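For reference, a minimal sketch of the difference (funza here is just a stand-in for the dot-with-itself function from the original post):

using BenchmarkTools, LinearAlgebra

funza(A) = dot(A, A)   # stand-in definition, purely for illustration
A = rand(1000)

@btime funza(A)    # A is a non-constant global: spurious allocations may be reported
@btime funza($A)   # interpolated: the benchmark sees a concretely typed argument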
Still, LoopVectorization or hand-written loops should be much faster than dot on most CPUs here, because most of the CPU’s execution resources are spent loading memory: A[i] * A[i] needs only a single load instead of two.
If you want multithreading, you’ll need @tturbo. For problems too small for multithreading, stick with @turbo.
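A minimal single-load version along those lines might look like this (the function name is only illustrative):

using LoopVectorization

function mydotavx(A)
    s = zero(eltype(A))
    @turbo for i ∈ eachindex(A)   # swap in @tturbo for multithreading on large inputs
        s += A[i] * A[i]          # one load per iteration instead of two
    end
    s
end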
Currently, LoopVectorization only has access to the information in the scope of the macro, so if given code like this:
function mydotavx_v2(A)
    B = A
    s = 0.0
    @turbo for i ∈ eachindex(A)
        s += A[i] * B[i]
    end
    s
end
it has no way of knowing that A === B. Maybe this can change once it switches to being a compiler plugin? Not sure what the scope of the analysis capabilities will be there.
In the above example, LoopVectorization’s code will involve two loads.
LLVM will, however, optimize it into a single load.
Nonetheless, performance is lost, because LoopVectorization’s optimization decisions were made assuming it has to do two loads per fma; it therefore assumes less dependency cutting is needed than in the actual single-load case we have here.
Maybe have some variation of norm() with p = 2 that skips the sqrt() operation (norm2sqr(), perhaps)?
Then cases where a linear operator is dotted with itself would be computed efficiently.
# faster computation of norm(x)^2, avoiding overflow for integers
norm_sqr(x) = norm(x)^2
The overflow part is clear. How that can be faster is beyond my understanding; it certainly isn’t faster than a straightforward implementation. Admittedly, the two functions give different results as one approaches the overflow scale, and the Base function is the one closer to being correct.
julia> x = rand(1:1000,1000);
julia> @btime LinearAlgebra.norm_sqr($x)
2.903 μs (0 allocations: 0 bytes)
3.5071326500000006e8
julia> function n2(x)
           n = zero(eltype(x))
           @simd for el in x
               n += el^2
           end
           n
       end
n2 (generic function with 1 method)
julia> @btime n2($x)
162.360 ns (0 allocations: 0 bytes)
350713265
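To illustrate the overflow point from above, a small sketch (values chosen only so the squares exceed typemax(Int64)):

using LinearAlgebra

x = fill(10^10, 10)
LinearAlgebra.norm_sqr(x)   # norm promotes to Float64, so roughly 1.0e21
n2(x)                       # el^2 overflows Int64 and wraps around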
If you have AVX512, converting to floats can be faster:
julia> function n2f(x)
           n = zero(eltype(x))
           @simd for el in x
               @fastmath n += Float64(el)^2
           end
           n
       end
n2f (generic function with 1 method)
julia> function n2(x)
           n = zero(eltype(x))
           @simd for el in x
               n += el^2
           end
           n
       end
n2 (generic function with 1 method)
julia> x = rand(1:1000,1000);
julia> @btime n2f($x)
57.579 ns (0 allocations: 0 bytes)
3.36560308e8
julia> @btime n2($x)
79.610 ns (0 allocations: 0 bytes)
336560308
If you don’t have AVX512, it’ll be slower. In that case @turbo will do much better than @simd @fastmath at converting Int64 to Float64, but it’d probably still end up a fair bit slower than the non-converting version.
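A sketch of what such a converting @turbo variant might look like (the name n2f_turbo is illustrative, not a library function):

using LoopVectorization

function n2f_turbo(x)
    n = 0.0
    @turbo for i in eachindex(x)
        n += Float64(x[i])^2   # convert each element before squaring
    end
    n
end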
I’d also love an answer to that question.
I’m rewriting LoopVectorization in three parts, in the following order:
1. Loop representation and optimization.
2. Generating code that can be compiled and run from an optimized schedule.
3. Creating a representation from code.
So I’ll wait to dive into compiler plugins until I actually get to 3.; the effort is progressing at a glacial pace so far. Hopefully by then the work will have advanced and we’ll have some combination of documentation to look at, examples to look at, or the pandemic dying down, so I can more easily ask people (i.e. Keno) questions.