Memory allocation in dot product

Hello everyone.

Is it possible to perform a dot product between multidimensional arrays without allocating memory?

For example:

using LinearAlgebra, BenchmarkTools

function funza(A)
    return dot(A, A)
end

A = rand(5, 5, 5, 5, 5);

@btime funza(A)
  428.323 ns (1 allocation: 16 bytes)
1041.8233479785777

I suspect that the memory allocation comes from the fact that dot() creates a scalar variable to store the result, but maybe I’m wrong. I cannot find a dot! function in the Julia documentation…

In any case, how can I avoid allocating such memory?

Additional and “optional” questions:

  • Is there a faster way to do this contraction (maybe using BLAS, unless it is already used by default)?
  • How can this be done on the GPU? (See the sketch right after this list.)
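
For the GPU question, a minimal sketch, assuming CUDA.jl and a CUDA-capable device. dot on CuArrays should dispatch to a device implementation; the mapreduce line spells out the same sum of squares in case it does not:

using CUDA, LinearAlgebra

A = rand(5, 5, 5, 5, 5)
dA = CuArray(A)               # copy the data to the GPU

dot(dA, dA)                   # contraction runs on the device
mapreduce(x -> x * x, +, dA)  # equivalent reduction, also on the device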

Thanks a lot to everyone!

I think I found the answer myself in the documentation at https://github.com/JuliaSIMD/LoopVectorization.jl#dot-product

using LoopVectorization

function mydotavx(A)
    s = 0.0
    @turbo for i ∈ eachindex(A)
        s += A[i] * A[i]
    end
    s
end
@btime mydotavx($A)

  105.684 ns (0 allocations: 0 bytes)

1041.8233479785772
1 Like
julia> A = rand(5,5,5,5,5);

julia> @btime mydotavx($A)
  74.698 ns (0 allocations: 0 bytes)
1050.8122689068127

julia> @btime funza($A)
  130.944 ns (0 allocations: 0 bytes)
1050.8122689068127

I think it was allocating memory for you because you didn’t interpolate A:

@btime funza(A)
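
For completeness, a side-by-side sketch (timings taken from the runs above, on different machines; the key difference is the allocation count):

julia> @btime funza(A)    # global not interpolated: spurious allocation
  428.323 ns (1 allocation: 16 bytes)

julia> @btime funza($A)   # global interpolated: none
  130.944 ns (0 allocations: 0 bytes)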

Still, LoopVectorization/hand-written loops should be much faster than dot on most CPUs here, because most of the CPU’s execution resources are spent loading memory: A[i] * A[i] needs only a single load instead of two.

If you want multithreading, you’ll need @tturbo. For problems too small for multithreading, stick with @turbo.
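
A hypothetical threaded variant (the name mydotavx_t is made up for illustration; start Julia with multiple threads for it to pay off):

function mydotavx_t(A)
    s = 0.0
    @tturbo for i ∈ eachindex(A)   # like @turbo, plus multithreading
        s += A[i] * A[i]
    end
    s
end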

2 Likes

Could LoopVectorization or LinearAlgebra transform dot(A,A) into a single-load version by recognizing the identical arguments?

Currently, LoopVectorization only has access to the information in the scope of the macro, so if it is given code like this:

function mydotavx_v2(A)
    B = A
    s = 0.0
    @turbo for i ∈ eachindex(A)
        s += A[i]*B[i]
    end
    s
end

it has no way of knowing that A === B. Maybe this can change once it switches to being a compiler plugin? Not sure what the scope of the analysis capabilities will be there.

In the above example, LoopVectorization’s code will involve two loads.
LLVM will, however, optimize it into a single load.

julia> @btime mydotavx_v2($A)
  114.125 ns (0 allocations: 0 bytes)
1050.8122689068127

julia> @code_native syntax=:intel debuginfo=:none mydotavx_v2(A)

the loop:

.LBB0_3:                                # %L43
                                        # =>This Inner Loop Header: Depth=1
        vmovupd zmm4, zmmword ptr [rax + 8*rdi]
        vmovupd zmm5, zmmword ptr [rax + 8*rdi + 64]
        vmovupd zmm6, zmmword ptr [rax + 8*rdi + 128]
        vmovupd zmm7, zmmword ptr [rax + 8*rdi + 192]
        vfmadd231pd     zmm3, zmm4, zmm4        # zmm3 = (zmm4 * zmm4) + zmm3
        vfmadd231pd     zmm2, zmm5, zmm5        # zmm2 = (zmm5 * zmm5) + zmm2
        vfmadd231pd     zmm0, zmm6, zmm6        # zmm0 = (zmm6 * zmm6) + zmm0
        vfmadd231pd     zmm1, zmm7, zmm7        # zmm1 = (zmm7 * zmm7) + zmm1
        add     rdi, 32
        cmp     rcx, rdi
        jne     .LBB0_3

4 loads and 4 fmas.

Nonetheless, performance is lost because LoopVectorization’s optimization decisions were made assuming it has to do two loads per fma. As a result, it assumes less dependency cutting is needed than in the actual single-load case we have here.

1 Like

Maybe add some variation of norm() with p = 2 that skips the sqrt() operation (maybe norm2sqr())?
Then the case of a dot product of a linear operator with itself would be computed efficiently.

1 Like
LinearAlgebra.norm_sqr
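
A quick illustrative check against dot:

julia> using LinearAlgebra

julia> A = rand(5, 5, 5, 5, 5);

julia> LinearAlgebra.norm_sqr(A) ≈ dot(A, A)
true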
1 Like

Too bad it can’t be found in the documentation (where I looked before posting).

Also:

help?> LinearAlgebra.norm_sqr
  No documentation found.

Moreover, strange result:

julia> vA = [2; 3]
2-element Vector{Int64}:
 2
 3

julia> LinearAlgebra.norm_sqr(vA)
12.999999999999998

No sqrt() is applied, so why isn’t the result exactly 13?

1 Like

Since it is an internal undocumented function, I wouldn’t have too many expectations on exactly how it works.

That’s a tantalizing remark! For someone who knows next to nothing about ‘compiler plugins’, what would this look like from the user perspective?

2 Likes

This is what the implementation says:

# faster computation of norm(x)^2, avoiding overflow for integers
norm_sqr(x) = norm(x)^2
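
That definition also explains the 12.999999999999998 above: norm([2; 3]) returns the rounded sqrt(13.0), and squaring it does not round-trip exactly:

julia> sqrt(13.0)^2
12.999999999999998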

The overflow part is clear. How it can be faster is beyond my understanding: it certainly isn’t faster than a straightforward implementation, although admittedly the two functions give different results as one approaches the overflow scale, and the Base function is the one closer to being correct.
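A quick illustration of the overflow point, with a value whose square exceeds typemax(Int64):

julia> using LinearAlgebra

julia> v = [4_000_000_000];

julia> v[1]^2                      # Int64 arithmetic wraps around
-2446744073709551616

julia> LinearAlgebra.norm_sqr(v)   # computed via floating-point norm, no overflow
1.6e19
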

julia> x = rand(1:1000,1000);

julia> @btime LinearAlgebra.norm_sqr($x)
  2.903 μs (0 allocations: 0 bytes)
3.5071326500000006e8

julia> function n2(x)
          n = zero(eltype(x))
          @simd for el in x
             n += el^2
          end
          n
       end
n2 (generic function with 1 method)

julia> @btime n2($x)
  162.360 ns (0 allocations: 0 bytes)
350713265

If you have AVX512, converting to floats can be faster:

julia> function n2f(x)
          n = zero(eltype(x))
          @simd for el in x
             @fastmath n += Float64(el)^2
          end
          n
       end
n2f (generic function with 1 method)

julia> function n2(x)
          n = zero(eltype(x))
          @simd for el in x
             n += el^2
          end
          n
       end
n2 (generic function with 1 method)

julia> x = rand(1:1000,1000);

julia> @btime n2f($x)
  57.579 ns (0 allocations: 0 bytes)
3.36560308e8

julia> @btime n2($x)
  79.610 ns (0 allocations: 0 bytes)
336560308

If you don’t, it’ll be slower. @turbo handles the Int64-to-Float64 conversion much better than @simd @fastmath does without AVX512, but it’d probably still end up a fair bit slower than the non-converting version.
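
A hypothetical @turbo version of the converting loop (the name n2f_turbo is made up for illustration):

using LoopVectorization

function n2f_turbo(x)
    n = 0.0
    @turbo for i in eachindex(x)
        n += Float64(x[i])^2   # convert each element, accumulate in Float64
    end
    n
end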

I’d also love an answer to that question.
I’m rewriting LoopVectorization in three parts, in the following order:

  1. Loop representation and optimization.
  2. Generating code that can be compiled and run from an optimized schedule.
  3. Creating a representation from code.

So I’ll wait to start diving into compiler plugins until I actually get to step 3, which is progressing at a glacial pace so far. Hopefully by then the work will have advanced and we’ll have some combination of documentation and examples to look at, or the pandemic will have died down enough that I can more easily ask people (i.e. Keno) questions.

3 Likes