Calculating "triple" dot products

I have Float64 arrays x, y, z and want to compute sum(x[i] * y[i] * z[i]) over all indices i.

I’m curious what is the fastest way to calculate this?

Some ideas:

  • a plain for loop
  • Einsum.jl
  • OMEinsum.jl
  • LinearAlgebra.dot calls into BLAS.dot, but I’m not sure if there are a 3-vector version of this though.

Have you already tried some approach that we can compare to?

A straightforward implementation:

function dot3(x, y, z)
  s = zero(promote_type(eltype(x), eltype(y), eltype(z)))
  @simd for i in eachindex(x, y, z)
    s += x[i] * y[i] * z[i]
  end
  return s
end

I am getting

julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime dot3($x, $y, $z);
  118.453 ns (0 allocations: 0 bytes)
2 Likes

What are the cases in which this is useful?

What about something more Julian like

sum(x .* y .* z)
julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime sum(x .* y .* z);
  253.994 ns (6 allocations: 7.95 KiB)

This @simd for version by barucden above is probably approximately as fast as you can get. There’s some chance it does some suboptimal vectorization and that something based on LoopVectorization (or another) might eek out some more, but I doubt it will make a big difference to your overall program. I suspect that computing x, y, and z in the first place costs significantly more than this sum ever will (for example, even @btime fill!($x, 1); costs ~40% of what the sum does).

There just isn’t much optimization that one can do for simple 1-loop calculations.

You can do something simpler like mapreduce(*, +, x, y, z) or sum(prod, zip(x, y, z)), but those have some open performance issues right now so aren’t competitive with the @simd for implementation.

If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all arguments

I would not call that more Julian. It is very Julian to avoid unnecessary intermediate allocations, so I would avoid this.

I asked because for the input example provided with Vector{Float64}, using only eachindex(x) gives exactly the same performance.

It’s safer in terms of bound-checking:

julia> x = rand(3); y = rand(2);

julia> for i in eachindex(x); @show x[i] + y[i] end
x[i] + y[i] = 1.0647502606457395
x[i] + y[i] = 1.0344221061391612
ERROR: BoundsError: attempt to access 2-element Vector{Float64} at index [3]

julia> for i in eachindex(x, y); @show x[i] + y[i] end
ERROR: DimensionMismatch: all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)
2 Likes

x, y and z are all similar, so of course there’s no difference. The point is if they are different in some way, especially to catch size differences, but also just to find a workable iterator for all of them.

I have read the help docstring but I do not have a good picture of practical examples:

help?> eachindex
...
If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all    
  arguments (typically a UnitRange if all inputs have fast linear indexing, a CartesianIndices otherwise).