Calculating "triple" dot products

I have Float64 arrays x, y, z and want to compute sum(x[i] * y[i] * z[i]) over all indices i.

I’m curious: what is the fastest way to calculate this?

Some ideas:

  • a plain for loop
  • Einsum.jl
  • OMEinsum.jl
  • LinearAlgebra.dot calls into BLAS.dot, but I’m not sure whether there is a 3-vector version of this.

Have you already tried some approach that we can compare to?

A straightforward implementation:

function dot3(x, y, z)
  s = zero(promote_type(eltype(x), eltype(y), eltype(z)))
  @simd for i in eachindex(x, y, z)
    s += x[i] * y[i] * z[i]
  end
  return s
end

I am getting

julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime dot3($x, $y, $z);
  118.453 ns (0 allocations: 0 bytes)

What are the cases in which this is useful?

What about something more Julian like

sum(x .* y .* z)
julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime sum(x .* y .* z);
  253.994 ns (6 allocations: 7.95 KiB)

This @simd for version by barucden above is probably approximately as fast as you can get. There’s some chance it does some suboptimal vectorization and that something based on LoopVectorization (or a similar package) might eke out some more, but I doubt it will make a big difference to your overall program. I suspect that computing x, y, and z in the first place costs significantly more than this sum ever will (for example, even @btime fill!($x, 1); costs ~40% of what the sum does).

There just isn’t much optimization that one can do for simple 1-loop calculations.

You can do something simpler like mapreduce(*, +, x, y, z) or sum(prod, zip(x, y, z)), but those have some open performance issues right now so aren’t competitive with the @simd for implementation.
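For reference, those alternative formulations look like this (same result as the loop, just not as fast at the moment):

```julia
x = rand(1000); y = rand(1000); z = rand(1000)

# three-array mapreduce: multiply elementwise, then sum
s1 = mapreduce(*, +, x, y, z)

# iterate over triples and sum their products
s2 = sum(prod, zip(x, y, z))
```

Both are allocation-light one-liners, which is why they are appealing despite the current performance gap.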


If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all arguments


I would not call that more Julian. Avoiding unnecessary intermediate allocations is very Julian, so I would avoid this approach.
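If you like the broadcast notation, one sketch for keeping it while avoiding the temporary array is to reduce over the lazy Broadcasted object directly (note that broadcasted and instantiate are internals of Base.Broadcast, not part of the documented public API):

```julia
using Base.Broadcast: broadcasted, instantiate

# sum the elementwise products without materializing x .* y .* z
dot3_lazy(x, y, z) = sum(instantiate(broadcasted(*, x, y, z)))
```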

I asked because for the input example provided with Vector{Float64}, using only eachindex(x) gives exactly the same performance.

It’s safer in terms of bounds-checking:

julia> x = rand(3); y = rand(2);

julia> for i in eachindex(x); @show x[i] + y[i] end
x[i] + y[i] = 1.0647502606457395
x[i] + y[i] = 1.0344221061391612
ERROR: BoundsError: attempt to access 2-element Vector{Float64} at index [3]

julia> for i in eachindex(x, y); @show x[i] + y[i] end
ERROR: DimensionMismatch: all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)

x, y and z are all similar, so of course there’s no difference. The point is if they are different in some way, especially to catch size differences, but also just to find a workable iterator for all of them.
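For instance, a non-contiguous view has no fast linear indexing, so eachindex falls back to CartesianIndices for it:

```julia
A = rand(3, 4)
eachindex(A)               # Base.OneTo(12): Array has fast linear indexing

V = view(A, 1:2, 1:3)
eachindex(V)               # CartesianIndices((2, 3)): no fast linear indexing
```

Passing both A and V to eachindex would give an iterator that indexes both correctly (here it would throw, since their indices differ).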


I have read the help docstring but I do not have a good picture of practical examples:

help?> eachindex
...
If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all arguments (typically a UnitRange if all inputs have fast linear indexing, a CartesianIndices otherwise).

Thanks all - the responses are great and I learned more about Julia in the process.


@rafael.guerra an important aspect is that eachindex throws when there’s nothing reasonable to return:

julia> eachindex(3:5, 3:7)
ERROR: DimensionMismatch: all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(5)

Putting a bounds check before a loop helps eliminate bounds checks within the loop body. eachindex throwing makes the eachindex call a bounds check.
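Spelled out manually, that up-front check plus @inbounds would look something like this (a hypothetical helper, not code from the thread):

```julia
function dot3_checked(x, y, z)
    # the explicit equivalent of eachindex(x, y, z) throwing on a mismatch
    axes(x) == axes(y) == axes(z) ||
        throw(DimensionMismatch("x, y and z must share the same indices"))
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i] * y[i] * z[i]
    end
    return s
end
```

The single check before the loop lets the compiler drop per-iteration bounds checks without risking out-of-bounds reads.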


@nsajko, thank you. I think @bertschi and others were pointing to that same advantage of passing multiple array arguments to eachindex.
I was just trying to better understand the help docstring.
Also, the bounds checks do not seem to impact the performance for the example provided.

Benchmark code example
function dotproduct3a(x, y, z)
    s = zero(eltype(x))
    @simd for i in eachindex(x, y, z)       # note multiple-argument eachindex()
        s += x[i]*y[i]*z[i]
    end
    return s
end

function dotproduct3b(x, y, z)
    s = zero(eltype(x))
    @simd for i in eachindex(x)             # note single-argument eachindex()
        s += x[i]*y[i]*z[i]
    end
    return s
end

n = 100_000
x = rand(n); y = rand(n); z = rand(n);

using BenchmarkTools
@btime dotproduct3a($x, $y, $z)         # 30.6 μs (0 allocs: 0 bytes)
@btime dotproduct3b($x, $y, $z)         # 30.6 μs (0 allocs: 0 bytes)
versioninfo()

Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 13th Gen Intel(R) Core™ i9-13900H
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)
Environment:
JULIA_PKG_USE_CLI_GIT = true
JULIA_STACKFRAME_FUNCTION_COLOR = blue
JULIA_WARN_COLOR = cyan
JULIA_EDITOR = code.cmd -g
JULIA_NUM_THREADS = 8


OK, here’s an example where passing all collections to eachindex is important for safety. The above example uses @simd, which transforms the code of the loop, and I don’t know exactly what transformations @simd performs, so in this example I replaced @simd with @fastmath and @inbounds. This keeps the same performance but is easier to understand and discuss, given that there’s no magic code generation:

function dot3(x, y, z)
    s = zero(promote_type(eltype(x), eltype(y), eltype(z)))
    @inline @fastmath for i in eachindex(x, y, z)
        s += @inbounds x[i] * y[i] * z[i]
    end
    s
end

In this example, it’s important that we have eachindex(x, y, z) instead of eachindex(x), because we want an out-of-bounds access to result in a throw rather than in undefined behavior (UB).

With @simd it appears that eachindex(x) is adequate, at least with the current implementation of @simd, but @simd is a bit of a spooky black box, not something to base best practices on.
