Calculating "triple" dot products

I have Float64 arrays x, y, z and want to compute sum(x[i] * y[i] * z[i]) over all indices i.

I’m curious: what is the fastest way to calculate this?

Some ideas:

  • a plain for loop
  • Einsum.jl
  • OMEinsum.jl
  • LinearAlgebra.dot calls into BLAS.dot, but I’m not sure whether there is a 3-vector version of this.

Have you already tried some approach that we can compare to?

A straightforward implementation:

function dot3(x, y, z)
  s = zero(promote_type(eltype(x), eltype(y), eltype(z)))
  @simd for i in eachindex(x, y, z)
    s += x[i] * y[i] * z[i]
  end
  return s
end

I am getting

julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime dot3($x, $y, $z);
  118.453 ns (0 allocations: 0 bytes)

What are the cases in which this is useful?

What about something more Julian like

sum(x .* y .* z)
julia> x = rand(1000); y = rand(1000); z = rand(1000);

julia> @btime sum(x .* y .* z);
  253.994 ns (6 allocations: 7.95 KiB)

This @simd for version by barucden above is probably approximately as fast as you can get. There’s some chance it does some suboptimal vectorization and that something based on LoopVectorization (or a similar package) might eke out some more, but I doubt it will make a big difference to your overall program. I suspect that computing x, y, and z in the first place costs significantly more than this sum ever will (for example, even @btime fill!($x, 1); costs ~40% of what the sum does).

There just isn’t much optimization that one can do for simple 1-loop calculations.

You can do something simpler like mapreduce(*, +, x, y, z) or sum(prod, zip(x, y, z)), but those have some open performance issues right now so aren’t competitive with the @simd for implementation.
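For reference, those alternative formulations look like this (same result as the loop, just not as fast at the moment):

```julia
x = rand(1000); y = rand(1000); z = rand(1000)

# three-array mapreduce: multiply elementwise, then sum
s1 = mapreduce(*, +, x, y, z)

# iterate over triples and sum their products
s2 = sum(prod, zip(x, y, z))
```

Both are allocation-light one-liners, which is why they are appealing despite the current performance gap.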


If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all arguments


I would not call that more Julian. Avoiding unnecessary intermediate allocations is very Julian, so I would avoid this approach.
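If you like the broadcast notation, one sketch for keeping it while avoiding the temporary array is to reduce over the lazy Broadcasted object directly (note that broadcasted and instantiate are internals of Base.Broadcast, not part of the documented public API):

```julia
using Base.Broadcast: broadcasted, instantiate

# sum the elementwise products without materializing x .* y .* z
dot3_lazy(x, y, z) = sum(instantiate(broadcasted(*, x, y, z)))
```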

I asked because for the input example provided with Vector{Float64}, using only eachindex(x) gives exactly the same performance.

It’s safer in terms of bounds-checking:

julia> x = rand(3); y = rand(2);

julia> for i in eachindex(x); @show x[i] + y[i] end
x[i] + y[i] = 1.0647502606457395
x[i] + y[i] = 1.0344221061391612
ERROR: BoundsError: attempt to access 2-element Vector{Float64} at index [3]

julia> for i in eachindex(x, y); @show x[i] + y[i] end
ERROR: DimensionMismatch: all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)

x, y and z are all similar, so of course there’s no difference. The point is if they are different in some way, especially to catch size differences, but also just to find a workable iterator for all of them.
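For instance, a non-contiguous view has no fast linear indexing, so eachindex falls back to CartesianIndices for it:

```julia
A = rand(3, 4)
eachindex(A)               # Base.OneTo(12): Array has fast linear indexing

V = view(A, 1:2, 1:3)
eachindex(V)               # CartesianIndices((2, 3)): no fast linear indexing
```

Passing both A and V to eachindex would give an iterator that indexes both correctly (here it would throw, since their indices differ).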


I have read the help docstring but I do not have a good picture of practical examples:

help?> eachindex
...
If you supply more than one AbstractArray argument, eachindex will create an iterable object that is fast for all arguments (typically a UnitRange if all inputs have fast linear indexing, a CartesianIndices otherwise).

Thanks all - the responses are great and I learned more about Julia in the process.


@rafael.guerra an important aspect is that eachindex throws when there’s nothing reasonable to return:

julia> eachindex(3:5, 3:7)
ERROR: DimensionMismatch: all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(5)

Putting a bounds check before a loop helps eliminate bounds checks within the loop body. eachindex throwing makes the eachindex call a bounds check.
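Spelled out manually, that up-front check plus @inbounds would look something like this (a hypothetical helper, not code from the thread):

```julia
function dot3_checked(x, y, z)
    # the explicit equivalent of eachindex(x, y, z) throwing on a mismatch
    axes(x) == axes(y) == axes(z) ||
        throw(DimensionMismatch("x, y and z must share the same indices"))
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i] * y[i] * z[i]
    end
    return s
end
```

The single check before the loop lets the compiler drop per-iteration bounds checks without risking out-of-bounds reads.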


@nsajko, thank you. I think @bertschi and others were pointing to that same advantage of passing multiple array arguments to eachindex.
I was just trying to better understand the help docstring.
Also, the bounds checks do not seem to impact the performance for the example provided.

Benchmark code example
function dotproduct3a(x, y, z)
    s = zero(eltype(x))
    @simd for i in eachindex(x, y, z)       # note multiple-argument eachindex()
        s += x[i]*y[i]*z[i]
    end
    return s
end

function dotproduct3b(x, y, z)
    s = zero(eltype(x))
    @simd for i in eachindex(x)             # note single-argument eachindex()
        s += x[i]*y[i]*z[i]
    end
    return s
end

n = 100_000
x = rand(n); y = rand(n); z = rand(n);

using BenchmarkTools
@btime dotproduct3a($x, $y, $z)         # 30.6 μs (0 allocs: 0 bytes)
@btime dotproduct3b($x, $y, $z)         # 30.6 μs (0 allocs: 0 bytes)
versioninfo()

Julia Version 1.11.2
Commit 5e9a32e7af (2024-12-01 20:02 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 20 × 13th Gen Intel(R) Core™ i9-13900H
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)
Environment:
JULIA_PKG_USE_CLI_GIT = true
JULIA_STACKFRAME_FUNCTION_COLOR = blue
JULIA_WARN_COLOR = cyan
JULIA_EDITOR = code.cmd -g
JULIA_NUM_THREADS = 8


OK, here’s an example where passing all collections to eachindex is important for safety. The above example uses @simd, which transforms the code of the loop, and I don’t know exactly what transformations @simd performs, so in this example I replaced @simd with @fastmath and @inbounds. This keeps the same performance but is easier to understand and discuss, given that there’s no magic code generation:

function dot3(x, y, z)
    s = zero(promote_type(eltype(x), eltype(y), eltype(z)))
    @inline @fastmath for i in eachindex(x, y, z)
        s += @inbounds x[i] * y[i] * z[i]
    end
    s
end

In this example, it’s important that we have eachindex(x, y, z) instead of eachindex(x), because we want an out-of-bounds access to result in a throw rather than in undefined behavior (UB).

With @simd it appears that eachindex(x) is adequate, at least with the current implementation of @simd, but @simd is a bit of a spooky black box, not something to base best practices on.
