I am trying to understand how to use the @turbo macro from LoopVectorization.jl efficiently.

I am starting with a generalised dot product, as implemented in LinearAlgebra. I essentially took the LinearAlgebra code and applied the @turbo macro following the examples in its official documentation:

function turbo_dot(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
    s = 0.0
    @turbo for j in eachindex(y)
        yj = y[j]
        if !iszero(yj)
            temp = 0.0
            for i in eachindex(x)
                temp += x[i]*A[i,j]
            end
            s += temp*yj
        end
    end
    return s
end

However, I get a strange error message: nested task error: LoadError: LoadError: LoadError: Don't know how to handle expression.

Could someone please explain to me what this means and how I should approach it? I noticed that the macro works fine if I apply it only to the inner-most loop. However, I do not understand why I should have to do that, considering, for instance, that the matrix-multiplication example (Matrix Multiplication · LoopVectorization.jl) applies it to the outer-most loop.

I forgot to mention that the code works fine if I drop the if statement. Does this imply that if-else statements cannot be used inside @turbo loops?
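For reference, the workaround I mentioned (applying @turbo only to the inner-most loop, keeping the branch in a plain outer loop) looks like this. This is just my own sketch, so I make no claims about its performance:

```julia
using LoopVectorization

# Keep the if-statement in an ordinary outer loop and let @turbo
# vectorise only the branch-free inner reduction over i.
function inner_turbo_dot(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
    s = 0.0
    for j in eachindex(y)
        yj = y[j]
        if !iszero(yj)            # branch stays outside the @turbo region
            temp = 0.0
            @turbo for i in eachindex(x)
                temp += x[i]*A[i,j]
            end
            s += temp*yj
        end
    end
    return s
end
```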

Thanks! My only concern with multi-threading is that I am already parallelising across a range of workers, and I am not sure whether there is any scope left for it.

julia> using LoopVectorization, LinearAlgebra
julia> function fastmath_dot(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
           s = 0.0
           @fastmath for j in eachindex(y)
               yj = y[j]
               if !iszero(yj)
                   temp = 0.0
                   for i in eachindex(x)
                       temp += x[i]*A[i,j]
                   end
                   s += temp*yj
               end
           end
           return s
       end
fastmath_dot (generic function with 1 method)
julia> function turbo_dot(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
           s = 0.0
           @turbo for j in eachindex(y), i in eachindex(x)
               s += x[i]*A[i,j]*y[j]
           end
           return s
       end
turbo_dot (generic function with 1 method)
julia> x = rand(100);
julia> y = rand(100);
julia> A = rand(100,100);
julia> y[rand(axes(y,1), length(y) ÷ 2)] .= 0;
julia> @btime turbo_dot($x, $A, $y)
639.714 ns (0 allocations: 0 bytes)
745.9084558196317
julia> @btime fastmath_dot($x, $A, $y)
4.253 μs (0 allocations: 0 bytes)
745.9084558196316
julia> @btime dot($x, $A, $y)
434.628 ns (0 allocations: 0 bytes)
745.9084558196316
julia> @btime turbo_dot($y, $A, $x)
635.095 ns (0 allocations: 0 bytes)
743.8372158481377
julia> @btime fastmath_dot($y, $A, $x)
6.860 μs (0 allocations: 0 bytes)
743.8372158481382
julia> @btime dot($y, $A, $x)
781.343 ns (0 allocations: 0 bytes)
743.837215848138
julia> versioninfo()
Julia Version 1.8.0-DEV.383
Commit eb83c4d25b* (2021-08-20 20:03 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)

So which version is faster depends on just how many zeros there are.
You can also try this, but it was slower for me at this size:

julia> function turbo_dot2(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
           s = 0.0
           @turbo for j in eachindex(y)
               temp = 0.0
               for i in eachindex(x)
                   temp += x[i]*A[i,j]
               end
               s += temp*y[j]
           end
           return s
       end
turbo_dot2 (generic function with 1 method)
julia> @btime turbo_dot2($y, $A, $x)
663.428 ns (3 allocations: 64 bytes)
743.8372158481376
julia> @btime turbo_dot2($x, $A, $y)
665.138 ns (3 allocations: 64 bytes)
745.9084558196316
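As a quick sanity check (my own, not part of the benchmark above), the collapsed-loop @turbo version should agree with LinearAlgebra's three-argument dot up to floating-point reassociation:

```julia
using LinearAlgebra, LoopVectorization

# Collapsed-loop version from above: the branch is replaced by an
# unconditional multiply with y[j], so zeros contribute nothing.
function turbo_dot(x::AbstractVector{Float64}, A::AbstractMatrix{Float64}, y::AbstractVector{Float64})
    s = 0.0
    @turbo for j in eachindex(y), i in eachindex(x)
        s += x[i]*A[i,j]*y[j]
    end
    return s
end

x, y, A = rand(100), rand(100), rand(100, 100)
# The two results differ only by reassociation of the summation order.
@assert isapprox(turbo_dot(x, A, y), dot(x, A, y); rtol = 1e-10)
```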