Broadcasting performance

f(x) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
X    = randn(100_000_000)

@time f.(X)                          # 4  seconds  
@time @threads for x in X; f(x) end  # .8 seconds  16 threads
@time @turbo f.(X)                   # .4 seconds   
@time f.(ROCArray(X))                # .1 seconds  AMD GPU 

@threads uses the whole CPU but @turbo is faster. Is there a way to get the speed up of @turbo plus use the whole CPU ?
GPU is super fast but ends up being slower if you include the time to move to the GPU and back to CPU.

2 Likes

You should try @threads inside a function.

julia> f(x) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
f (generic function with 1 method)

julia> X    = randn(100_000_000);

julia> Y    = similar(X);

julia> function farr!(y, x)
           y .= f.(x)
           return nothing
       end
farr! (generic function with 1 method)

julia> function farr_th!(y, x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i] = f(x[i])
           end
           return nothing
       end
farr_th! (generic function with 1 method)

julia> xc = CuArray(X);

julia> yc = CuArray(Y);

julia> @time farr!(Y, X) #Broadcast in CPU
  6.336066 seconds (68.31 k allocations: 3.622 MiB, 0.36% compilation time)

julia> @time farr!(Y, X) #Broadcast in CPU
  7.044405 seconds

julia> @time farr_th!(Y, X) #4 Threads in CPU
  2.102914 seconds (35.83 k allocations: 1.827 MiB, 5.03% compilation time)

julia> @time farr_th!(Y, X) #4 Threads in CPU
  2.172290 seconds (22 allocations: 2.156 KiB)

julia> @time farr!(yc, xc) #Broadcast in Cuda GPU
  0.198106 seconds (69.00 k allocations: 4.755 MiB, 8.38% compilation time)

julia> @time farr!(yc, xc) #Broadcast in Cuda GPU
  0.000071 seconds (42 allocations: 736 bytes)

Though @turbo is faster than plain @threads

julia> function farr_turbo!(y, x)
           @turbo y .= f.(x)
           return nothing
       end
farr_turbo! (generic function with 1 method)

julia> @time farr_turbo!(Y, X) #Broadcast in CPU turbo
  2.754020 seconds (3.65 M allocations: 175.448 MiB, 1.31% gc time, 79.22% compilation time)

julia> @time farr_turbo!(Y, X) #Broadcast in CPU turbo
  0.602359 seconds

GPU test with move

julia> @time farr!(CuArray(Y), CuArray(X)) #Broadcast in Cuda GPU and include allocation
  1.474983 seconds (1.76 M allocations: 94.184 MiB, 75.76% compilation time: 24% of which was recompilation)

julia> @time farr!(CuArray(Y), CuArray(X)) #Broadcast in Cuda GPU and include allocation
  0.369701 seconds (62 allocations: 1.281 KiB, 8.23% gc time)

Better you test with CUDA.@time farr!(yc, xc). Or best even @btime CUDA.@sync farr!($yc, $xc). It looks like 71µs is too fast and merely measures the kernel launch.

For those kind of operations, also try Tullio.jl. It works with threads and without LoopVectorization.jl (if you load both, Tullio.jl will also use LV!).

Also, LoopVectorization.@tturbo does threading, fastmath, SIMD and some cache optimization. So it’s hard to beat naively.

3 Likes

Use @tturbo or @turbo thread=true to enable threading.

5 Likes

Wow, I had no idea this works. The manual is a bit unclear on this point, appearing to say juxtaposition works for numeric literals only. Integers and Floating-Point Numbers · The Julia Language

One can sort of guess it might work based on:

Additionally, parenthesized expressions can be used as coefficients to variables, implying multiplication of the expression by the variable

but it’s still a bit surprising to see the above expression, to me.

1 Like

As far as I know
It will multiply by anything after a close bracket ) except an open bracket which will be treated like a function. And you can put a number before anything. 10(1+1)sin(1)

Thanks @tturbo is awesome:)

Could @tturbo be used with TidierData ? something like

@chain begin
    
    DataFrame(X=X)
    
    @tturbo @mutate X2 = f(X)

end
2 Likes

You should use @fastmath to compare with @turbo

Yes - sorry it took me a while to see this.

If you define your function as:

f(x) = @tturbo sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

Then, to work with TidierData, all you need to do is:

@chain begin
    DataFrame(X=X)   
    @mutate X2 = f(X)
end

For awareness, the expression X2 = f(X) gets vectorized to X2 = f.(X) by TidierData before it is evaluated. Vectorization here is appropriate because f(x) is designed to operate on scalars. I haven’t benchmarked this, but it is totally doable.

If you want to keep the function f(x) defined as-is without including @tturbo in its definition, then another way to use this within TidierData is the following:

@chain begin
    DataFrame(X=X)
    @mutate(across(X, x -> @tturbo f.(x)))
end
2 Likes

Runs faster without the input type specification. I expected the other way round. Can anyone explain ?

X             = randn(50_000_000)
f(x::Float64) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
h(x)          = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

@time @tturbo f.(X)     # 2.0  seconds
@time @tturbo h.(X)     # 0.05 seconds
1 Like

I would have expected them to be identical. Why would you expect a difference?

I’m seeing similar differences, btw, or actually, just an 8x difference. (Julia v1.11.2)

2 Likes

So, this is what is happening here, though I do not know why:

julia> @turbo h.(X);  # works fine

julia> @turbo f.(X);
┌ Warning: #= line 0 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ LoopVectorization C:\Users\dnf\.julia\packages\LoopVectorization\XbdFG\src\condense_loopset.jl:1148

So loop vectorization actually fails for f, but not for h. Seems like a bug.

3 Likes

to narrow it down

R( x::Real)          = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
AF(x::AbstractFloat) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

@time @tturbo  R.(X)    # seems to use LoopVectorization
@time @tturbo AF.(X)    # seems to not use 
1 Like