Broadcasting performance

Lincoln_Hannah · December 3, 2024, 6:56am

f(x) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
X    = randn(100_000_000)

@time f.(X)                          # 4  seconds  
@time @threads for x in X; f(x) end  # .8 seconds  16 threads
@time @turbo f.(X)                   # .4 seconds   
@time f.(ROCArray(X))                # .1 seconds  AMD GPU

@threads uses the whole CPU but @turbo is faster. Is there a way to get the speed up of @turbo plus use the whole CPU ?
GPU is super fast but ends up being slower if you include the time to move to the GPU and back to CPU.

tomaklutfu · December 3, 2024, 7:51am

You should try @threads inside a function.

julia> f(x) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
f (generic function with 1 method)

julia> X    = randn(100_000_000);

julia> Y    = similar(X);

julia> function farr!(y, x)
           y .= f.(x)
           return nothing
       end
farr! (generic function with 1 method)

julia> function farr_th!(y, x)
           Threads.@threads for i in eachindex(x)
               @inbounds y[i] = f(x[i])
           end
           return nothing
       end
farr_th! (generic function with 1 method)

julia> xc = CuArray(X);

julia> yc = CuArray(Y);

julia> @time farr!(Y, X) #Broadcast in CPU
  6.336066 seconds (68.31 k allocations: 3.622 MiB, 0.36% compilation time)

julia> @time farr!(Y, X) #Broadcast in CPU
  7.044405 seconds

julia> @time farr_th!(Y, X) #4 Threads in CPU
  2.102914 seconds (35.83 k allocations: 1.827 MiB, 5.03% compilation time)

julia> @time farr_th!(Y, X) #4 Threads in CPU
  2.172290 seconds (22 allocations: 2.156 KiB)

julia> @time farr!(yc, xc) #Broadcast in Cuda GPU
  0.198106 seconds (69.00 k allocations: 4.755 MiB, 8.38% compilation time)

julia> @time farr!(yc, xc) #Broadcast in Cuda GPU
  0.000071 seconds (42 allocations: 736 bytes)

Though @turbo is faster than plain @threads

julia> function farr_turbo!(y, x)
           @turbo y .= f.(x)
           return nothing
       end
farr_turbo! (generic function with 1 method)

julia> @time farr_turbo!(Y, X) #Broadcast in CPU turbo
  2.754020 seconds (3.65 M allocations: 175.448 MiB, 1.31% gc time, 79.22% compilation time)

julia> @time farr_turbo!(Y, X) #Broadcast in CPU turbo
  0.602359 seconds

GPU test with move

julia> @time farr!(CuArray(Y), CuArray(X)) #Broadcast in Cuda GPU and include allocation
  1.474983 seconds (1.76 M allocations: 94.184 MiB, 75.76% compilation time: 24% of which was recompilation)

julia> @time farr!(CuArray(Y), CuArray(X)) #Broadcast in Cuda GPU and include allocation
  0.369701 seconds (62 allocations: 1.281 KiB, 8.23% gc time)

roflmaostc · December 3, 2024, 8:24am

Better you test with CUDA.@time farr!(yc, xc). Or best even @btime CUDA.@sync farr!($yc, $xc). It looks like 71µs is too fast and merely measures the kernel launch.

For those kind of operations, also try Tullio.jl. It works with threads and without LoopVectorization.jl (if you load both, Tullio.jl will also use LV!).

Also, LoopVectorization.@tturbo does threading, fastmath, SIMD and some cache optimization. So it’s hard to beat naively.

Zentrik · December 3, 2024, 8:32am

Use @tturbo or @turbo thread=true to enable threading.

DNF · December 3, 2024, 9:19am

Wow, I had no idea this works. The manual is a bit unclear on this point, appearing to say juxtaposition works for numeric literals only. Integers and Floating-Point Numbers · The Julia Language

One can sort of guess it might work based on:

Additionally, parenthesized expressions can be used as coefficients to variables, implying multiplication of the expression by the variable

but it’s still a bit surprising to see the above expression, to me.

Lincoln_Hannah · December 3, 2024, 9:48am

As far as I know
It will multiply by anything after a close bracket ) except an open bracket which will be treated like a function. And you can put a number before anything. 10(1+1)sin(1)

Lincoln_Hannah · December 3, 2024, 7:42pm

Thanks @tturbo is awesome:)

Lincoln_Hannah · December 3, 2024, 10:56pm

Could @tturbo be used with TidierData ? something like

@chain begin
    
    DataFrame(X=X)
    
    @tturbo @mutate X2 = f(X)

end

photor · December 5, 2024, 2:39am

You should use @fastmath to compare with @turbo

kdpsingh · December 29, 2024, 6:24pm

Yes - sorry it took me a while to see this.

If you define your function as:

f(x) = @tturbo sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

Then, to work with TidierData, all you need to do is:

@chain begin
    DataFrame(X=X)   
    @mutate X2 = f(X)
end

For awareness, the expression X2 = f(X) gets vectorized to X2 = f.(X) by TidierData before it is evaluated. Vectorization here is appropriate because f(x) is designed to operate on scalars. I haven’t benchmarked this, but it is totally doable.

If you want to keep the function f(x) defined as-is without including @tturbo in its definition, then another way to use this within TidierData is the following:

@chain begin
    DataFrame(X=X)
    @mutate(across(X, x -> @tturbo f.(x)))
end

Lincoln_Hannah · January 6, 2025, 6:35am

Runs faster without the input type specification. I expected the other way round. Can anyone explain ?

X             = randn(50_000_000)
f(x::Float64) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
h(x)          = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

@time @tturbo f.(X)     # 2.0  seconds
@time @tturbo h.(X)     # 0.05 seconds

DNF · January 6, 2025, 8:11am

I would have expected them to be identical. Why would you expect a difference?

I’m seeing similar differences, btw, or actually, just an 8x difference. (Julia v1.11.2)

DNF · January 6, 2025, 8:18am

So, this is what is happening here, though I do not know why:

julia> @turbo h.(X);  # works fine

julia> @turbo f.(X);
┌ Warning: #= line 0 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ LoopVectorization C:\Users\dnf\.julia\packages\LoopVectorization\XbdFG\src\condense_loopset.jl:1148

So loop vectorization actually fails for f, but not for h. Seems like a bug.

Lincoln_Hannah · January 6, 2025, 9:39am

to narrow it down

R( x::Real)          = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)
AF(x::AbstractFloat) = sin(x)cos(x)tan(x)exp(x)sin(x)cos(x)tan(x)exp(x)

@time @tturbo  R.(X)    # seems to use LoopVectorization
@time @tturbo AF.(X)    # seems to not use

github.com/JuliaSIMD/LoopVectorization.jl

@turbo not doing LoopVectorization on Float64

opened 10:59PM - 06 Jan 25 UTC

Lincoln-Hannah

Seems to fail for Float64 and AbstractFloat but works for real and more general …types. Also, should it be able to work on structs as in the X1 example at the bottom. ``` X = randn(50_000_000) float64( x::Float64 ) = sin(x) abs_float( x::AbstractFloat ) = sin(x) real( x::Real ) = sin(x) any( x::Any ) = sin(x) @turbo float64.(X) # LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead @turbo abs_float.(X) # LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead @turbo real.(X) # works @turbo any.(X) # works struct X1; x::Float64 end fx1((;x)::X1) = sin(x) x1 = repeat([X1(1.0)],1000_000) @turbo fx1.( x1 ) # LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead ```

Topic		Replies	Views
@turbo on sets of operations Performance	15	1001	August 10, 2021
Arithmetic broadcasting in Julia 5x slower than MATLAB Performance	17	1070	May 26, 2022
What the differences of the 6 solutions please Performance	6	320	March 31, 2024
Arithmetic performance of expression Performance	11	371	October 4, 2022
Need help on improving performance of a small subfunction Performance	21	889	May 4, 2020

Broadcasting performance

Related topics