Understanding Tullio performance

Hi everyone! I’m new to Julia and I’m trying to translate some JAX code into Julia using the Tullio macro, which keeps the code simple and clean.
However, I’m not able to get the same performance, even with simple functions.

Here’s the minimal JAX working example and the corresponding timings on my laptop:

import jax
jax.config.update("jax_enable_x64", True)
jax.config.update('jax_platform_name', 'cpu')
import jax.numpy as jnp
from flax import struct  

@struct.dataclass
class TPLParams:
  x_low: float
  x_high: float
  alpha: float

@jax.jit
def pdf(x, params):
  # not normalized
  return jnp.where((params.x_low < x) & (x < params.x_high),
    x**params.alpha,
    0.)

@jax.jit
def cdf(x, params):
  # not normalized. if x = x_high the result is the normalization of the pdf
  return jnp.where(params.alpha == -1,
    jnp.log(x) - jnp.log(params.x_low),
    (x**(1 + params.alpha) - params.x_low**(1 + params.alpha)) / (1 + params.alpha)
  )

@jax.jit
def test(x, params):
  norm = cdf(params.x_high, params)
  return pdf(x, params) / norm

#test
p = TPLParams(x_low=1., x_high=10., alpha=2.)

x_scalar = 5.
x_vec = jnp.linspace(0,11,5000)
x_mat = jnp.array([x_vec for _ in range(300)])

# compile
test(x_scalar, p)
test(x_vec, p)
test(x_mat, p)

#timing 
%timeit test(x_scalar, p).block_until_ready()
10.6 μs ± 544 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit test(x_vec, p).block_until_ready()
74 μs ± 15.3 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit test(x_mat, p).block_until_ready()
5.41 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

To write the equivalent Julia code, I decided to write the pdf and cdf functions for scalar input and broadcast them using Tullio. This is a naive attempt at easily writing CPU/GPU-agnostic code, similar to JAX.

using LoopVectorization
using Tullio
using BenchmarkTools

abstract type Params end

struct TPLParams <: Params
  x_low::Float64
  x_high::Float64
  alpha::Float64
end

# functions

function pdf(x::Real, params::TPLParams)::Float64
  if params.x_low < x < params.x_high
    return x^params.alpha
  else
    return 0.0
  end
end
pdf(x::AbstractVector{<:Real}, params::Params) = @tullio result[i] := pdf(x[i], params)
pdf(x::AbstractArray{<:Real,2}, params::Params) = @tullio result[i, j] := pdf(x[i, j], params)

function cdf(x::Real, params::TPLParams)::Float64
  if params.alpha == -1
    return log(x) - log(params.x_low)
  else
    return (x^(1 + params.alpha) - params.x_low^(1 + params.alpha)) / (1 + params.alpha)
  end
end
cdf(x::AbstractVector{<:Real}, params::Params) = @tullio result[i] := cdf(x[i], params)
cdf(x::AbstractArray{<:Real,2}, params::Params) = @tullio result[i, j] := cdf(x[i, j], params)

function test(x::Union{Real,AbstractVector{<:Real},AbstractArray{<:Real,2}}, params::Params)
  norm = cdf(params.x_high, params)
  return pdf(x, params) / norm
end

#test
p = TPLParams(1., 10., 2.)

x_scalar = 5.
x_vec = range(0,11,5000)
x_mat = repeat(x_vec', 300, 1)

# compile
test(x_scalar, p)
test(x_vec, p)
test(x_mat, p)

However, the computational times are ~2 times slower than JAX, and the standard deviations are much larger in Julia than in JAX (especially in the x_mat case, which is the one I use most).
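For reference, the Julia timings were collected along these lines (a sketch using BenchmarkTools; the $-interpolation avoids timing global-variable access):

@btime test($x_scalar, $p);
@btime test($x_vec, $p);
@btime test($x_mat, $p);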

During the Julia execution I also get this LoopVectorization warning:

┌ Warning: #= /home/mt/.julia/packages/Tullio/2zyFP/src/macro.jl:1093 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/ImqiY/src/condense_loopset.jl:1166

Here’s my system information:

Hostname: 	mt-ThinkPad-P14s-Gen-3
CPU(s): 	1 x 12th Gen Intel(R) Core(TM) i5-1240P
CPU target: 	alderlake

∘ CPU 1: 
	→ 12 cores (16 CPU-threads due to 2-way SMT)
	→ 8 "efficiency cores", 4 "performance cores".
	→ 1 NUMA domain

Detected GPUs: 	1
Notebook launched with 16 threads

Can someone help me understand where the performance differences come from and how to make the two codes equally fast? Thanks!

Can you try

 ifelse(params.x_low < x < params.x_high,
    x^params.alpha,
    0.0
    )

to avoid control flow? If it still warns, can you replace
@tullio result[i] := cdf(x[i], params) with the actual function body,
@tullio result[i] := ifelse(.....)?
Never mind, I think the issue is the control flow and the pdf dispatch. I avoided it like this:

function pdf(x::Real, params::TPLParams)
    return ifelse(params.x_low < x < params.x_high,
                  x^params.alpha,
                  0.0)
end

function pdf(x::AbstractVector{<:Real}, params::Params)
    @tullio result[i] := ifelse(params.x_low < x[i] < params.x_high, x[i]^params.alpha, 0.0)
    return result
end

function pdf(x::AbstractArray{<:Real,2}, params::Params)
    @tullio result[i, j] := ifelse(params.x_low < x[i, j] < params.x_high,
                                   x[i, j]^params.alpha,
                                   0.0)
    return result
end

and LV seems to work. This may be because LV attempts to pass its SIMD vector types into pdf and the dispatch doesn’t work well, but I’m not sure.


Thanks for the reply! I’ve tried your solution, but the timings are still not as good as JAX in the “matrix” case.

Moreover, this possible solution does not fit my situation well, because I will have to implement other PDFs defined by other subtypes of Params. With my original code I simply rewrite the scalar implementation and leave the Tullio-broadcasted methods untouched, whereas written this way I also have to write the broadcasted functions for each model.
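In other words, the pattern I would like to keep is roughly this (a sketch based on my original code; only the scalar method changes per model):

# one scalar method per Params subtype, written with ifelse to avoid control flow
@inline pdf(x::Real, params::TPLParams) = ifelse(params.x_low < x < params.x_high, x^params.alpha, 0.0)

# generic broadcasted methods, written once for all Params subtypes
pdf(x::AbstractVector{<:Real}, params::Params) = @tullio result[i] := pdf(x[i], params)
pdf(x::AbstractArray{<:Real,2}, params::Params) = @tullio result[i, j] := pdf(x[i, j], params)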

I see, did you try Reactant.jl? If MLIR is where the magic happens, that’s what should be used here. Also, I’m not sure LV is used in the benchmark you show: did you get the warning messages?

Try this code. The time decreased from 10.8 ms to 5.8 ms on my computer.

function pdf!(result, x::AbstractArray{<:Real}, params::Params)
    @turbo for i in eachindex(x)
        result[i] = ifelse(params.x_low < x[i] < params.x_high,
            x[i]^params.alpha,
            0.0
        )
    end
    return result
end

function cdf(x::Union{Real,AbstractArray{<:Real}}, params::Params)
    # accepts both the scalar params.x_high passed from test! and arrays
    return if params.alpha == -1
        log.(x ./ params.x_low)
    else
        @. (x^(1 + params.alpha) - params.x_low^(1 + params.alpha)) / (1 + params.alpha)
    end
end

function test!(result, x::Union{Real,AbstractVector{<:Real},AbstractArray{<:Real,2}}, params::Params)
    norm = cdf(params.x_high, params)
    pdf!(result, x, params)
    result ./= norm
    return result
end

const p = TPLParams(1., 10., 2.)
const x_scalar = 5.
const x_vec = range(0, 11, 5000)
const x_mat = repeat(x_vec', 300, 1)
const result = similar(x_mat)

# warmup
test!(result, x_mat, p)
@benchmark test!($result, $x_mat, $p)

Also, I’m not sure LV is used in the benchmark you show: did you get the warning messages?

With your code I did not get any warning message; with mine I did. Nevertheless, the code with Tullio is faster than the non-multithreaded broadcasted version, i.e. test.(x_vec, Ref(p)) or test.(x_mat, Ref(p)). I thus assume that some parallelization is going on with Tullio.
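One way to isolate the threading contribution (a sketch; threads=false is a Tullio option that disables its multithreading, and pdf_st is just an illustrative name):

pdf_st(x::AbstractArray{<:Real,2}, params::Params) = @tullio result[i, j] := pdf(x[i, j], params) threads=false

@btime pdf_st($x_mat, $p);  # single-threaded Tullio
@btime pdf($x_mat, $p);     # default (possibly threaded) Tullio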

I see, did you try Reactant.jl? If MLIR is where the magic happens, that’s what should be used here.

I tried to reproduce my example using Reactant, but the code gets too complicated and I had difficulties using multiple dispatch.
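For reference, the basic Reactant flow looks roughly like this (a sketch, assuming Reactant’s to_rarray/@compile API; the multiple-dispatch version is where it got complicated):

using Reactant

x_r = Reactant.to_rarray(x_mat)  # move the data into Reactant arrays
test_c = @compile test(x_r, p)   # trace and compile through MLIR, analogous to jax.jit
test_c(x_r, p)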

Thanks! This is a very good starting point. Based on your example I tried this:

using LoopVectorization 

@inline function pdf(x::Real, params::TPLParams)::Float64
    ifelse(params.x_low < x < params.x_high, x^params.alpha, 0.0)
end

function pdf1!(result, x::AbstractArray{<:Real}, params::Params)
  @turbo for i in eachindex(x)
      result[i] = ifelse(params.x_low < x[i] < params.x_high,
          x[i]^params.alpha,
          0.0
      )
  end
  return result
end

function pdf2!(result, x::AbstractArray{<:Real}, params::Params)
  @turbo for i in eachindex(x)
    result[i] = pdf(x[i], params)
  end
  return result
end

function cdf(x::Union{Real,AbstractArray{<:Real}}, params::Params)
  return if params.alpha == -1
    log.(x ./ params.x_low)
  else
    @. (x^(1 + params.alpha) - params.x_low^(1 + params.alpha)) / (1 + params.alpha)
  end
end

function test1!(result, x::Union{Real,AbstractVector{<:Real},AbstractArray{<:Real,2}}, params::Params)
  norm = cdf(params.x_high, params)
  pdf1!(result, x, params)
  result ./= norm
  return result
end

function test2!(result, x::Union{Real,AbstractVector{<:Real},AbstractArray{<:Real,2}}, params::Params)
  norm = cdf(params.x_high, params)
  pdf2!(result, x, params)
  result ./= norm
  return result
end


const par = TPLParams(1., 10., 2.)
const x_s = 5.
const x_v = range(0, 11, 5000)
const x_m = repeat(x_v', 300, 1)
const result = similar(x_m)

# warmup
test1!(result, x_m, par)
test2!(result, x_m, par)

The results are

@benchmark test1!($result, $x_m, $par)
BenchmarkTools.Trial: 935 samples with 1 evaluation per sample.
 Range (min … max):  5.021 ms …   7.832 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.186 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.341 ms ± 368.486 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▆▄▃▂▂▁      ▄▆▄▃▃▃▂      ▁                                 
  ██████████▇▇█▇███████▇█▆█▇██▆▇▇▆▅▆▅▁▆▇█▇▆▅▆▆▆▄▅▅▁▄▅▆▄▅▄▄▁▄▅ █
  5.02 ms      Histogram: log(frequency) by time      6.64 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

and

@benchmark test2!($result, $x_m, $par)
BenchmarkTools.Trial: 363 samples with 1 evaluation per sample.
 Range (min … max):  12.676 ms …  16.457 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.790 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.793 ms ± 492.885 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂       ▃▂▃▁ ▁ ▃▂▅█▄▅                                
  ▃▁▁▃▃▄▄▃▃█▆█▇▆▅▆▇██████▆██████▇█▄▄▄▃▁▁▃▃▁▃▁▁▁▃▁▁▃▃▁▄▃▁▁▃▁▁▁▃ ▄
  12.7 ms         Histogram: frequency by time         15.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, calling a function inside the loop, as in pdf2!, adds a large overhead (presumably the dispatch issue mentioned above: @turbo cannot push its SIMD vector types through the pdf call, so it falls back to a slower loop).
However, the problems with the first approach are:

  1. I’d like to write different pdf methods for different Params subtypes in the scalar case only, and use a single generic vectorized method that relies on Julia’s multiple dispatch to call the correct scalar pdf method. This reduces the code needed to implement the different models.
  2. My code should also run on GPU, and I’m not sure that @turbo for i in eachindex(x) works there. Is it necessary to use KernelAbstractions or AcceleratedKernels?

I hoped that Tullio was the correct solution for both problems.

Yeah, AK is definitely the way to go then (besides Reactant).
I wonder how this looks:

using BenchmarkTools
import AcceleratedKernels as AK
abstract type Params end

struct TPLParams <: Params
  x_low::Float64
  x_high::Float64
  alpha::Float64
end

# functions

@fastmath function pdf(x::Real, params::TPLParams)
  if params.x_low < x < params.x_high
    return x^params.alpha
  else
    return 0.0
  end
end
pdf!(res, x::AbstractVector{<:Real}, params::Params) = AK.map!(x -> pdf(x, params), res, x)
pdf!(res, x::AbstractArray{<:Real,2}, params::Params) = AK.map!(x -> pdf(x, params), res, x)

@fastmath function cdf(x::Real, params::TPLParams)
  if params.alpha == -1
    return log(x) - log(params.x_low)
  else
    return (x^(1 + params.alpha) - params.x_low^(1 + params.alpha)) / (1 + params.alpha)
  end
end
cdf!(res, x::AbstractVector{<:Real}, params::Params) = AK.map!(x -> cdf(x, params), res, x)
cdf!(res, x::AbstractArray{<:Real,2}, params::Params) = AK.map!(x -> cdf(x, params), res, x)

function test!(res, x::Union{Real,AbstractVector{<:Real},AbstractArray{<:Real,2}}, params::Params)
  norm = cdf(params.x_high, params)
  pdf!(res, x, params)
  AK.map!(x -> x / norm, res, res)
  AK.KernelAbstractions.synchronize(AK.get_backend(res))
  return res
end

#test
p = TPLParams(1., 10., 2.)

x_scalar = [5.0]
x_vec = range(0,11,5000) |> collect
x_mat = repeat(x_vec', 300, 1)

res1 = similar(x_scalar)
res2 = similar(x_vec)
res3 = similar(x_mat)

# benchmark (@btime handles the warmup)
@btime test!($res1,$x_scalar, $p);
@btime test!($res2,$x_vec, $p);
@btime test!($res3,$x_mat, $p);

If this is fine performance-wise, you can just add |> CuArray at the end of each of the x_… lines and you’re good to go for GPU. Note that it’s actually spawning 2 kernels, which JAX would auto-merge and which we could very easily do too with AK.map!(x -> pdf(x, params) * inn, res, x) where inn = 1/norm, but I wanted to keep the code closest to yours since you have a bigger project in mind.
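For reference, that fused single-kernel version might look like this (a sketch; test_fused! is just an illustrative name, and the normalization is inverted once on the host so the kernel only multiplies):

function test_fused!(res, x::AbstractArray{<:Real}, params::Params)
  inn = 1 / cdf(params.x_high, params)        # scalar, computed once
  AK.map!(x -> pdf(x, params) * inn, res, x)  # pdf and normalization in a single kernel
  AK.KernelAbstractions.synchronize(AK.get_backend(res))
  return res
end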

In my case:
JAX:

0.0095552 ms
0.0508866 ms
3.9619381 ms

Julia AK CPU:

  0.000040 ms 
  0.010800 ms
  0.986700 ms

CARE: I didn’t check whether AK.map!(x -> x/norm, res, res) is safe in place, but I guessed it is (each output element depends only on the input at the same index, so it should be).

I compared the code you provided with my previous code, and I am curious why the results look like this on my computer.

@benchmark test!($res3, $x_mat, $p)

AcceleratedKernels:
  1 thread:   7.117 ms (4 allocations: 80 bytes)
  16 threads: 872.995 μs (172 allocations: 14.08 KiB)

LoopVectorization:
  1 thread:   1.417 ms (0 allocations: 0 bytes)
  16 threads: 1.302 ms (0 allocations: 0 bytes)

This is because LV was not using threads, only SIMD; to use threads you need @tturbo, with two t’s.
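That is, the same loop with the threaded macro (a sketch; pdf_threaded! is just an illustrative name):

function pdf_threaded!(result, x::AbstractArray{<:Real}, params::Params)
  @tturbo for i in eachindex(x)  # @tturbo = @turbo + multithreading
    result[i] = ifelse(params.x_low < x[i] < params.x_high, x[i]^params.alpha, 0.0)
  end
  return result
end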

So LV was doing 8-wide SIMD on a single thread, while AK does no explicit SIMD but uses all the threads.

But for obvious reasons you can’t make LV code work on GPU (unless you use Tullio to dispatch, but then you lose performance).

Thank you, but I forgot to mention that when I tested LV with 16 threads, I did use @tturbo, and the result was 1.302 ms. There was not much improvement.

Yes, I see.

Ah yeah, the size of the array may be too small, and SIMD and threads are not composing well.

However, we already beat JAX with both (AK or LV)!


Thanks! This is of great help (even if on my laptop the timings are closer than in your case).

By the way, I tried this solution in a more complicated example and ran into a few other issues, but I’ve just opened a specific topic for that.