Julia vs SciPy - Performance comparison and benchmark help

Hello all!

I’m dusting off a one-function package I made a while ago, and I want to confirm that the benchmarks are honest. There are some calls to SciPy objects, and I would like to know whether I’m measuring any sort of unfair conversions in the code.

The code is from StagedFilters.jl, but it’s a single function here:

module StagedFilters

"""
Savitzky-Golay filter of window half-width M and degree
N. M is the number of points before and after to interpolate, i.e. the full
width of the window is 2M+1.
"""
abstract type AbstractStagedFilters end

struct SavitzkyGolayFilter{M,N} <: AbstractStagedFilters end
wrapL(i, n) = ifelse(1 ≤ i, i, i + n)
wrapR(i, n) = ifelse(i ≤ n, i, i - n)

"""
smooth!(filter,data, smoothed) -
apply `filter` to `data` writing result to `smoothed`.
Note that feeding `Int`s and not floats as data will result in a performance slowdown.
"""
@generated function smooth!(::Type{SavitzkyGolayFilter{M,N}}, data :: AbstractArray{T}, smoothed :: AbstractArray{S}) where {M,N,T,S}

  # Design matrix for the window: row i holds the offset (i - M - 1) raised to powers 0..N.
  J = T[(i - M - 1)^(j - 1) for i = 1:2M + 1, j = 1:N + 1]
  e₁ = [one(T); zeros(T, N)]
  # Filter coefficients: the minimum-norm solution of J'C = e₁, i.e. the weights that
  # evaluate the least-squares polynomial fit at the center of the window.
  C = J' \ e₁
  pre = :(for i = 1:$M end)
  main = :(for i = $(M + 1):n - $M end)
  post = :(for i = n - $(M - 1):n end)

  for loop in (pre, main, post)
      body = loop.args[2].args

      idx = loop !== pre ? :(i - $M) : :(wrapL(i - $M, n))   # Peel off the first iteration manually so the accumulator x is initialized here.
      push!(body, :( x = muladd($(C[1]), data[$idx], $(zero(T))))) # Use `muladd` instead of separate multiplies and adds; note the coefficient index 1.

      for j = reverse(1:M-1) # Because we bumped out the first iteration, we have to reduce the for loop index by one.
          idx = loop !== pre ? :(i - $j) : :(wrapL(i - $j, n))
          push!(body, :( x = muladd($(C[M + 1 - j]),data[$idx],x))) # muladd
      end

      push!(body, :( x = muladd($(C[M + 1]), data[i], x))) # muladd

      for j = 1:M
          idx = loop !== post ? :(i + $j) : :(wrapR(i + $j, n))
          push!(body, :( x = muladd($(C[M + 1 + j]), data[$idx], x))) # muladd
      end
      push!(body, :(smoothed[i] = x))
  end

 last_expr = quote
          n = length(data)
          n == length(smoothed) || throw(DimensionMismatch())
          @inbounds $pre; @inbounds  $main; @inbounds $post
          return smoothed
  end

  return Base.remove_linenums!(last_expr)
end;

export SavitzkyGolayFilter, smooth!

end # module
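
For reference, the coefficient computation can be sanity-checked at the REPL: for the M = 2, N = 2 case used below, it should reproduce the textbook Savitzky-Golay weights [-3, 12, 17, 12, -3]/35 (a quick sketch outside the package):

M, N = 2, 2
J = Float64[(i - M - 1)^(j - 1) for i = 1:2M + 1, j = 1:N + 1]
e₁ = [1.0; zeros(N)]
C = J' \ e₁   # ≈ [-3, 12, 17, 12, -3] ./ 35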

The benchmark script is the following:

using StagedFilters
using Test

using PyCall, BenchmarkTools;

N = 1_000_000

data = convert.(Float64,collect(range(1,N,length=N)));
smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.
savgol = pyimport("scipy.signal")."savgol_filter";
x = PyObject(data);

@info "Julia f64x$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
@info "SciPy f64x$N"
@btime pycall($savgol, PyObject, $x,5,2,mode="wrap");

data = convert.(Float32,collect(range(1,N,length=N)));
smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.
savgol = pyimport("scipy.signal")."savgol_filter";
x = PyObject(data);

@info "Julia f32x$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
@info "SciPy f32x$N"
@btime pycall($savgol, PyObject, $x,5,2,mode="wrap");
nothing;

Which on my computer leads to:

[ Info: Julia f64x1000000
  1.267 ms (0 allocations: 0 bytes)
[ Info: SciPy f64x1000000
  11.432 ms (20 allocations: 1.02 KiB)
[ Info: Julia f32x1000000
  491.619 μs (0 allocations: 0 bytes)
[ Info: SciPy f32x1000000
  4.984 ms (20 allocations: 1.02 KiB)

Which is a ~10x speedup over SciPy on arrays that big… HOWEVER, when I run the same script under ]test StagedFilters, I only see a ~2x difference in performance. What gives?
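
As a side note, a quick agreement check along these lines (a sketch reusing the objects from the script above) should also rule out the two calls computing different things:

@test smooth!(SavitzkyGolayFilter{2,2}, data, smoothed) ≈ savgol(data, 5, 2, mode="wrap")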

1 Like

I’ve seen cases where benchmarking through PyCall added some overhead compared to doing everything in Python (e.g., %timeit in IPython). The overhead may have been due to accidental copies of arrays. To avoid any issues, I’d simply benchmark the Python code with Python itself and remove any doubts about overhead introduced by PyCall.

3 Likes

Thanks, that’s not a bad idea. This is my attempt:


In [18]: data = [i / 1.0 for i in range(1000000)]

In [19]: f = scipy.signal.savgol_filter

In [20]: %%timeit
    ...: f(data, 5, 2, mode="wrap")
    ...:
    ...:
58 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


And with numpy.float32:


In [25]: data = numpy.float32(data)

In [26]: %%timeit
    ...: f(data, 5, 2, mode="wrap")
    ...:
    ...:
11.3 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Wait, am I missing something, or is that slower than benchmarking with PyCall? :face_with_raised_eyebrow:

2 Likes

That’s what I’m trying to figure out too - I find it weird that they can’t SIMD at all, even with NumPy float64s:


In [27]: data = numpy.float64(data)

In [28]: %%timeit
    ...: f(data, 5, 2, mode="wrap")
    ...:
    ...:
12.3 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dammit, I should use VectorizationBase.jl for the threaded AVX version…

Swapping out the @inbounds $main for LoopVectorization.jl’s @avxt $main gets a 2x speedup, thanks to the @Elrod magic:
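
Concretely, the change is in the quote at the end of smooth! — something along these lines (a sketch; it assumes a `using LoopVectorization` at the top of the module):

 last_expr = quote
          n = length(data)
          n == length(smoothed) || throw(DimensionMismatch())
          @inbounds $pre; @avxt $main; @inbounds $post  # only the main loop is threaded/vectorized; the wrap-around edges keep @inbounds
          return smoothed
  end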


     Testing Running tests...
[ Info: Julia Float64
  1.007 ms (0 allocations: 0 bytes)
[ Info: Python Float64
  6.750 ms (45 allocations: 7.63 MiB)
[ Info: Julia Float32
  223.918 μs (0 allocations: 0 bytes)
[ Info: Python Float32
  5.527 ms (45 allocations: 3.82 MiB)

I’m using 8 threads, and I was hoping to see something closer to that level of speedup.

timeit, as used by IPython’s %timeit, turns off GC during the benchmark. So I suspect that the output array of the scipy function has to be allocated anew on every call and is therefore never in cache. Not sure how relevant this is, but maybe a fair comparison requires allocating a new “cold” output array for each call to the Julia function, without GC’ing the old ones.

(This may not be an issue in @btime pycall(...), but I’m not sure how the cross-language GC interop implemented in PyCall interacts with the complicated code generated by BenchmarkTools. It’s possible that the overall effect is similar to timeit.)
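
If one wanted to mimic that setup on the Julia side, something like this might get closer (a sketch: the setup phase allocates a fresh output buffer for every sample, and GC is disabled around the measurement):

GC.enable(false)
@btime smooth!(SavitzkyGolayFilter{2,2}, $data, out) setup=(out = similar($data)) evals=1;
GC.enable(true)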

4 Likes

I get

Benchmark code (test/benchmarks.jl)
julia> using StagedFilters
[ Info: Precompiling StagedFilters [c9c6d8d2-eaa5-11e8-3089-811798346984]

julia> using Test

julia> using PyCall, BenchmarkTools;

julia> N = 1_000_000
1000000

julia> data = convert.(Float64,collect(range(1,N,length=N)));

julia> smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.

julia> savgol = pyimport("scipy.signal")."savgol_filter";
/home/chriselrod/.julia/conda/3/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

julia> x = PyObject(data);

julia> @info "Julia f64x$N"
[ Info: Julia f64x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
  17.774 μs (0 allocations: 0 bytes)

julia> @info "SciPy f64x$N"
[ Info: SciPy f64x1000000

julia> @btime pycall($savgol, PyObject, $x,5,2,mode="wrap");
  4.074 ms (20 allocations: 1.02 KiB)

julia> data = convert.(Float32,collect(range(1,N,length=N)));

julia> smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.

julia> savgol = pyimport("scipy.signal")."savgol_filter";

julia> x = PyObject(data);

julia> @info "Julia f32x$N"
[ Info: Julia f32x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
  8.338 μs (0 allocations: 0 bytes)

julia> @info "SciPy f32x$N"
[ Info: SciPy f32x1000000

julia> @btime pycall($savgol, PyObject, $x,5,2,mode="wrap");
  2.882 ms (20 allocations: 1.02 KiB)
julia> @info "Julia f64x$N"
[ Info: Julia f64x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
  17.774 μs (0 allocations: 0 bytes)

julia> @info "SciPy f64x$N"
[ Info: SciPy f64x1000000

julia> @btime pycall($savgol, PyObject, $x,5,2,mode="wrap");
  4.074 ms (20 allocations: 1.02 KiB)

julia> @info "Julia f32x$N"
[ Info: Julia f32x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, $data, $smoothed);
  8.338 μs (0 allocations: 0 bytes)

julia> @info "SciPy f32x$N"
[ Info: SciPy f32x1000000

julia> @btime pycall($savgol, PyObject, $x,5,2,mode="wrap");
  2.882 ms (20 allocations: 1.02 KiB)

julia> 4.074e-3 / 17.774e-6
229.2112073815686

julia> 2.882e-3 / 8.338e-6
345.64643799472293

So that’s roughly a 230x and 345x speedup over Python.

I believe the reason I see a much larger speedup is

julia> using VectorizationBase

julia> N*sizeof(Float64)*2 / (VectorizationBase.num_l2cache() * VectorizationBase.cache_size(Val(2)))
0.8477105034722222

Basically, 2 of these Float64 arrays occupy about 85% of the total L2 cache when running with multiple threads on this computer.
For systems with less L2 cache, or if running with a single thread, this problem will be severely memory bound.
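
Spelled out, assuming (per that output) 18 L2 caches of 1 MiB each on this machine:

julia> 2 * 1_000_000 * sizeof(Float64) / (18 * 2^20)  # two 1M-element Float64 arrays vs. 18 MiB of total L2
0.8477105034722222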

3 Likes

This should be pretty reasonable for benchmarking, since you pre-converted the input x and are suppressing the conversion of the output, so PyCall should not be adding any extra overhead.
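
For contrast, calling the PyObject directly converts (copies) the returned NumPy array back into a Julia Array on every call, so something like this would include that extra overhead (a sketch):

julia> @btime $savgol($x, 5, 2, mode="wrap");  # output converted to a Julia Array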

2 Likes

@avxt timings:

julia> begin
       @info "Julia f32x$N"
       @btime smooth!(SavitzkyGolayFilter{2,2}, $data32, $smoothed32);
       @info "Julia f64x$N"
       @btime smooth!(SavitzkyGolayFilter{2,2}, $data64, $smoothed64);
       end;
[ Info: Julia f32x1000000
  8.263 μs (0 allocations: 0 bytes)
[ Info: Julia f64x1000000
  17.468 μs (0 allocations: 0 bytes)

@inbounds @simd timings:

julia> begin
       @info "Julia f32x$N"
       @btime smooth!(SavitzkyGolayFilter{2,2}, $data32, $smoothed32);
       @info "Julia f64x$N"
       @btime smooth!(SavitzkyGolayFilter{2,2}, $data64, $smoothed64);
       end;
[ Info: Julia f32x1000000
  245.119 μs (0 allocations: 0 bytes)
[ Info: Julia f64x1000000
  493.014 μs (0 allocations: 0 bytes)

@avx’s speed was about the same as @inbounds @simd.
That means I got nearly a 30x improvement from threading on an 18-core system:

julia> 493/17.5
28.17142857142857

julia> 245/8.263
29.650248093912623

That’s quite good when you consider that @avxt limits itself to 18 threads (it will not use all 36), meaning I’m getting better than 1-to-1 scaling: 18 threads → a 28x+ speedup.

As I said above, this problem is constrained by memory bandwidth, and throwing enough cores at it lets the working set fit in their combined L2 caches.

The speedup would be much smaller if the arrays weren’t already hot in cache.
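
A rough back-of-the-envelope check, treating each pass over the data as one full read plus one full write of the arrays:

2 * 1_000_000 * sizeof(Float64) / 493e-6 / 1e9   # ≈ 32 GB/s implied by the @inbounds @simd time — plausible for a single core hitting DRAM
2 * 1_000_000 * sizeof(Float64) / 17.5e-6 / 1e9  # ≈ 914 GB/s implied by the @avxt time — far above DRAM bandwidth, so the data must be served from cache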

3 Likes

Thanks, I believe I looked at some of your materials before when I did the original benchmarking, so thanks for helping me cover all the bases.

I had to reread this a few times, but now I get the “throw enough cores at it → everything fits in a smaller cache → superlinear speedup” analysis. Thanks for the benchmarking.

Now on to throwing the GPU at this.

Yeah, there is one L2 cache per core, so with enough cores it’ll fit in the L2 cache.
Cascade Lake and Skylake-X also have 1 MiB L2 caches, which is much larger than in most consumer chips.

GPUs have much higher memory bandwidths than most CPUs, so this should do really well on a GPU.

3 Likes

In addition, %%timeit reports the mean execution time, whereas @btime reports the minimum. Given that GC is disabled, this probably does not make too large a difference.
You could take the min time in Python, e.g. by:

import timeit

def py_time(f, number=1_000):
    # run several batches of `number` calls and report the best per-call time
    timer = timeit.Timer(f)
    time = min(timer.repeat(number=number))
    return time / number
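
For example, with f and data as in the IPython session above:

py_time(lambda: f(data, 5, 2, mode="wrap"))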

Alternatively, use @benchmark to get minimum, median, mean and maximum.

6 Likes