# Julia vs SciPy - Performance comparison and benchmark help

Hello all!

I’m dusting off a one-function package I made a while ago, but I want to confirm that the benchmarks are honest. There’s some calls to scipy objects, and I would like to know if I’m measuring any sort of unfair conversions in the code.

The code is from StagedFilters.jl, but it’s a single function here:

``````module StagedFilters

"""
Savitzky-Golay filter of window half-width M and degree
N. M is the number of points before and after to interpolate, i.e. the full
width of the window is 2M+1.
"""
abstract type AbstractStagedFilters end

struct SavitzkyGolayFilter{M,N} <: AbstractStagedFilters end
wrapL(i, n) = ifelse(1 ≤ i, i, i + n)
wrapR(i, n) = ifelse(i ≤ n, i, i - n)

"""
smooth!(filter,data, smoothed) -
apply `filter` to `data` writing result to `smoothed`.
Note that feeding `Int`s and not floats as data will result in a performance slowdown.
"""
@generated function smooth!(::Type{SavitzkyGolayFilter{M,N}}, data :: AbstractArray{T}, smoothed :: AbstractArray{S}) where {M,N,T,S}

J = T[(i - M - 1 )^(j - 1) for i = 1:2M + 1, j = 1:N + 1]
e₁ = [one(T); zeros(T,N)]
C = J' \ e₁
pre = :(for i = 1:\$M end)
main = :(for i = \$(M + 1):n - \$M end)
post = :(for i = n - \$(M - 1):n end)

for loop in (pre, main, post)
body = loop.args.args

idx = loop !== pre ? :(i - \$M) : :(wrapL(i - \$M, n))   # Manually start the first iteration. See the "false" branch below.

for j = reverse(1:M-1) # Because we bumped out the first iteration, we have to reduce the for loop index by one.
idx = loop !== pre ? :(i - \$j) : :(wrapL(i - \$j, n))
end

for j = 1:M
idx = loop !== post ? :(i + \$j) : :(wrapR(i + \$j, n))
push!(body, :( x = muladd(\$(C[M + 1 + j]), data[\$idx], x))) # muladd
end
push!(body, :(smoothed[i] = x))
end

last_expr = quote
n = length(data)
n == length(smoothed) || throw(DimensionMismatch())
@inbounds \$pre; @inbounds  \$main; @inbounds \$post
return smoothed
end

return last_expr = Base.remove_linenums!(last_expr)
end;

export SavitzkyGolayFilter, smooth!

end # module
``````

The benchmark script is the following:

``````using StagedFilters
using Test

using PyCall, BenchmarkTools;

N = 1_000_000

data = convert.(Float64,collect(range(1,N,length=N)));
smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.
savgol = pyimport("scipy.signal")."savgol_filter";
x = PyObject(data);

@info "Julia f64x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
@info "SciPy f64x\$N"
@btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");

data = convert.(Float32,collect(range(1,N,length=N)));
smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable\$
savgol = pyimport("scipy.signal")."savgol_filter";
x = PyObject(data);

@info "Julia f32x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
@info "SciPy f32x\$N"
@btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");
nothing;

``````

Which on my computer leads to:

``````[ Info: Julia f64x1000000
1.267 ms (0 allocations: 0 bytes)
[ Info: SciPy f64x1000000
11.432 ms (20 allocations: 1.02 KiB)
[ Info: Julia f32x1000000
491.619 μs (0 allocations: 0 bytes)
[ Info: SciPy f32x1000000
4.984 ms (20 allocations: 1.02 KiB)
``````

Which is a ~10x speedup over Scipy on arrays that big… HOWEVER, When I run the same script under `]test StagedFilters`, I get only a ~2x difference in performance. What gives?

1 Like

I’ve seen cases where benchmarking through PyCall would add some overhead compared to doing everything in Python (e.g., `%timeit` in IPython). The overhead maybe was due to accidental copies of arrays. To avoid any issues I’d simply benchmark Python code with Python itself and avoid any doubts about overhead introduced by PyCall.

3 Likes

Thanks, that’s not a bad idea. This is my attempt:

``````
In : data = [i / 1.0 for i in range(1000000)]

In : f = scipy.signal.savgol_filter

In : %%timeit
...: f(data, 5, 2, mode="wrap")
...:
...:
58 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

``````

And with `numpy.float32`:

``````
In : data = numpy.float32(data)

In : %%timeit
...: f(data, 5, 2, mode="wrap")
...:
...:
11.3 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

Wait, am I missing something or is that slower than benchmarking with Pycall? 2 Likes

That’s what I’m trying to figure out too - I find it weird that they can’t SIMD at all, even with numpy Float64s:

``````
In : data = numpy.float64(data)

In : %%timeit
...: f(data, 5, 2, mode="wrap")
...:
...:
12.3 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````

Dammit, I should use VectorizationBase.jl for the threaded AVX version…

Swapping out the `@inbounds \$main` for `LoopVectorization.jl`'s `@avxt \$main` gets a 2x speedup, thanks to the @Elrod magic:

``````
Testing Running tests...
[ Info: Julia Float64
1.007 ms (0 allocations: 0 bytes)
[ Info: Python Float64
6.750 ms (45 allocations: 7.63 MiB)
[ Info: Julia Float32
223.918 μs (0 allocations: 0 bytes)
[ Info: Python Float32
5.527 ms (45 allocations: 3.82 MiB)
``````

I’m using 8 threads, and I was hoping to see something closer to that level of speedup.

`timeit` used from IPython’s `%timeit` turns off GC during the benchmark. So, I suspect that the output of the scipy function has to re-allocate the new array and so it is always not in the cache. Not sure how much it is relevant, but maybe a fair comparison requires allocating a new “cold” array for each call to the Julia function without GC’ing the old ones.

(This may not be an issue in `@btime pycall(...)` but I’m not sure how cross-language GC interop implemented in PyCall and the complicated code generated by BenchmarkTools interact. It’s possible that the overall effect is similar to `timeit`.)

4 Likes

I get

Benchmark code (test/benchmarks.jl)
``````julia> using StagedFilters
[ Info: Precompiling StagedFilters [c9c6d8d2-eaa5-11e8-3089-811798346984]

julia> using Test

julia> using PyCall, BenchmarkTools;

julia> N = 1_000_000
1000000

julia> data = convert.(Float64,collect(range(1,N,length=N)));

julia> smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable code.

julia> savgol = pyimport("scipy.signal")."savgol_filter";
/home/chriselrod/.julia/conda/3/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)

julia> x = PyObject(data);

julia> @info "Julia f64x\$N"
[ Info: Julia f64x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
17.774 μs (0 allocations: 0 bytes)

julia> @info "SciPy f64x\$N"
[ Info: SciPy f64x1000000

julia> @btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");
4.074 ms (20 allocations: 1.02 KiB)

julia> data = convert.(Float32,collect(range(1,N,length=N)));

julia> smoothed = zeros(eltype(data),length(data)); # <--- wholesome, type stable\$

julia> savgol = pyimport("scipy.signal")."savgol_filter";

julia> x = PyObject(data);

julia> @info "Julia f32x\$N"
[ Info: Julia f32x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
8.338 μs (0 allocations: 0 bytes)

julia> @info "SciPy f32x\$N"
[ Info: SciPy f32x1000000

julia> @btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");
2.882 ms (20 allocations: 1.02 KiB)
``````
``````julia> @info "Julia f64x\$N"
[ Info: Julia f64x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
17.774 μs (0 allocations: 0 bytes)

julia> @info "SciPy f64x\$N"
[ Info: SciPy f64x1000000

julia> @btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");
4.074 ms (20 allocations: 1.02 KiB)

julia> @info "Julia f32x\$N"
[ Info: Julia f32x1000000

julia> @btime smooth!(SavitzkyGolayFilter{2,2}, \$data, \$smoothed);
8.338 μs (0 allocations: 0 bytes)

julia> @info "SciPy f32x\$N"
[ Info: SciPy f32x1000000

julia> @btime pycall(\$savgol, PyObject, \$x,5,2,mode="wrap");
2.882 ms (20 allocations: 1.02 KiB)

julia> 4.074e-3 / 17.774e-6
229.2112073815686

julia> 2.882e-3 / 8.338e-6
345.64643799472293
``````

So that’s a 200x and 300x speedup over Python.

I believe the reason I see a much larger speedup is

``````julia> using VectorizationBase

julia> N*sizeof(Float64)*2 / (VectorizationBase.num_l2cache() * VectorizationBase.cache_size(Val(2)))
0.8477105034722222
``````

Basically, 2 of these `Float64` arrays occupy about 85% of the total L2 cache when running with multiple threads on this computer.
For systems with less L2 cache, or if running with a single thread, this problem will be severely memory bound.

3 Likes

This should be pretty reasonable for benchmarking, since you pre-converted the input `x` and are suppressing the conversion of the output, so PyCall should not be adding any extra overhead.

2 Likes

`@avxt` timings:

``````julia> begin
@info "Julia f32x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data32, \$smoothed32);
@info "Julia f64x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data64, \$smoothed64);
end;
[ Info: Julia f32x1000000
8.263 μs (0 allocations: 0 bytes)
[ Info: Julia f64x1000000
17.468 μs (0 allocations: 0 bytes)
``````

`@inbounds @simd` timings:

``````julia> begin
@info "Julia f32x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data32, \$smoothed32);
@info "Julia f64x\$N"
@btime smooth!(SavitzkyGolayFilter{2,2}, \$data64, \$smoothed64);
end;
[ Info: Julia f32x1000000
245.119 μs (0 allocations: 0 bytes)
[ Info: Julia f64x1000000
493.014 μs (0 allocations: 0 bytes)
``````

`@avx`'s speed was about the same as `@inbounds @simd`.
That means I got nearly a 30x improvement from threading on an 18 core system:

``````julia> 493/17.5
28.17142857142857

julia> 245/8.263
29.650248093912623
``````

Which is quite good if you consider that `@avxt` is limiting itself to 18 threads (it will not use the full 36). Meaning I’m getting better than 1-to-1 scaling: 18 threads → 28x+ speedup.

As I said above, this problem is constrained by memory bandiwdth, and throwing enough cores at the problem lets the problem fit in their L2 caches.

The speedup would be much smaller if the arrays weren’t already hot in cache.

3 Likes

Thanks, I believe I looked at some of your materials before when I did the original benchmarking, so thanks for helping me cover all the bases.

I had to reread this a few times, but now I get the `throw enough cores at it``everyting fits in a smaller cache``superlinear speedup` analysis. Thanks for the benchmarking.

Now on to throwing the GPU at this.

Yeah, there is one L2 cache per core, so with enough cores it’ll fit in the L2 cache.
Cascadelake and Skylake-X also have 1 MiB L2 caches, which is much larger than in most consumer chips.

GPUs have much higher memory bandwidths than most CPUs, so this should do really well on a GPU.

3 Likes

In addition, `%%timeit` reports mean execution time, whereas `@btime` reports min time. Due to disabled GC this probably does not make a too large difference.
You could take the min time in Python e.g. By

``````def py_time(f, number=1_000):
timer = timeit.Timer(f)
time = min(timer.repeat(number=number))
return time/number
``````

Alternatively, use `@benchmark` to get minimum, median, mean and maximum.

6 Likes