General questions from a Python user

In the course of my weekend procrastinations I contrived an example to check my claim: a home-made implementation of convolution in both languages.

First, make up some signal and kernel data in Python (an unimportant part, which is not benchmarked):

import numpy as np

def lorentz(x, gamma, mu=0):
    return (gamma**2/(((x-mu)**2)+gamma**2))/(np.pi*gamma)

def signal(x):
    return lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)

def gauss(x, sigma, mu=0.0):
    return np.exp(-(((x-mu)/sigma)**2)/2.0)/(sigma*np.sqrt(2*np.pi))

def kernel(x):
    return gauss(x, 0.01)

# np.vectorize isn't strictly necessary here (the functions already accept
# NumPy arrays), but this setup isn't benchmarked anyway
vgauss = np.vectorize(gauss)
vlorentz = np.vectorize(lorentz)
vkernel = np.vectorize(kernel)
vsignal = np.vectorize(signal)

k_xmax = 0.05
s_xmax = 0.5

prec = 4
len_s = 1 + 10**prec     # number of signal samples
q = 2*s_xmax/10**prec    # grid spacing
len_k2 = int(k_xmax/q)   # kernel half-width in samples
len_k = 1 + 2*len_k2     # kernel length (odd)

s_x = np.linspace(-s_xmax, s_xmax, len_s)
k_x = np.linspace(-k_xmax, k_xmax, len_k)

s_y = vsignal(s_x)
k_y = vkernel(k_x)

Now the same in Julia:

lorentz(x, γ, μ=0) = (γ^2/((x-μ)^2+γ^2))/(π*γ)
signal(x) = lorentz(x, 0.1, -0.3) + lorentz(x, 0.03, 0.1) + lorentz(x, 0.2, 0.25)

gauss(x, σ, μ=0.0) = exp(-(((x-μ)/σ)^2)/2)/(σ*sqrt(2π))
kernel(x) = gauss(x, 0.01)

s_xmax = 0.5
k_xmax = 0.05

prec = 4
len_s = 1 + 10^prec
q = 2*s_xmax/10^prec
len_k2 = Int(k_xmax/q)  # Int() throws an InexactError if the division isn't exact; round(Int, ...) would be more robust
len_k = 1 + 2len_k2

s_x = LinRange(-s_xmax, s_xmax, len_s)
k_x = LinRange(-k_xmax, k_xmax, len_k)

s_y = signal.(s_x)
k_y = kernel.(k_x)

Now the convolution in Python:

def convolve1(s_y, k_y, len_s, len_k, q):
    r = np.zeros_like(s_y)
    k_rev = np.flip(k_y)
    len_k2 = int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in range(fst, lst + 1):  # inclusive upper bound, to match Julia's fst:lst
        r[i-1] = q * np.sum(s_y[i-len_k2:i-len_k2+len_k] * k_rev)
    return r

and in Julia:

function convolve1(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in fst:lst
        r[i] = q * sum(s_y[i-len_k2+1:i-len_k2+len_k] .* k_rev)
    end
    return r
end

In fact I of course wrote the Julia version first, but let’s pretend the code was ported from Python to Julia. As to the code: I can’t guarantee it’s mathematically or otherwise correct, but we don’t need no correct convolution - just some identical code in both languages, reasonably vectorized with NumPy on the Python side - which I believe it is.

Let’s test:

def cnv1():
    return convolve1(s_y, k_y, len_s, len_k, q)

import timeit

tm = np.amin(timeit.repeat(cnv1, repeat=50, number=1))

print("tm = ",  tm)
# tm =  0.062673
julia> using BenchmarkTools

julia> @btime convolve1(s_y, k_y, len_s, len_k, q);
  24.115 ms (18003 allocations: 140.71 MiB)

Julia is about 2.6 times faster - not bad! But we still see a lot of allocations, meaning there is room for improvement: every loop iteration allocates twice, once for the slice (which copies in Julia) and once for the .* product, which over the 9000 iterations accounts for virtually all of the 18003 allocations reported. As mentioned in one of the posts above, let’s use views instead of slices:

function convolve2(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in fst:lst
        r[i] = q * sum(view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev)
    end
    return r
end

julia> @btime convolve2(s_y, k_y, len_s, len_k, q);
  13.560 ms (9003 allocations: 70.40 MiB)
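
By the way, the same function can be written with the @views macro, which turns every array slice in the annotated expression into a view. A sketch (convolve2_views is just my name for it, not benchmarked separately):

function convolve2_views(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in fst:lst
        # @views turns the slice on the right-hand side into a view,
        # so this line does exactly what convolve2 does with view(...)
        @views r[i] = q * sum(s_y[i-len_k2+1:i-len_k2+len_k] .* k_rev)
    end
    return r
end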

A substantial improvement: the slice copies are gone, but the .* product still allocates a fresh array on every iteration, hence the remaining 9003 allocations. Next try:

function convolve3(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    t = similar(k_y) # pre-allocate a buffer
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in fst:lst
        t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
        r[i] = q * sum(t)
    end
    return r
end

julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
  1.860 ms (4 allocations: 94.27 KiB)

That looks good! Though I do wish the compiler were smart enough to do the preallocation on its own here.
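
In this particular case the buffer can even be avoided altogether: multiplying elementwise and then summing is just a dot product, and dot from the standard LinearAlgebra library computes it without materializing any temporary array. A sketch (convolve3_dot is my name; I haven’t benchmarked it here):

using LinearAlgebra

function convolve3_dot(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    for i in fst:lst
        # dot computes sum(x .* y) without materializing the product
        r[i] = q * dot(view(s_y, i-len_k2+1:i-len_k2+len_k), k_rev)
    end
    return r
end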

It can probably be made faster still by some kind of multithreading, as the for loop here lends itself perfectly to parallel execution. Let’s try it:
Edit: don’t try it - see the posts below concerning race conditions

function convolve4(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    t = similar(k_y) # one buffer shared by all threads - this is the race condition the edit above warns about
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    Threads.@threads for i in fst:lst
        t .= view(s_y, i-len_k2+1:i-len_k2+len_k) .* k_rev
        r[i] = q * sum(t)
    end
    return r
end

julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
  4.737 ms (25 allocations: 97.73 KiB)

It got worse :frowning_face: Maybe the dataset is too small for parallel processing to pay off? Let’s make it 10 times bigger:

prec = 5  # and regenerate s_x, k_x, s_y and k_y as above

julia> @btime convolve3(s_y, k_y, len_s, len_k, q);
  305.176 ms (6 allocations: 937.92 KiB)

julia> @btime convolve4(s_y, k_y, len_s, len_k, q);
  216.561 ms (27 allocations: 941.39 KiB)

Here we did indeed get some improvement from multithreading. Can it be improved further?
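
And for the record, regarding the race condition mentioned in the edit above: it comes from the single buffer t that all threads share. Once the buffer is gone - for instance with dot from LinearAlgebra, as sketched earlier - each iteration touches only its own r[i] and the threaded loop becomes safe. A sketch (convolve4_safe is my name, timings not measured):

function convolve4_safe(s_y, k_y, len_s, len_k, q)
    r = zeros(len_s)
    k_rev = reverse(k_y)
    len_k2 = Int((len_k - 1) / 2)
    fst = 1 + len_k2
    lst = len_s - len_k2 - 1
    Threads.@threads for i in fst:lst
        # no shared mutable buffer: dot allocates nothing, and each
        # iteration writes only its own r[i]
        r[i] = q * dot(view(s_y, i-len_k2+1:i-len_k2+len_k), k_rev)
    end
    return r
end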

Now that we’ve made it much better, let’s try to make it much worse :wink: Let’s feed our first implementation not arrays of floats, but arrays of Any. This is not as contrived as some would probably claim: the data could come e.g. from a file or some other source without a type guarantee.

s_y_any = Array{Any,1}(undef, len_s)
s_y_any .= s_y
k_y_any = Array{Any,1}(undef, len_k)
k_y_any .= k_y

julia> @btime convolve1(s_y_any, k_y_any, len_s, len_k, q);
  476.886 ms (13499992 allocations: 347.25 MiB)

That’s about 8 times slower than Python: with element type Any every intermediate arithmetic result has to be boxed, hence the 13.5 million allocations.
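
If the data really does arrive without a type guarantee, the usual cure is to convert it once at the boundary, so that the hot loop again runs on concrete Float64 arrays - a sketch:

# convert once at the boundary; from here on convolve1 sees concrete
# Float64 arrays and compiles to the fast method again
s_y_f64 = Float64.(s_y_any)
k_y_f64 = Float64.(k_y_any)
r = convolve1(s_y_f64, k_y_f64, len_s, len_k, q)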
