Need fast Array operation!

My time 822ms is close enough to 1s that I’d chalk it up to our different machines. As far as I can tell from my naive translation, the few extra allocations in the np.exp line had a negligible impact on runtimes relative to the total number-crunching, and bicubic being on par with NumPy indicates that the secret sauce was vectorization (in the SIMD sense, not the array operation API sense).

This was a reasonable outcome. NumPy/SciPy and their dependencies implement many optimizations, some top-of-the-line, so straightforward use of their functions on NumPy arrays is usually about as fast as practical. The issues really only come up when throwing pure-Python elementwise functions into the mix, e.g. via numpy.vectorize, in which case there are still Python-world compiler options like Numba and Cython.

To speak more broadly, many languages have an optimizing compiler like Julia does, and although that's essential for good performance, it isn't the only factor. An optimizing compiler can transform code quite a bit, but there are sensible limits when user intent is ambiguous. NumPy's optimizations, whether vectorization or the various multithreading schemes, are implemented by hand for supported platforms in various compiled dependencies and applied to carefully designed C loops over NumPy arrays; @turbo asks you to adhere to a list of restrictions on your loop, and otherwise it will blindly run into mistakes like the out-of-bounds overwriting. I'd say NumPy was actually the more convenient option in this case, and Julia's tendency to put optimizations in the user's hands warrants improvements to the ecosystem and to discoverability.
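
For illustration, here is a minimal sketch (not code from this thread) of the kind of loop @turbo is designed for: a rectangular iteration space, plain in-bounds indexing, and no early exits. Step outside those assumptions and, as noted above, it will happily write out of bounds.

using LoopVectorization

# Hypothetical loop (not from this thread) that satisfies @turbo's assumptions:
# rectangular iteration space, in-bounds unit-stride indexing, no branches or
# early exits in the body.
function scale_add!(y, a, x)
    @turbo for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end

scale_add!(rand(1000), 2.0, rand(1000))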

2 Likes

function bicubic(
N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},

Your current Julia code is a pure-Julia implementation, and it’s already near the performance ceiling. In your specific use case, even routing the current Julia kernel through BLAS is unlikely to help.

Why BLAS won’t move the needle here:

  • The kernel is a fixed 4×4 stencil (16 fused multiply-adds) per output plus an exp10. That granularity is too small; BLAS call overhead would dominate (see the sketch after this list).
  • The access pattern is “gather 4×4 patches at (ix0[n]+i, iy0[n]+j) and accumulate over (k,l),” which isn’t GEMM-shaped or contiguous. Forcing it into AXPY/GEMM would hit strided, cache-unfriendly slices.
  • The real costs are scattered memory traffic and the exponential; BLAS doesn’t accelerate either. A tight vectorized microkernel (e.g., LoopVectorization) already emits near-roofline FMA code on CPU.
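
To make that concrete, here is a rough sketch of the per-output work those bullets describe, using the thread's variable names with array layouts assumed from the code posted elsewhere in the thread: sixteen scattered multiply-adds and one exp10, with nothing GEMM-shaped to hand off to BLAS.

# Sketch only; layouts assumed: kernel_x/kernel_y are 4×N, kcoeff is indexed
# as kcoeff[k, l, ix, iy]. One output element costs 16 FMAs plus one exp10.
function bicubic_point(kernel_x, kernel_y, kcoeff, ix0n, iy0n, k, l, n)
    acc = zero(eltype(kcoeff))
    @inbounds for i in 1:4, j in 1:4
        acc = muladd(kernel_x[i, n] * kernel_y[j, n], kcoeff[k, l, ix0n + i, iy0n + j], acc)
    end
    return exp10(acc)
end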

Given that, achieving parity with NumPy’s C ufunc path under strict single-thread settings on both sides is already an excellent outcome.

5 Likes

I tried to write a kernel just to see how it does on the GPU

using KernelAbstractions
using BenchmarkTools

@kernel inbounds=true unsafe_indices=true function bicubic_kernel!(
    buf, N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff
)
    k, l, n = @index(Global, NTuple)

    klim, llim = size(kcoeff,1), size(kcoeff,2)

    @fastmath if k <= klim && l <= llim && n <= N_layer
        ix0n = ix0[n]
        iy0n = iy0[n]

        acc = zero(eltype(buf))
        @inbounds for i in 1:4, j in 1:4
            acc += kernel_x[i,n] * kernel_y[j,n] *
                   kcoeff[k, l, ix0n + i, iy0n + j]
        end

        buf[k, l, n] = exp10(acc)
    end
end

function bicubic2(N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
    klim, llim = size(kcoeff, 1), size(kcoeff, 2)
    # Allocate the output on the same backend (CPU or GPU) as kcoeff.
    buf = KernelAbstractions.allocate(get_backend(kcoeff), eltype(kcoeff), klim, llim, N_layer)
    # Launch over the full output array: one work-item per (k, l, n).
    bicubic_kernel!(get_backend(kcoeff))(buf, N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff; ndrange=size(buf))
    return buf
end
kcoeff_intp = zeros(Float64, 100, 5011, 10)
kernel_x = rand(4, 100)
kernel_y = rand(4, 100)
ix0 = rand(0:36, 100)   # subtract 1 if you want 0-based like NumPy
iy0 = rand(0:16, 100)
kcoeff = rand(5011, 10, 40, 20)
N_layer = 100
@btime bicubic2($N_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);

and surprisingly these are the best times I've gotten yet on both CPU and GPU:
39.512 ms (198 allocations: 38.25 MiB) on CPU and 2.641 ms (108 allocations: 3.19 KiB) on GPU. For the single-threaded case (i.e. groupsize = size(buf)) I get 109.986 ms (19 allocations: 38.23 MiB), which isn't so bad, though maybe still beatable.
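
For reference, the single-threaded variant is presumably launched along these lines (the exact call isn't shown above; KernelAbstractions takes the workgroup size when the kernel is instantiated, and one workgroup covering the whole ndrange leaves the CPU backend with a single task):

# Assumed invocation for the "groupsize = size(buf)" case mentioned above;
# one workgroup spanning the full ndrange runs everything in a single task on CPU.
backend = get_backend(kcoeff)
buf = KernelAbstractions.allocate(backend, eltype(kcoeff), size(kcoeff, 1), size(kcoeff, 2), N_layer)
single_wg_kernel! = bicubic_kernel!(backend, size(buf))
single_wg_kernel!(buf, N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff; ndrange=size(buf))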

2 Likes

As an aside, all of these type declarations do nothing for performance. They just make your code less generic (e.g. it doesn’t work in single precision).
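
A toy illustration (hypothetical functions, not the thread's code): both definitions run at the same speed for Float64 input, but only the unannotated one also accepts Float32.

sum_typed(kernel_x::Matrix{Float64}) = sum(kernel_x)      # annotation buys no speed
sum_generic(kernel_x::AbstractMatrix) = sum(kernel_x)     # same machine code for Float64 input

sum_generic(rand(Float32, 4, 100))    # works
# sum_typed(rand(Float32, 4, 100))    # MethodError: no method matching sum_typed(::Matrix{Float32})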

5 Likes

Sorry, but I was placed on hold for 24 hours as a new user :grin:. There is no multithreading in my version of the Julia implementation; you can verify this yourself by running it with an htop window open side by side, and you won't see more than one thread being used.
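
If opening htop is inconvenient, a quick in-process check works too (assuming the standard LinearAlgebra library for the BLAS side):

using LinearAlgebra

println("Julia threads: ", Threads.nthreads())        # 1 unless Julia was started with --threads
println("BLAS threads:  ", BLAS.get_num_threads())    # separate pool; BLAS.set_num_threads(1) pins it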