Need fast Array operation!

```julia
function bicubic(
    N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},
```

Your implementation is pure Julia, and it's already near the performance ceiling. For your specific use case, even routing this kernel through BLAS is unlikely to help.

Why BLAS won’t move the needle here:

  • The kernel is a fixed 4×4 stencil (16 fused multiply-adds) per output plus an exp10. That granularity is too small—BLAS call overhead would dominate.
  • The access pattern is “gather 4×4 patches at (ix0[n]+i, iy0[n]+j) and accumulate over (k,l),” which isn’t GEMM-shaped or contiguous. Forcing it into AXPY/GEMM would hit strided, cache-unfriendly slices.
  • The real costs are scattered memory traffic and the exponential; BLAS doesn’t accelerate either. A tight vectorized microkernel (e.g., LoopVectorization) already emits near-roofline FMA code on CPU.
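The gather-and-accumulate pattern described above can be sketched in plain Julia. This is a hypothetical microkernel, not your code: the names (`stencil4x4!`, `out`, `src`, `wx`, `wy`, `ix0`, `iy0`) and the separable per-point weight layout are my assumptions for illustration.

```julia
# Hypothetical sketch of the 4x4 stencil gather-and-accumulate pattern.
# Assumes separable weights: wx[k, n] * wy[l, n] for output point n.
function stencil4x4!(out::Vector{Float64}, src::Matrix{Float64},
                     wx::Matrix{Float64}, wy::Matrix{Float64},
                     ix0::Vector{Int}, iy0::Vector{Int})
    @inbounds for n in eachindex(out)
        acc = 0.0
        for l in 1:4, k in 1:4
            # gather the 4x4 patch anchored at (ix0[n], iy0[n]) and
            # accumulate 16 fused multiply-adds
            acc = muladd(wx[k, n] * wy[l, n], src[ix0[n] + k, iy0[n] + l], acc)
        end
        out[n] = exp10(acc)   # the exponential mentioned above
    end
    return out
end
```

The indices `ix0[n] + k, iy0[n] + l` jump around `src` per output point, which is exactly why this can't be phrased as a contiguous GEMM; the inner 16-FMA loop is small enough that the compiler (or LoopVectorization's `@turbo`) can keep it entirely in registers.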

Given that, achieving parity with NumPy’s C ufunc path under strict single-thread settings on both sides is already an excellent outcome.
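For completeness, the strict single-thread setup on both sides can be enforced like this (a sketch; `bench.py` is a placeholder name, not a file from the thread):

```julia
using LinearAlgebra

# Pin Julia's BLAS backend to one thread; for Julia's own task
# threads, also launch with `julia -t 1`.
BLAS.set_num_threads(1)

# On the NumPy side, set the equivalent environment variables before
# launching, e.g.:
#   OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 MKL_NUM_THREADS=1 python bench.py
```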
