Need fast Array operation!

Hi all, I have moved over from Python, where I was using a bicubic interpolation in NumPy. The code for that is below:

import numpy as np
kernel_x = np.random.rand(100,4)
kernel_y = np.random.rand(100,4)
ix0 = np.random.randint(0, 37, size=(100,))
iy0 = np.random.randint(0, 17, size=(100,))
kcoeff = np.random.rand(40,20,5011,10)
def bicubic_numpy(n_layer,kernel_x,kernel_y,ix0,iy0,kcoeff):

    kcoeff_intp = np.zeros((n_layer,5011,10))
    for n in range(n_layer):
        for i in range(4):
            for j in range(4):
                kcoeff_intp[n,:,:] += kernel_x[n,i] * kernel_y[n,j] * kcoeff[ix0[n] + i, iy0[n] + j,:,:]    ### removed -1 term after j
    return np.exp(np.log(10.0) * kcoeff_intp)

Timing the above function:

%timeit bicubic_numpy(100,kernel_x,kernel_y,ix0,iy0,kcoeff)

178 ms ± 258 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

When I moved the same problem to Julia, my naive implementation looked like this:

function my_bicubic_implementation!(cache::Array{Float64}, N_layer::Int64, kernel_x::Matrix{Float64},kernel_y::Matrix{Float64},ix0::Vector{Int64},iy0::Vector{Int64},kcoeff::Array{Float64})
    
    for n in 1:N_layer
        for i in 1:4
            for j in 1:4
                for k in 1:size(kcoeff,3)
                    for l in 1:size(kcoeff,4)
                        cache[n,k,l] = cache[n,k,l] + kernel_x[n,i] * kernel_y[n,j] * kcoeff[ix0[n]+i, iy0[n]+j, k, l]

                    end
                end
            end
        end
    end
    @. cache = 10.0^cache

end

Benchmarking the above implementation with BenchmarkTools on random arrays,

kcoeff_intp = zeros(Float64, 100, 5011, 10)
kernel_x = rand(100, 4)
kernel_y = rand(100, 4)
ix0 = rand(1:37, 100)   # subtract 1 if you want 0-based like NumPy
iy0 = rand(1:17, 100)
kcoeff = rand(40, 20, 5011, 10)

gives a time of 3 s, which was obviously very slow. After applying some performance tips from others on Discord, I was able to get it to the same level as NumPy. Below is the final version of the code (I used the @turbo macro (1.5x improvement), used an accumulator variable (3x improvement), and changed the index ordering of the outer loops to respect Julia's column-major layout (4x improvement)).

function bicubic(
  N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},
  ix0::AbstractVector{<:Integer}, iy0::AbstractVector{<:Integer},
  kcoeff::AbstractArray{F, 4}
) where {F<:AbstractFloat}
  klim, llim = size(kcoeff, 3), size(kcoeff, 4)
  buf = zeros(F, N_layer, klim, llim)
  
  @inbounds for l in 1:llim, k in 1:klim, n in 1:N_layer
    acc = zero(F)                                       # <= accumulator variable
    @turbo for i in 1:4, j in 1:4
      acc += kernel_x[n, i] * kernel_y[n, j] * kcoeff[ix0[n] + i, iy0[n] + j, k, l]
    end
    buf[n, k, l] = exp10(acc)                           # <= `exp10` instead of `10 ^`
  end
  
  return buf
end

This runs in 160 ms, which is of the same order as NumPy. This is now satisfactory for me, but it would be great if someone could give tips on how to improve it further. That would help a lot.
Thanks!

3 Likes

Have you used constant inputs to verify you’re getting the same results? These don’t look equivalent at first glance, even accounting for the index ordering change.
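For example, a constant-input check of the two Julia versions against each other could look like this (just a sketch with made-up constants, small trailing dimensions, and throwaway variable names, assuming both functions above are already defined):

kx = fill(0.1, 100, 4); ky = fill(0.2, 100, 4)
jx = fill(5, 100); jy = fill(5, 100)         # constant offsets, safely in bounds
kc = fill(0.3, 40, 20, 7, 3)                 # small trailing dims keep the check cheap
cache = zeros(100, 7, 3)
my_bicubic_implementation!(cache, 100, kx, ky, jx, jy, kc)
cache ≈ bicubic(100, kx, ky, jx, jy, kc)     # expect true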

1 Like

Yes, I have checked both the NumPy and Julia implementations by passing the same random arrays from NumPy to Julia using the NPZ package, and found them to be the same.

1 Like

Maybe using the GPU could speed things up?

Unfortunately there’s an error in the Julia implementations that needs your clarification. If you remove the @inbounds and @turbo (implied inbounds) annotations in order to restore array bounds safety, you can hit an error:

julia> @benchmark bicubic($100, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff)
ERROR: BoundsError: attempt to access 40×20×5011×10 Array{Float64, 4} at index [12, 21, 1, 1]

So in the @inbounds+@turbo implementation, you were reading some elements from outside the array’s bounds, which in practice could mean reading outside the array’s buffer or picking up unrelated elements. I have no idea how you’re reaching the same result as the NumPy implementation; NumPy should have bounds checks.

The dimensions heavily implicate this expression:

kcoeff[ix0[n] + i, iy0[n] + j, k, l]

iy0[n] + j reaches 21, which exceeds the 20 in kcoeff’s second dimension. In moving from Python to Julia you raised the maximum of j from 3 to 4 and the maximum of iy0 from 16 to 17, adding 2 to the summed index, while switching kcoeff to 1-based indexing only accounts for an increase of 1. An adjustment is needed here.
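A quick numeric check, using the arrays generated above (a sketch):

maximum(ix0) + 4, size(kcoeff, 1)   # at worst 37 + 4 = 41, but the first dimension is only 40
maximum(iy0) + 4, size(kcoeff, 2)   # at worst 17 + 4 = 21, but the second dimension is only 20
checkbounds(Bool, kcoeff, maximum(ix0) + 4, maximum(iy0) + 4, 1, 1)   # false whenever the worst case is hit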

4 Likes

using BenchmarkTools
using Random
using LinearAlgebra
using Base.Threads

# --- demo data (reproducible) ---
Random.seed!(42)
kernel_x = rand(100, 4)
kernel_y = rand(100, 4)
ix0 = rand(1:37, 100)   # ensures ix0+0:3 ∈ 1:40
iy0 = rand(1:17, 100)   # ensures iy0+0:3 ∈ 1:20
kcoeff = rand(40, 20, 5011, 10)  # Float64

# Optional: avoid oversubscription if you also use Julia threads
# BLAS.set_num_threads(1)

function bicubic_julia_optimized(n_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
    # Result: n_layer × 5011 × 10
    kcoeff_intp = zeros(eltype(kcoeff), n_layer, size(kcoeff,3), size(kcoeff,4))

    @inbounds @views Threads.@threads for n in 1:n_layer
        # 4×4 weight matrix as an outer product of the kernel rows (the slices are views, not copies)
        W = kernel_x[n, :] * (kernel_y[n, :])'
        accv = vec(view(kcoeff_intp, n, :, :))  # 5011*10 vector

        x0 = ix0[n]; y0 = iy0[n]
        @inbounds for di in 0:3
            ix = x0 + di
            for dj in 0:3
                iy = y0 + dj
                w = W[di+1, dj+1]
                @inbounds if w != 0.0
                    srcv = vec(view(kcoeff, ix, iy, :, :))  # 5011*10 vector
                    BLAS.axpy!(w, srcv, accv)              # accv .+= w .* srcv
                end
            end
        end
    end

    # In-place base-10 exponential
    map!(exp10, kcoeff_intp, kcoeff_intp)
    return kcoeff_intp
end

# --- quick sanity run & benchmark ---
n_layer = size(kernel_x, 1)
@btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);

34.272 ms (210 allocations: 38.25 MiB)

Performance substantially degrades if this is called in a threaded environment, even with the optional adjustment to avoid oversubscription.

2 Likes

Thanks @Benny for pointing out this out-of-bounds issue. I did some checks and found the following:

  1. For the NumPy comparison, the arrays that I am importing have ix0 values from 0 to 36 and iy0 values from 0 to 16 (the minimum is zero), so when the Julia loops add i, j in 1:4 the indices go at most to 40 and 20, which is within bounds. So that comparison is correct, hence the matching result.

  2. When I generate random arrays in Julia itself using ix0 = rand(1:37, 100), the maximum index generated is 37, so the bicubic function should throw an error, since 37 + 4 = 41 is out of bounds. But I don’t know why that is not happening when I run it; no error is thrown. I am aware this calculation might be wrong, but I am not comparing it against anything.

1 Like

Because you opted out of the bounds checks with @inbounds and @turbo; there is a small illustration of that after the snippet below. I suppose the Julia way to match the NumPy arrays would be:

ix0 = rand(0:36, 100)
iy0 = rand(0:16, 100)

which should be fine because nothing is indexed with ix0[n] alone.
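As for why no error is raised, here is a tiny self-contained illustration of what @inbounds does (a throwaway 3-element vector, nothing to do with the actual data):

v = [1.0, 2.0, 3.0]
unchecked(x) = @inbounds x[4]   # bounds check elided inside the function
checked(x)  = x[4]              # bounds check kept
unchecked(v)                    # no error: silently reads past the end, the result is undefined garbage
# checked(v)                    # would throw BoundsError: attempt to access 3-element Vector{Float64} at index [4]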

How are you verifying that the NumPy and Julia functions have the same result from the same inputs, exactly? It’s very unusual to get the same result from an inbounds calculation (NumPy) and out-of-bounds mistakes (Julia).

2 Likes

function bicubic(
N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},

Your Julia implementation is already the most optimized version. I’m not sure if your Python code runs on a single thread. I ran a single-threaded CPU test on my computer, and Julia is slightly faster than Python.

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["BLIS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import time
import statistics as stats
import numpy as np

def bicubic_numpy(n_layer, kernel_x, kernel_y, ix0, iy0, kcoeff):
    kcoeff_intp = np.zeros((n_layer, 5011, 10), dtype=kcoeff.dtype)
    for n in range(n_layer):
        for i in range(4):
            for j in range(4):
                kcoeff_intp[n, :, :] += (
                    kernel_x[n, i] * kernel_y[n, j] * kcoeff[ix0[n] + i, iy0[n] + j, :, :]
                )
    return np.exp(np.log(10.0) * kcoeff_intp)

def main():
    N_layer = 100
    kernel_x = np.random.rand(100,4)
    kernel_y = np.random.rand(100,4)
    ix0 = np.random.randint(0, 37, size=(100,))
    iy0 = np.random.randint(0, 17, size=(100,))
    kcoeff = np.random.rand(40,20,5011,10)

    y = bicubic_numpy(N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
    REPEAT = 20
    times = []
    for _ in range(REPEAT):
        t0 = time.perf_counter()
        y = bicubic_numpy(N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
        t1 = time.perf_counter()
        times.append(t1 - t0)

    print("\n== Python (NumPy) Single Thread ==")
    print(f"runs: {REPEAT}")
    print(f"min   : {min(times)*1e3:10.3f} ms")
    print(f"median: {stats.median(times)*1e3:10.3f} ms")
    print(f"mean  : {stats.mean(times)*1e3:10.3f} ms")

main()

min : 86.584 ms

Julia version (bicubic): 83.648 ms (2 allocations: 38.23 MiB)

1 Like

using NPZ
arr = npzread("/mnt/data/user/workspace/randon_arr_comp_julia.npz")
kernel_x = arr["kernel_x"]
kernel_y = arr["kernel_y"]
ix0 = arr["ix0"]
iy0 = arr["iy0"]
kcoeff = arr["kcoeff"]

py_res = npzread("/mnt/data/user/workspace/my_code/main_pkg/result.npz")["res"]
res = bicubic(100, kernel_x, kernel_y, ix0, iy0, kcoeff)

isapprox(res, py_res; rtol=1e-5, atol=1e-8)

There is no bounds error in this comparison.

I saved these arrays from Python:

kernel_x = np.random.rand(100,4)
kernel_y = np.random.rand(100,4)
ix0 = np.random.randint(0, 37, size=(100,))
iy0 = np.random.randint(0, 17, size=(100,))
kcoeff = np.random.rand(40,20,5011,10)

I think this will not be out of bounds for Julia, as I discussed above in the thread.

Thanks @karei, my code is also single-threaded. Maybe my CPU is slow; that’s why I was getting ~170 ms using NumPy.

1 Like

Here you use ix0 and iy0 correctly starting at 0, so you’re doing the calculation correctly. It’s the original pure Julia ix0/iy0 that causes the bounds error.

NumPy does multithread when possible, in C I presume, though it’s not made clear which functions do. Julia is single-threaded by default except for LinearAlgebra’s underlying non-Julia BLAS library (BLAS.get_num_threads()). Julia multithreading requires starting the process with --threads=, and unless you’re using a package that does it for you, you have to write the multithreading parts yourself.
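For example, a minimal sketch (the thread counts in the comments are assumptions, not measurements from this thread):

using Base.Threads, LinearAlgebra
nthreads()               # Julia threads; fixed at startup, e.g. julia --threads=4 gives 4
BLAS.get_num_threads()   # BLAS threads are a separate, independently controlled count
out = zeros(8)
@threads for i in 1:8
    out[i] = i^2         # each iteration writes its own slot, so there is no data race
end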

1 Like

Thanks @technocrat for sharing this. This implementation seems to use multiple cores, but I want it to run on a single core only. All the implementations and timings that I have provided so far run on a single thread. When I set BLAS.set_num_threads(1) in your code, it seems to take 900 ms on my machine.

I know NumPy can use multiple threads, but in this case I set all BLAS- and MKL-related thread counts to 1 before starting the run, and I also kept an eye on htop, so I am sure that all the timings I am reporting, for both NumPy and Julia, are single-threaded.

For now, if you are facing the out-of-bounds issue, generate ix0 and iy0 from 0:36 and 0:16 only; that should solve it. The other condition is that the code should be single-threaded. If anyone can improve the original version under this condition, it would be very helpful; I will verify that implementation against the NumPy arrays I have anyway.
Thanks!

NumPy multithreads in more ways than BLAS or MKL, so I would avoid attempting to level the playing field like this. Just run what comes naturally, as automatic multithreading is a feature of a library.

FWIW, this is the most naive Julia translation I could think of, if I’m recalling NumPy semantics correctly. The views reproduce NumPy’s basic slicing, and I ended up keeping the loop order the same because it turns out the indexing into kcoeff is the bottleneck on my machine. As discussed, there’s no multithreading here. I did the last line in-place to save a couple of allocations, but it makes a negligible difference in runtime.

function bicubic_numpy(n_layer,kernel_x,kernel_y,ix0,iy0,kcoeff)
    kcoeff_intp = zeros(n_layer,5011,10)
    for n in 1:n_layer, i in 1:4, j in 1:4
         @view(kcoeff_intp[n,:,:]) .+= (kernel_x[n,i] * kernel_y[n,j]) .* @view(kcoeff[ix0[n] + i, iy0[n] + j,:,:])
    end
    kcoeff_intp .= exp.(log(10.0) .* kcoeff_intp)
    kcoeff_intp
end

It has a more reasonable discrepancy from bicubic and passes the isapprox test too:

julia> using BenchmarkTools # assume methods already defined

julia> begin
       kcoeff_intp = zeros(Float64, 100, 5011, 10)
       kernel_x = rand(100, 4)
       kernel_y = rand(100, 4)
       ix0 = rand(0:36, 100)
       iy0 = rand(0:16, 100)
       kcoeff = rand(40, 20, 5011, 10)
       n_layer = 100
       isapprox(@btime(bicubic_numpy($n_layer,$kernel_x,$kernel_y,$ix0,$iy0,$kcoeff)),
                @btime(bicubic($n_layer,$kernel_x,$kernel_y,$ix0,$iy0,$kcoeff));
                rtol=1e-5, atol=1e-8)
       end
  822.078 ms (3 allocations: 38.23 MiB)
  157.877 ms (3 allocations: 38.23 MiB)
true
2 Likes

This is single-threaded:

julia> @btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);
  34.122 ms (210 allocations: 38.25 MiB)

julia> Threads.nthreads()
1

and, in fact, throwing more threads at it is counterproductive.

julia> @btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);
  361.190 ms (525 allocations: 38.29 MiB)

julia> Threads.nthreads()
64

Yes, it is single-threaded on your side, but it seems the BLAS calls inside the function are using multiple threads. When I set BLAS.set_num_threads(1), it becomes truly single-threaded, but then the runtime increases significantly.
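For reference, the adjustment is just:

using LinearAlgebra
BLAS.set_num_threads(1)   # forces the BLAS calls (the axpy! inside the loop) onto one thread
BLAS.get_num_threads()    # confirm it took effect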

2 Likes

Good catch. I see an eight-fold increase in runtime, also.

I also tested your implementation with the NumPy arrays and it passes the test, but it is very slow (~1 s in my case).