Need fast Array operation!

Hi all, I have moved over from Python, where I was using a bicubic interpolation in NumPy. The code for that is below:

import numpy as np
kernel_x = np.random.rand(100,4)
kernel_y = np.random.rand(100,4)
ix0 = np.random.randint(0, 37, size=(100,))
iy0 = np.random.randint(0, 17, size=(100,))
kcoeff = np.random.rand(40,20,5011,10)
def bicubic_numpy(n_layer,kernel_x,kernel_y,ix0,iy0,kcoeff):

    kcoeff_intp = np.zeros((n_layer,5011,10))
    for n in range(n_layer):
        for i in range(4):
            for j in range(4):
                kcoeff_intp[n,:,:] += kernel_x[n,i] * kernel_y[n,j] * kcoeff[ix0[n] + i, iy0[n] + j,:,:]    ### removed -1 term after j
    return np.exp(np.log(10.0) * kcoeff_intp)

Timing the above function:

%timeit bicubic_numpy(100,kernel_x,kernel_y,ix0,iy0,kcoeff)

178 ms ± 258 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

When I moved the same problem to Julia, my naive implementation looked like this:

function my_bicubic_implementation!(cache::Array{Float64}, N_layer::Int64, kernel_x::Matrix{Float64},kernel_y::Matrix{Float64},ix0::Vector{Int64},iy0::Vector{Int64},kcoeff::Array{Float64})
    
    for n in 1:N_layer
        for i in 1:4
            for j in 1:4
                for k in 1:size(kcoeff,3)
                    for l in 1:size(kcoeff,4)
                        cache[n,k,l] = cache[n,k,l] + kernel_x[n,i] * kernel_y[n,j] * kcoeff[ix0[n]+i, iy0[n]+j, k, l]

                    end
                end
            end
        end
    end
    @. cache = 10.0^cache

end

Benchmarking the above implementation with BenchmarkTools on random arrays,

kcoeff_intp = zeros(Float64, 100, 5011, 10)
kernel_x = rand(100, 4)
kernel_y = rand(100, 4)
ix0 = rand(1:37, 100)   # subtract 1 if you want 0-based like NumPy
iy0 = rand(1:17, 100)
kcoeff = rand(40, 20, 5011, 10)

gives a time of 3 s, which was obviously very slow. After applying some performance tips from others on Discord, I was able to get it to the same level as NumPy. Below is the final version of the code (I used the @turbo macro (1.5x improvement), used an accumulator variable (3x improvement), and changed the index ordering of the outer loops to respect Julia's column-major layout (4x improvement)).

function bicubic(
  N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},
  ix0::AbstractVector{<:Integer}, iy0::AbstractVector{<:Integer},
  kcoeff::AbstractArray{F, 4}
) where {F<:AbstractFloat}
  klim, llim = size(kcoeff, 3), size(kcoeff, 4)
  buf = zeros(F, N_layer, klim, llim)
  
  @inbounds for l in 1:llim, k in 1:klim, n in 1:N_layer
    acc = zero(F)                                       # <= accumulator variable
    @turbo for i in 1:4, j in 1:4
      acc += kernel_x[n, i] * kernel_y[n, j] * kcoeff[ix0[n] + i, iy0[n] + j, k, l]
    end
    buf[n, k, l] = exp10(acc)                           # <= `exp10` instead of `10 ^`
  end
  
  return buf
end

This runs in 160 ms, which is of the same order as NumPy. This is now satisfactory for me, but it would be great if someone could give tips on how to improve it further. That would help a lot.
Thanks!

3 Likes

Have you used constant inputs to verify you’re getting the same results? These don’t look equivalent at first glance, even accounting for the index ordering change.
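For example, a constant-input check of the two Julia versions against each other could look like this (just a sketch with made-up constants, small trailing dimensions, and throwaway variable names, assuming both functions above are already defined):

kx = fill(0.1, 100, 4); ky = fill(0.2, 100, 4)
jx = fill(5, 100); jy = fill(5, 100)         # constant offsets, safely in bounds
kc = fill(0.3, 40, 20, 7, 3)                 # small trailing dims keep the check cheap
cache = zeros(100, 7, 3)
my_bicubic_implementation!(cache, 100, kx, ky, jx, jy, kc)
cache ≈ bicubic(100, kx, ky, jx, jy, kc)     # expect true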

1 Like

Yes, I have checked both the NumPy and Julia implementations by passing the same random arrays from NumPy to Julia using the NPZ package, and found them to be the same.

1 Like

Maybe using the GPU could speed things up?

Unfortunately there’s an error in the Julia implementations that needs your clarification. If you remove the @inbounds and @turbo (implied inbounds) annotations in order to restore array bounds safety, you can hit an error:

julia> @benchmark bicubic($100, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff)
ERROR: BoundsError: attempt to access 40×20×5011×10 Array{Float64, 4} at index [12, 21, 1, 1]

So in the @inbounds+@turbo implementation, you were reading some elements from outside the array’s bounds, which in practice could mean reading outside the array’s buffer or picking up unrelated elements. I have no idea how you’re reaching the same result as the NumPy implementation; NumPy should have bounds checks.

The dimensions heavily implicate this expression:

kcoeff[ix0[n] + i, iy0[n] + j, k, l]

iy0[n] + j reaches 21, which exceeds the 20 in kcoeff’s second dimension. In moving from Python to Julia you raised the maximum of j from 3 to 4 and the maximum of iy0 from 16 to 17, adding 2 to the summed index, while switching kcoeff to 1-based indexing only accounts for an increase of 1. An adjustment is needed here.
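A quick numeric check, using the arrays generated above (a sketch):

maximum(ix0) + 4, size(kcoeff, 1)   # at worst 37 + 4 = 41, but the first dimension is only 40
maximum(iy0) + 4, size(kcoeff, 2)   # at worst 17 + 4 = 21, but the second dimension is only 20
checkbounds(Bool, kcoeff, maximum(ix0) + 4, maximum(iy0) + 4, 1, 1)   # false whenever the worst case is hit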

4 Likes

using BenchmarkTools
using Random
using LinearAlgebra
using Base.Threads

# --- demo data (reproducible) ---
Random.seed!(42)
kernel_x = rand(100, 4)
kernel_y = rand(100, 4)
ix0 = rand(1:37, 100)   # ensures ix0+0:3 ∈ 1:40
iy0 = rand(1:17, 100)   # ensures iy0+0:3 ∈ 1:20
kcoeff = rand(40, 20, 5011, 10)  # Float64

# Optional: avoid oversubscription if you also use Julia threads
# BLAS.set_num_threads(1)

function bicubic_julia_optimized(n_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
    # Result: n_layer × 5011 × 10
    kcoeff_intp = zeros(eltype(kcoeff), n_layer, size(kcoeff,3), size(kcoeff,4))

    @inbounds @views Threads.@threads for n in 1:n_layer
        # 4×4 weight matrix as an outer product of the kernel rows (the slices are views, not copies)
        W = kernel_x[n, :] * (kernel_y[n, :])'
        accv = vec(view(kcoeff_intp, n, :, :))  # 5011*10 vector

        x0 = ix0[n]; y0 = iy0[n]
        @inbounds for di in 0:3
            ix = x0 + di
            for dj in 0:3
                iy = y0 + dj
                w = W[di+1, dj+1]
                @inbounds if w != 0.0
                    srcv = vec(view(kcoeff, ix, iy, :, :))  # 5011*10 vector
                    BLAS.axpy!(w, srcv, accv)              # accv .+= w .* srcv
                end
            end
        end
    end

    # In-place base-10 exponential
    map!(exp10, kcoeff_intp, kcoeff_intp)
    return kcoeff_intp
end

# --- quick sanity run & benchmark ---
n_layer = size(kernel_x, 1)
@btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);

34.272 ms (210 allocations: 38.25 MiB)

Performance substantially degrades if this is called in a threaded environment, even with the optional adjustment to avoid oversubscription.

2 Likes

Thanks @Benny for pointing out this out-of-bounds issue. I did some checks and found the following:

  1. For the NumPy comparison, the arrays that I am importing have ix0 values from 0 to 36 and iy0 values from 0 to 16 (the minimum is zero), so when the Julia loops add i, j in 1:4 the indices go at most to 40 and 20, which is within bounds. So that comparison is correct, hence the matching result.

  2. When I generate random arrays in Julia itself using ix0 = rand(1:37, 100), the maximum index generated is 37, so the bicubic function should throw an error, since 37 + 4 = 41 is out of bounds. But I don’t know why that is not happening when I run it; no error is thrown. I am aware this calculation might be wrong, but I am not comparing it against anything.

1 Like

Because you opted out of the bounds checks with @inbounds and @turbo; there is a small illustration of that after the snippet below. I suppose the Julia way to match the NumPy arrays would be:

ix0 = rand(0:36, 100)
iy0 = rand(0:16, 100)

which should be fine because nothing is indexed with ix0[n] alone.
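As for why no error is raised, here is a tiny self-contained illustration of what @inbounds does (a throwaway 3-element vector, nothing to do with the actual data):

v = [1.0, 2.0, 3.0]
unchecked(x) = @inbounds x[4]   # bounds check elided inside the function
checked(x)  = x[4]              # bounds check kept
unchecked(v)                    # no error: silently reads past the end, the result is undefined garbage
# checked(v)                    # would throw BoundsError: attempt to access 3-element Vector{Float64} at index [4]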

How are you verifying that the NumPy and Julia functions have the same result from the same inputs, exactly? It’s very unusual to get the same result from an inbounds calculation (NumPy) and out-of-bounds mistakes (Julia).

2 Likes

function bicubic(
N_layer::Integer, kernel_x::AbstractMatrix{F}, kernel_y::AbstractMatrix{F},

Your Julia implementation is already the most optimized version. I’m not sure if your Python code runs on a single thread. I ran a single-threaded CPU test on my computer, and Julia is slightly faster than Python.

import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["BLIS_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"

import time
import statistics as stats
import numpy as np

def bicubic_numpy(n_layer, kernel_x, kernel_y, ix0, iy0, kcoeff):
    kcoeff_intp = np.zeros((n_layer, 5011, 10), dtype=kcoeff.dtype)
    for n in range(n_layer):
        for i in range(4):
            for j in range(4):
                kcoeff_intp[n, :, :] += (
                    kernel_x[n, i] * kernel_y[n, j] * kcoeff[ix0[n] + i, iy0[n] + j, :, :]
                )
    return np.exp(np.log(10.0) * kcoeff_intp)

def main():
    N_layer = 100
    kernel_x = np.random.rand(100,4)
    kernel_y = np.random.rand(100,4)
    ix0 = np.random.randint(0, 37, size=(100,))
    iy0 = np.random.randint(0, 17, size=(100,))
    kcoeff = np.random.rand(40,20,5011,10)

    y = bicubic_numpy(N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
    REPEAT = 20
    times = []
    for _ in range(REPEAT):
        t0 = time.perf_counter()
        y = bicubic_numpy(N_layer, kernel_x, kernel_y, ix0, iy0, kcoeff)
        t1 = time.perf_counter()
        times.append(t1 - t0)

    print("\n== Python (NumPy) Single Thread ==")
    print(f"runs: {REPEAT}")
    print(f"min   : {min(times)*1e3:10.3f} ms")
    print(f"median: {stats.median(times)*1e3:10.3f} ms")
    print(f"mean  : {stats.mean(times)*1e3:10.3f} ms")

main()

min : 86.584 ms

Julia version (bicubic): 83.648 ms (2 allocations: 38.23 MiB)

1 Like

using NPZ
arr = npzread("/mnt/data/user/workspace/randon_arr_comp_julia.npz")
kernel_x = arr["kernel_x"]
kernel_y = arr["kernel_y"]
ix0 = arr["ix0"]
iy0 = arr["iy0"]
kcoeff = arr["kcoeff"]

py_res = npzread("/mnt/data/user/workspace/my_code/main_pkg/result.npz")["res"]
res = bicubic(100, kernel_x, kernel_y, ix0, iy0, kcoeff)

isapprox(res, py_res; rtol=1e-5, atol=1e-8)

There is no bounds error in this comparison.

I saved these arrays from Python:

kernel_x = np.random.rand(100,4)
kernel_y = np.random.rand(100,4)
ix0 = np.random.randint(0, 37, size=(100,))
iy0 = np.random.randint(0, 17, size=(100,))
kcoeff = np.random.rand(40,20,5011,10)

I think this will not be out of bounds for Julia, as I discussed above in the thread.

Thanks @karei, my code is also single-threaded. Maybe my CPU is slow; that’s why I was getting ~170 ms using NumPy.

1 Like

Here you use ix0 and iy0 correctly starting at 0, so you’re doing the calculation correctly. It’s the original pure Julia ix0/iy0 that causes the bounds error.

NumPy does multithread when possible, in C I presume, though it’s not made clear which functions do. Julia is single-threaded by default except for LinearAlgebra’s underlying non-Julia BLAS library (BLAS.get_num_threads()). Julia multithreading requires starting the process with --threads=, and unless you’re using a package that does it for you, you have to write the multithreading parts yourself.
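For example, a minimal sketch (the thread counts in the comments are assumptions, not measurements from this thread):

using Base.Threads, LinearAlgebra
nthreads()               # Julia threads; fixed at startup, e.g. julia --threads=4 gives 4
BLAS.get_num_threads()   # BLAS threads are a separate, independently controlled count
out = zeros(8)
@threads for i in 1:8
    out[i] = i^2         # each iteration writes its own slot, so there is no data race
end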

1 Like

Thanks @technocrat for sharing this. This implementation seems to use multiple cores, but I want it to run on a single core only. All the implementations and timings that I have provided so far run on a single thread. When I set BLAS.set_num_threads(1) in your code, it seems to take 900 ms on my machine.

I know NumPy can use multiple threads, but in this case I set all BLAS- and MKL-related thread counts to 1 before starting the run, and I also kept an eye on htop, so I am sure that all the timings I am reporting, for both NumPy and Julia, are single-threaded.

For now, if you are facing the out-of-bounds issue, generate ix0 and iy0 from 0:36 and 0:16 only; that should solve it. The other condition is that the code should be single-threaded. If anyone can improve the original version under this condition, it would be very helpful; I will verify that implementation against the NumPy arrays I have anyway.
Thanks!

NumPy multithreads in more ways than BLAS or MKL, so I would avoid attempting to level the playing field like this. Just run what comes naturally, as automatic multithreading is a feature of a library.

FWIW, this is the most naive Julia translation I could think of, if I’m recalling NumPy semantics correctly. The views reproduce NumPy’s basic slicing, and I ended up keeping the loop order the same because it turns out the indexing into kcoeff is the bottleneck on my machine. As discussed, there’s no multithreading here. I did the last line in-place to save a couple of allocations, but it makes a negligible difference in runtime.

function bicubic_numpy(n_layer,kernel_x,kernel_y,ix0,iy0,kcoeff)
    kcoeff_intp = zeros(n_layer,5011,10)
    for n in 1:n_layer, i in 1:4, j in 1:4
         @view(kcoeff_intp[n,:,:]) .+= (kernel_x[n,i] * kernel_y[n,j]) .* @view(kcoeff[ix0[n] + i, iy0[n] + j,:,:])
    end
    kcoeff_intp .= exp.(log(10.0) .* kcoeff_intp)
    kcoeff_intp
end

It has a more reasonable discrepancy from bicubic and passes the isapprox test too:

julia> using BenchmarkTools # assume methods already defined

julia> begin
       kcoeff_intp = zeros(Float64, 100, 5011, 10)
       kernel_x = rand(100, 4)
       kernel_y = rand(100, 4)
       ix0 = rand(0:36, 100)
       iy0 = rand(0:16, 100)
       kcoeff = rand(40, 20, 5011, 10)
       n_layer = 100
       isapprox(@btime(bicubic_numpy($n_layer,$kernel_x,$kernel_y,$ix0,$iy0,$kcoeff)),
                @btime(bicubic($n_layer,$kernel_x,$kernel_y,$ix0,$iy0,$kcoeff));
                rtol=1e-5, atol=1e-8)
       end
  822.078 ms (3 allocations: 38.23 MiB)
  157.877 ms (3 allocations: 38.23 MiB)
true
2 Likes

This is single-threaded:

julia> @btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);
  34.122 ms (210 allocations: 38.25 MiB)

julia> Threads.nthreads()
1

and, in fact, throwing more threads at it is counterproductive.

julia> @btime bicubic_julia_optimized($n_layer, $kernel_x, $kernel_y, $ix0, $iy0, $kcoeff);
  361.190 ms (525 allocations: 38.29 MiB)

julia> Threads.nthreads()
64

Yes, it is single-threaded on your side, but it seems the BLAS calls inside the function are using multiple threads. When I set BLAS.set_num_threads(1), it becomes truly single-threaded, but then the runtime increases significantly.
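For reference, the adjustment is just:

using LinearAlgebra
BLAS.set_num_threads(1)   # forces the BLAS calls (the axpy! inside the loop) onto one thread
BLAS.get_num_threads()    # confirm it took effect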

2 Likes

Good catch. I see an eight-fold increase in runtime, also.

I also tested your implementation with the NumPy arrays and it passes the test, but it is very slow (~1 s in my case).