Fast 4D argmax

Hi Julia folks!
I’m new to Julia (very new), but work mainly in R. I’m trying to replace a slow bit of R code with a Julia function that I will call from within R. Essentially what I’m trying to achieve is the argmax of a 4D array across the 4th dimension. My first attempt was based on argmax and mapslices to get the output format I need back in R:

A = rand(3, 10, 20, 5)

function whichmax4D1(array4d::Array{Float64, 4})
    mapslices(argmax, array4d, dims=4)[:,:,:,]
end

@btime whichmax4D1(A)
1.084 ms (6059 allocations: 190.59 KiB)

When called from within R this was slower than my R version, so I had a go at making things faster, based mainly on how I made the R version faster:

function whichmax4D2(array4d::Array{Float64, 4})
    dims = size(array4d)
    slicedim = dims[1:3]
    num = prod(slicedim)
    span = (1:dims[4]) * num .- num
    res = Int[]
    for i in 1:num
        push!(res, findmax( array4d[i .+ span] )[2])
    end
    reshape(res, slicedim)
end

@btime whichmax4D2(A)
102.865 μs (612 allocations: 91.50 KiB)

julia> res1 = whichmax4D1(A);

julia> res2 = whichmax4D2(A);

julia> res1 == res2
true

Yay, so that’s some progress - even with the overhead of calling from within R, it’s about twice as fast as my tuned R version. But I know almost zero Julia (these are my first Julia functions), so I figure there must be other ways to make this faster. Any suggestions? (I call this function a lot, so small improvements here add up!)
Thank you all!


Here’s a start, although this makes an array of things like CartesianIndex(3, 10, 20, 3) before throwing it away, so there is room for improvement:

julia> whichmax4D1(A) == map(I->I[4], dropdims(argmax(A, dims=4), dims=4))
true

julia> @btime whichmax4D1($A);  # as above, different computer
465.583 μs (6059 allocations: 190.59 KiB)

julia> @btime whichmax4D2($A);
43.209 μs (612 allocations: 91.50 KiB)

julia> @btime map(I->I[4], dropdims(argmax($A, dims=4), dims=4));
17.166 μs (27 allocations: 29.03 KiB)

Edit – An attempt using reduce:

julia> function whichmax4D3(A)
           R = reduce(CartesianIndices(A), dims=4, init=firstindex(A,4)) do i,J
               I = CartesianIndex((J[1], J[2], J[3], i))
               @inbounds A[I] > A[J] ? i : J[4]
           end
           dropdims(R, dims=4)
       end
whichmax4D3 (generic function with 1 method)

julia> whichmax4D3(A) == whichmax4D1(A)
true

julia> @btime whichmax4D3($A);
7.479 μs (22 allocations: 5.33 KiB)

And some variants on the theme of just writing out all the loops. Doing so directly might be more readable, but anyway, with my package:

julia> using Tullio

julia> findmax4Dt(A) = @tullio (fun) out[i,j,k] := (z, A[i,j,k,z])  init = (0,-Inf);

julia> fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y));

julia> first.(findmax4Dt(A)) == whichmax4D1(A)
true

julia> @btime first.(findmax4Dt($A));
3.021 μs (2 allocations: 14.31 KiB)

julia> function whichmax4Dt(A)
           fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
           out = similar(A, Int, axes(A)[1:end-1])
           @tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z])  init = (0,-Inf)  threads=false
       end;

julia> whichmax4Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax4Dt($A);
2.227 μs (1 allocation: 4.81 KiB)

Wow, so many options - I’m still googling to understand the map version - it will take me some time to grok all of this 😅 Thank you very much! I just googled the Tullio package - looks great!


As a curiosity, in case it might be of interest: on an average Android cellphone I got the following benchmark for the first function:

julia> @btime whichmax4D1($A)
1.239 ms (6059 allocations: 190.59 KiB)

And for Tullio:

julia> @btime whichmax4Dt($A);
6.438 μs (1 allocation: 4.81 KiB)

@rafael.guerra wow… my times were on a Core i9 MacBook.

I did some benchmarking on calling the various functions from within R on a bigger array:

The first one is my R code, the next 2 are the functions I posted originally, the map, reduce and Tullio versions are as posted by @mcabbott, and the last one is C++ code run via Rcpp (from a Stack Overflow answer). Also, it is Julia 1.5, as Julia 1.6 doesn’t yet work with JuliaCall in R.

@mcabbott I have a few questions if I may - first the probably silly one - why do you use $A in the functions above? The documentation tells me it is for expression interpolation - but I don’t get what it’s doing in the code above - indeed running with or without it doesn’t seem to change anything? Second thing I was wondering - are the speedups you achieved in large part down to reducing allocations, or is there more to it than that? Third question - in the Tullio code I tried turning threading on and it got slower. Is this because of the overhead of parallelism, or is the function just not suited to threading?

The $A is used within @btime to avoid the problems of benchmarking with globals.


Globals produce very inefficient code in Julia. The compiler cannot optimize code that uses a non-const global, because another piece of code could change its type at any time. The $ interpolates the value into the expression to make it a constant within the context of the benchmarking. BenchmarkTools.jl/manual.md at master · JuliaCI/BenchmarkTools.jl · GitHub
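A small sketch of why this matters (illustrative names, not code from the thread): a function that reads a non-const global cannot be specialized, which is exactly the effect `$A` sidesteps inside `@btime`:

```julia
# Hypothetical names (sum_global, sum_arg): a non-const global's type can
# change at any time, so code reading it directly cannot be specialized.
A = rand(3, 10, 20, 5)

sum_global() = sum(A)   # reads the global -> type-unstable, slower
sum_arg(x)  = sum(x)    # takes an argument -> specialized on typeof(x)

# Interpolating with $A in @btime plays the same role as passing an
# argument: the value is spliced into the benchmarked expression, so you
# time the work itself rather than the global lookup.
```

Declaring `const A = ...` has a similar effect for ordinary (non-benchmark) code.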

Why is this so much faster? We should be able to get closer to this if not beat it. Could you link that for us?


@jroon, there seems to be some inconsistency between benchmarks.

@mcabbott’s Tullio whichmax4Dt() is consistently ~200 times faster than whichmax4D1() in Julia 1.6.0, on both my Win10 laptop and my Android cellphone. That would put it nearly on par with your C++ results.

However, in your table, their relative speed ratio is only a factor of ~10?


Wow thanks @roflmaostc @mkitti and @rafael.guerra. Re the $ operator - got it - makes sense!

@mkitti sure, here is the C++ code: https://stackoverflow.com/a/62909317/2498193

So I’m calling C++ code through Rcpp and Julia through JuliaCall. Probably there’s different overhead associated with each of these. I don’t yet know how to run C code in Julia, which would give a truer comparison, I think.


One curiosity here: when calling Julia from R (or Python), does the Julia code get compiled on every call? If that is the case, then to be on par with C++ or another compiled language, one would need to perform the outer iteration of the computation within Julia as well.


So, I think the way it works is that JuliaCall sets up a Julia session in the background, you define a function which it compiles, and then it lives in the Julia environment until called. I think if it was compiling those functions each time it would be much slower.

Edit - for example here is how it looks in R to define and use a julia function:

library(JuliaCall)
#julia_install_package_if_needed("Tullio")
julia_library("Tullio")

n = 20 ; d = 5 ; nx = 1000 ; v = 50
array4d <- array( runif(n*nx*v*d, -1, 0),
                  dim = c(n, nx, v, d) )

julia_command("
function whichmax4Dtullio(A)
    fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
    out = similar(A, Int, axes(A)[1:end-1])
    @tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z])  init = (0,-Inf)  threads=false
end;
")

test <- julia_call("whichmax4Dtullio", array4d)

I think the function is compiled at the julia_command step (takes about 2 seconds on first run), and then the julia function is run in the julia environment by the julia_call function.


That is not really possible, because compilation is specific to the types of the arguments provided. It might be compiled at the first call and remain available. Can you measure the time of two consecutive calls of the function after the first definition? We would expect the second call to be faster than the first.
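For example (a minimal sketch with a hypothetical function `g`, not from the thread), timing two consecutive calls in a fresh session shows the one-off compilation cost:

```julia
# The first call with a given argument type triggers compilation; later
# calls with the same type reuse the already-compiled method.
g(x) = sum(abs2, x)

B = rand(100, 100)
@time g(B)      # first call: includes compile time
@time g(B)      # second call: already compiled, much faster

@time g(1:10)   # new argument type (UnitRange) -> compiles a new method
```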


I think reduced allocations is part of it, and it’s partly a signal that you’re writing more C-like code, simple loops rather than making and passing complicated objects around.

This is a little bit in the weeds, but right now threading and tiling are coupled, and this is really about the latter.

It is reducing over the index z, so it knows that it would never be safe to divide that loop among different threads, as they would both try to write to out[i,j,k] at the same time. Dividing any of i,j,k among different threads should be safe.
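That safe pattern can be sketched directly with `Threads.@threads` (a hypothetical plain-loops variant, not code from this thread): thread over one of the output indices and keep the reduction over `z` serial inside each task:

```julia
# Hypothetical whichmax_threaded: each thread owns a distinct slice
# out[:, :, k], so no two threads write the same element; the reduction
# over the 4th dimension stays serial within each task.
function whichmax_threaded(A::AbstractArray{<:Real,4})
    out = similar(A, Int, size(A)[1:3])
    Threads.@threads for k in axes(A, 3)
        for j in axes(A, 2), i in axes(A, 1)
            best, bestz = A[i, j, k, 1], 1
            for z in axes(A, 4)          # serial reduction over z
                if A[i, j, k, z] > best
                    best, bestz = A[i, j, k, z], z
                end
            end
            out[i, j, k] = bestz
        end
    end
    return out
end
```

(Start Julia with multiple threads, e.g. `julia -t 4`, for `@threads` to do anything.)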

But besides this, it also divides large jobs into tiles in the name of cache-friendliness. And normally this includes dividing up a reduction index like z, and writing the intermediate total into out[i,j,k], to pick up and work with later. And this is what won’t work here – because the intermediate result is a tuple (Int, Float64) but the final result is just an Int. (Issue #36.) I don’t have a great solution, although perhaps it should always give an error – right now it will work until the array is big enough, I think.

On larger sizes, first.(findmax4Dt(A)) was faster for me, despite allocating more. Which is, I think, because it can run the threading and tiling machinery. I haven’t investigated in detail which is helping. The computation is very simple here, so I would guess that it’s all about memory access. And that it should be possible to beat @tullio's defaults by tuning things.

julia> A = rand(50, 50, 50, 50);

julia> @btime first.(findmax4Dt($A));
1.903 ms (54 allocations: 2.86 MiB)


julia> @btime whichmax4Dt($A);
5.357 ms (2 allocations: 976.64 KiB)

julia> @btime whichmax4D3($A);  # with reduce
30.339 ms (23 allocations: 977.16 KiB)

The translation of the C++ algorithm posted there to Julia would be something like:

function whichmax4D5(A)
    last_dim = size(A,4)
    diff = length(A)÷last_dim
    result = zeros(Int,diff)
    for i in 1:diff
        maxval = A[i]
        maxind = i
        for j in i:diff:length(A)
            if A[j] > maxval
                maxval = A[j]
                maxind = j
            end
        end
        result[i] = (maxind-1)÷diff + 1
    end
    return reshape(result,size(A)[1:3])
end

Here it takes

julia> @btime whichmax4D5($A);
6.881 μs (3 allocations: 4.92 KiB)

which compares to the whichmax4D2 function initially provided:

julia> @btime whichmax4D2($A);
45.389 μs (612 allocations: 91.50 KiB)

and to the fastest Tullio version:

julia> @btime whichmax4Dt($A);
2.708 μs (1 allocation: 4.81 KiB)

For the larger array we get:

julia> A = rand(50,50,50,50);

julia> whichmax4D5(A) == whichmax4D2(A) == whichmax4Dt(A)
true

julia> @btime whichmax4D2($A);
37.855 ms (125019 allocations: 61.13 MiB)

julia> @btime whichmax4D5($A);
11.392 ms (4 allocations: 976.75 KiB)

julia> @btime whichmax4Dt($A);
9.826 ms (2 allocations: 976.64 KiB)

For the larger array this seems to parallelize quite well, without any effort:

function whichmax4D5_parallel(A)
    last_dim = size(A,4)
    diff = length(A)÷last_dim
    result = zeros(Int,diff)
    Threads.@threads for i in 1:diff
        maxval = A[i]
        maxind = i
        for j in i:diff:length(A)
            if A[j] > maxval
                maxval = A[j]
                maxind = j
            end
        end
        result[i] = (maxind-1)÷diff + 1
    end
    return reshape(result,size(A)[1:3])
end

julia> @btime whichmax4D5_parallel($A);
3.419 ms (25 allocations: 978.62 KiB)


It helps to reshape:

julia> function whichmax5Dt(A)
           fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
           out = similar(A, Int, axes(A)[1:end-1])
           outv = vec(out)
           Am = reshape(A, (length(outv), size(A,4)))
           @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
           out
       end
whichmax5Dt (generic function with 1 method)

julia> whichmax5Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax5Dt($A);
2.017 μs (5 allocations: 4.98 KiB)

(I’ll eventually have LoopVectorization do this automatically, but for now you still need to do it manually. Although these benchmarks weren’t using LV anyway - I saw little performance difference with and without it here.)
[EDIT: More importantly, I’ll eventually make LV capable of handing this kind of reduction.]

full benchmarks
julia> using Tullio

julia> A = rand(3, 10, 20, 5);

julia> function whichmax4D1(array4d::Array{Float64, 4})
           mapslices(argmax, array4d, dims=4)[:,:,:,]
       end
whichmax4D1 (generic function with 1 method)

julia> @btime whichmax4D1($A);
426.615 μs (6059 allocations: 190.59 KiB)

julia> findmax4Dt(A) = @tullio (fun) out[i,j,k] := (z, A[i,j,k,z])  init = (0,-Inf);

julia> fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y));

julia> first.(findmax4Dt(A)) == whichmax4D1(A)
true

julia> @btime first.(findmax4Dt($A));
4.733 μs (2 allocations: 14.31 KiB)

julia> function whichmax4Dt(A)
           fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
           out = similar(A, Int, axes(A)[1:end-1])
           @tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z])  init = (0,-Inf)  threads=false
       end;

julia> whichmax4Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax4Dt($A);
2.793 μs (1 allocation: 4.81 KiB)

julia> function whichmax5Dt(A)
           fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
           out = similar(A, Int, axes(A)[1:end-1])
           outv = vec(out)
           Am = reshape(A, (length(outv), size(A,4)))
           @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
       end;

julia> function whichmax5Dt(A)
           fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
           out = similar(A, Int, axes(A)[1:end-1])
           outv = vec(out)
           Am = reshape(A, (length(outv), size(A,4)))
           @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
           out
       end
whichmax5Dt (generic function with 1 method)

julia> whichmax5Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax5Dt($A);
2.017 μs (5 allocations: 4.98 KiB)

I tried, but I don’t think it’s meaningful. If you take a look at the screenshot - I first timed the julia_command, which took 12 seconds (it doesn’t usually take that long!), and then 3 separate calls to the function - which are all in a similar range:

I’ll try to find an answer in JuliaCall documentation, or alternatively I’ll try to get the C code running in Julia for direct comparison.

Good to know! Also thank you for the comments on threading and tiling. I had to look up tiling, but it makes sense.

So does this mean that the size of the z dimension in some sense dictates whether threading is useful or not? (As luck has it, z is likely to be the smallest dimension in practice, <10 - the others can all go wild.)


Another take on writing simple loops:

whichmax4D6(A) = reshape(whichmax4D6(reshape(A, :, size(A)[end])), size(A)[1:end-1])

function whichmax4D6(A::AbstractMatrix)
    result = zeros(Int, axes(A,1))
    @inbounds for i in axes(A,1)
        r, m = 1, typemin(eltype(A))
        for j in axes(A,2)
            v = A[i,j]
            r, m = ifelse(v>m, (j,v), (r,m))
        end
        result[i] = r
    end
    result
end

julia> A = rand(3, 10, 20, 5);  # original size

julia> @btime whichmax4D5($A);  # simple loops by @leandromartinez98
5.992 μs (3 allocations: 4.92 KiB)

julia> @btime whichmax4D6($A);  # faster than my best above!
1.721 μs (5 allocations: 5.02 KiB)

julia> @btime whichmax5Dt($A);  # ... but not as fast as @Elrod's variant
1.454 μs (5 allocations: 4.98 KiB)

julia> A = rand(50, 50, 50, 50);  # my large size above

julia> @btime whichmax4D6($A);    # here slower than first.(findmax4Dt($A));
5.420 ms (6 allocations: 976.84 KiB)

julia> @btime whichmax4D5_parallel($A);
2.017 ms (25 allocations: 978.38 KiB)


Times of the order of 500 ms are what it takes here including compilation, so it may be compiling every time.

(Here the time for Alarge = rand(20,400,40,3); is 3.360 ms for the implementation “equivalent” to the C++ one (version 4D5 above), which is comparable to the 2.3 ms you reported.)


That seems way too long. If you start a fresh session (R and Julia), do you still get those numbers?

I think we have the Julia side well optimized. On the larger 20x400x40x3 array, we’re down to 1 millisecond, which is faster than the 4.5 ms of which_max_C when called from R.

julia> @time whichmax5Dt(A);
0.001322 seconds (6 allocations: 2.442 MiB)

julia> @btime whichmax5Dt(A);
988.700 μs (6 allocations: 2.44 MiB)

julia> size(A)
(20, 400, 40, 3)

I suspect the issue now is with how the array is transferred via JuliaCall. Perhaps @Non-Contradiction or @ChrisRackauckas has some insight about how to most efficiently transfer large arrays using JuliaCall.


I don’t, so it’s up to @Non-Contradiction
