Fast 4D argmax

@jroon, there seems to be some inconsistency between benchmarks.

@mcabbott’s Tullio whichmax4Dt() is consistently ~200 times faster in Julia 1.6.0, in both my Win10 laptop and Android cellphone, than the function whichmax4D1(). Which would put it nearly on par with your C results.

However, in your table, their relative speed ratio is only a factor of ~10 ?

1 Like

Wow thanks @roflmaostc @mkitti and @rafael.guerra. Re the $ operator - got it - makes sense!

@mkitti sure here is the C code: r - Optimise which.max along multiple dimensions of an array - Stack Overflow

So I’m calling C code through Rcpp and Julia through JuliaCall. Probably there’s different overhead associated with each of these. I don’t yet know how to run C code in Julia which would give a truer comparison I think.

1 Like

One curiosity here: when calling julia from R (or Python), does the Julia code get compiled on every call? If that is the case, to at par with C++ or other compiled language, one would need to perform the outer iteration of the computation within Julia as well.

1 Like

So, I think, that the way it works is JuliaCall sets up a julia session in the background, you define a function which it complies, and then it lives in the julia environment until called. I think if it was compiling those functions each time it would be much slower.

Edit - for example here is how it looks in R to define and use a julia function:

library(JuliaCall)
#julia_install_package_if_needed("Tullio")
julia_library("Tullio")

n = 20 ; d = 5 ; nx = 1000 ; v = 50
array4d <- array( runif(n*nx*v*d ,-1,0),
                  dim = c(n, nx, v, d) )


julia_command("
function whichmax4Dtullio(A)
    fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
    out = similar(A, Int, axes(A)[1:end-1])
    @tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z])  init = (0,-Inf)  threads=false
end;
")

test <- julia_call("whichmax4Dtullio", array4d)

I think the function is compiled at the julia_command step (takes about 2 seconds on first run), and then the julia function is run in the julia environment by the julia_call function.

1 Like

That is not really possible, because compilation is specific for each of the types of variables provided. It might be compiled at the first call, and remain available. Can you measure the time of two consecutive calls of the function after the first definition? We would expect the second call to be faster than the first.

1 Like

I think reduced allocations is part of it, and it’s partly a signal that you’re writing more C-like code, simple loops rather than making and passing complicated objects around.

This is a little bit in the weeds, but right now threading and tiling are coupled, and this is really about the latter.

It is reducing over the index z, so it knows that it would never be safe to divide that loop among different threads, as they would both try to write to out[i,j,k] at the same time. Dividing any of i,j,k among different threads should be safe.

But besides this, it also divides large jobs into tiles in the name of cache-friendliness. And normally this includes dividing up the a reduction index like z, and writing the intermediate total into out[i,j,k], to pick up and work with later. And this is what won’t work here – because the intermediate result is a tuple (Int, Float64) but the final result is just an Int. (Issue #36.) I don’t have a great solution, although perhaps it should always give an error – right now it will work until the array is big enough, I think.

On larger sizes, first.(findmax4Dt(A)) was faster for me, despite allocating more. Which is, I think, because it can run the threading and tiling machinery. I haven’t investigated in detail which is helping. The computation is very simple here, so I would guess that it’s all about memory access. And that it should be possible to beat @tullio’s defaults by tuning things.

julia> A = rand(50, 50, 50, 50);

julia> @btime first.(findmax4Dt($A));
  1.903 ms (54 allocations: 2.86 MiB)

julia> @btime first.(findmax4Dt_nothreads($A));
  5.652 ms (4 allocations: 2.86 MiB)

julia> @btime whichmax4Dt($A);
  5.357 ms (2 allocations: 976.64 KiB)

julia> @btime whichmax4D3($A);  # with reduce
  30.339 ms (23 allocations: 977.16 KiB)
3 Likes

The translation of the C++ algorithm posted there to Julia would be something like:

function whichmax4D5(A)
  last_dim = size(A,4)
  diff = length(A)÷last_dim
  result = zeros(Int,diff)
  for i in 1:diff
    maxval = A[i]
    maxind = i
    for j in i:diff:length(A)
      if A[j] > maxval
        maxval = A[j]
        maxind = j
      end
    end
    result[i] = (maxind-1)÷diff + 1
  end
  return reshape(result,size(A)[1:3])
end

Here it takes

julia> @btime whichmax4D5($A);
  6.881 μs (3 allocations: 4.92 KiB)

which compares to the whichmax4D2 function initially provided:

julia> @btime whichmax4D2($A);
  45.389 μs (612 allocations: 91.50 KiB)

and to the fastet Tullio version:

julia> @btime whichmax4Dt($A);
  2.708 μs (1 allocation: 4.81 KiB)

For the larger array we get:

julia> A = rand(50,50,50,50);

julia> whichmax4D5(A) == whichmax4D2(A) == whichmax4Dt(A)
true

julia> @btime whichmax4D2($A);
  37.855 ms (125019 allocations: 61.13 MiB)

julia> @btime whichmax4D5($A);
  11.392 ms (4 allocations: 976.75 KiB)

julia> @btime whichmax4Dt($A);
  9.826 ms (2 allocations: 976.64 KiB)

For the larger array this seems to parallelize quite well, without any effort:

function whichmax4D5_parallel(A)
  last_dim = size(A,4)
  diff = length(A)÷last_dim
  result = zeros(Int,diff)
  Threads.@threads for i in 1:diff
    maxval = A[i]
    maxind = i
    for j in i:diff:length(A)
      if A[j] > maxval
        maxval = A[j]
        maxind = j
      end
    end
    result[i] = (maxind-1)÷diff + 1
  end
  return reshape(result,size(A)[1:3])
end
julia> Threads.nthreads()
4

julia> @btime whichmax4D5_parallel($A);
  3.419 ms (25 allocations: 978.62 KiB)


2 Likes

It helps to reshape:

julia> function whichmax5Dt(A)
         fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
         out = similar(A, Int, axes(A)[1:end-1])
         outv = vec(out);
         Am = reshape(A, (length(outv), size(A,4)))
         @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
         out
       end
whichmax5Dt (generic function with 1 method)

julia> whichmax5Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax5Dt($A);
  2.017 μs (5 allocations: 4.98 KiB)

(I’ll eventually have LoopVectorization do this automatically, but for now you still need to do it manually. Although these benchmarks weren’t using LV anyway. I saw little performance difference w/ and w/out here)
[EDIT: More importantly, I’ll eventually make LV capable of handing this kind of reduction.]

full benchmarks
julia> using Tullio

julia> A = rand(3, 10, 20, 5);

julia> function whichmax4D1(array4d::Array{Float64, 4})
         mapslices(argmax, array4d, dims=4)[:,:,:,]
       end
whichmax4D1 (generic function with 1 method)

julia> @btime whichmax4D1($A);
  426.615 μs (6059 allocations: 190.59 KiB)

julia> findmax4Dt(A) = @tullio (fun) out[i,j,k] := (z, A[i,j,k,z])  init = (0,-Inf);

julia> fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y));

julia> first.(findmax4Dt(A)) == whichmax4D1(A)
true

julia> @btime first.(findmax4Dt($A));
  4.733 μs (2 allocations: 14.31 KiB)

julia> function whichmax4Dt(A)
         fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
         out = similar(A, Int, axes(A)[1:end-1])
         @tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z])  init = (0,-Inf)  threads=false
       end;

julia> whichmax4Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax4Dt($A);
  2.793 μs (1 allocation: 4.81 KiB)

julia> function whichmax5Dt(A)
         fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
         out = similar(A, Int, axes(A)[1:end-1])
         outv = vec(out);
         Am = reshape(A, (length(outv), size(A,4)))
         @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
       end;

julia> function whichmax5Dt(A)
         fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
         out = similar(A, Int, axes(A)[1:end-1])
         outv = vec(out);
         Am = reshape(A, (length(outv), size(A,4)))
         @tullio (fun) outv[i] = first <| (z, Am[i,z])  init = (0,-Inf)  threads=false
         out
       end
whichmax5Dt (generic function with 1 method)

julia> whichmax5Dt(A) == whichmax4D1(A)
true

julia> @btime whichmax5Dt($A);
  2.017 μs (5 allocations: 4.98 KiB)
2 Likes

I tried but I don’t think its meaningful. If you take a look at the screenshot - I first timed the julia_command which took 12 seconds (doens’t usually take that long!), and then 3 separate calls to the function - which are all in similar range:


I’ll try to find an answer in JuliaCall documentation, or alternatively I’ll try to get the C code running in Julia for direct comparison.

Good to know :+1: Also thank you for comments on threading and tiling. I had to look up tiling, but it makes sense.

So does this mean that the size of the z dimension in some sense dictates whether threading is useful or not ? (As luck has it z is likely to be the smallest dimension in practice <10 - the others can all go wild)

1 Like

Another take on writing simple loops:

whichmax4D6(A) = reshape(whichmax4D6(reshape(A, :, size(A)[end])), size(A)[1:end-1])

function whichmax4D6(A::AbstractMatrix)
  result = zeros(Int, axes(A,1))
  @inbounds for i in axes(A,1)
    r, m = 1, typemin(eltype(A))
    for j in axes(A,2)
      v = A[i,j]
      r, m = ifelse(v>m, (j,v), (r,m))
    end
    result[i] = r
  end
  result
end
julia> A = rand(3, 10, 20, 5);  # original size

julia> @btime whichmax4D5($A);  # simple loops by @leandromartinez98
  5.992 μs (3 allocations: 4.92 KiB)

julia> @btime whichmax4D6($A);  # faster than my best above!
  1.721 μs (5 allocations: 5.02 KiB)

julia> @btime whichmax5Dt($A);  # ... but not as fast as @Elrod's variant
  1.454 μs (5 allocations: 4.98 KiB)

julia> A = rand(50, 50, 50, 50);  # my large size above

julia> @btime whichmax4D6($A);    # here slower than first.(findmax4Dt($A));
  5.420 ms (6 allocations: 976.84 KiB)

julia> @btime whichmax4D5_parallel($A);
  2.017 ms (25 allocations: 978.38 KiB)

julia> @btime whichmax4D6_threads($A);
  2.348 ms (27 allocations: 978.45 KiB)
3 Likes

Times of the order of 500 ms is what it takes here including compilation, so it may be compiling every time.

(here the time for Alarge = rand(20,400,40,3); is 3.360 ms for the “equivalent” implementation as the C++ one (version 4D5 above), which is comparable to the 2.3 ms you reported).

1 Like

That seems way too long. If you start a fresh session (R and Julia), do you still get those numbers?

I think we have the Julia side well optimized. On the larger 20x400x40x3 array, we’re down to 1 millisecond which is faster than the 4.5 ms of which_max_C when called from R.

julia> @time whichmax5Dt(A);
  0.001322 seconds (6 allocations: 2.442 MiB)

julia> @btime whichmax5Dt(A);
  988.700 μs (6 allocations: 2.44 MiB)

julia> size(A)
(20, 400, 40, 3)

I suspect the issue now is with how the array is transferred via JuliaCall. Perhaps @Non-Contradiction or @ChrisRackauckas has some insight about how to most efficiently transfer large arrays using JuliaCall.

1 Like

I don’t, so it’s up to @Non-Contradiction

2 Likes

How about using

julia_assign("A", A)
julia_eval("whichmax5Dt(A)", "Julia")

This should help us evaluate transfer issues.

6 Likes

@mkitti - great suggestion - here is my R code and results now:

# Make an array
n = 20 ; nx = 400 ; v = 40 ; d = 3 
array4d <- array( runif(n*nx*v*d ,-1,0),
                  dim = c(n, nx, v, d) )
#Assign to Julia
julia_assign("A", array4d)

microbenchmark:: microbenchmark(
  res1 <- which_max_4D(array4d),
  res2 <- julia_eval("whichmax4D1(A)", "R"),
  res3 <- julia_eval("whichmax4D2(A)", "R"),
  res4 <- julia_eval("whichmax4Dmap(A)", "R"),
  res5 <- julia_eval("whichmax4Dreduce(A)", "R"),
  res6 <- julia_eval("whichmax4Dtullio(A)", "R"),
  res7 <- julia_eval("whichmax4D5_leandro(A)", "R"),
  res8 <- julia_eval("whichmax4D5_elrod(A)", "R"),
  res9 <- julia_eval("whichmax4D6(A)", "R"),
  res10 <- whichmaxC(array4d), times=500)

Wow!! :smiley:

I also ran a benchmark on the julia_assign command:

Looks like you are right @mkitti - the slow part is moving the data from R to Julia

Edit: I ran it on an even bigger array also:

3 Likes

The performance here is indeed dominated by array transformation. Typically, the time can be decomposed into three parts.

  1. JuliaCall interface function time cost (julia_call on the order of 10 microsends, julia_command on the order of 100 microseconds).
  2. data transfer time cost, which will be on the order of milliseconds for a large data set like this.
  3. julia function time cost.
3 Likes

It should be possible to wrap the raw bits of the R array instead of converting/copying - this is already done here in RCall.jl for R → Julia arrays.

1 Like

Thanks for the information.
The current implementation of JuliaCall is on the “safe” side, so users have little to worry about. I have thought about having something like julia_unsafe_call which does not do things like automatic type conversion/data copying. I will think more about the idea when I have more time on this.
Another thing is that JuliaCall currently uses RCall.rcopy for data copying. I just noticed that the performance here seems to be magnitude slower than copy in julia. And in this case, the rcopy logic should be here: RCall.jl/base.jl at 0b5e9f119008d3a8471833566745311df79101ff · JuliaInterop/RCall.jl · GitHub
I’m wondering whether the performance for rcopy in this case can also be improved.

3 Likes

That all sounds awesome! To give you my perspective on how I’d like to use JuliaCall → I’m fluent enough in R to be effective at getting my work tasks done quickly and dabble with making packages etc (nothing released yet), I expect it would take me a while to get to that level in Julia. But I would like to splash in snippets of Julia here and there were R is particularly slow, as opposed to writing large functions outright. I know many people use Rcpp in this way, but I’ve always found C++ to be utterly impenetrable for me. Learning Julia seems more achievable for me! So from that point of view, achieving blistering speed via a julia_unsafe_call for short functions sounds very appealing, but, I do understand why its something you need to think carefully about also! In any case JuliaCall is already an awesome package - thank you!

I must say, the response to my query here has been great and I’m blown away at all the options presented, and speeds that can be achieved - thank you all for the time you have put into this!

5 Likes

To elaborate on the unsafe aspects, see the comment here:

The asymmetry of the transfers is immediately striking to me. I may be mistaken, but we do not see a lot of overhead of transferring the result from Julia to R. This suggests a strategy of starting with the variable on the Julia side.

Perhaps @randy3k or @simonbyrne could comment from the Rcall.jl perspective. I suspect julia_[unsafe]_call might not be that unsafe if julia_call could protect the data from R’s garbage collection. Some combination of robject, RCall.protect, and unsafe_array might work.

4 Likes