@jroon, there seems to be some inconsistency between benchmarks.
@mcabbott’s Tullio whichmax4Dt() is consistently ~200 times faster in Julia 1.6.0, in both my Win10 laptop and Android cellphone, than the function whichmax4D1(). Which would put it nearly on par with your C results.
However, in your table, their relative speed ratio is only a factor of ~10 ?
So I’m calling C code through Rcpp and Julia through JuliaCall. Probably there’s different overhead associated with each of these. I don’t yet know how to run C code in Julia which would give a truer comparison I think.
One curiosity here: when calling julia from R (or Python), does the Julia code get compiled on every call? If that is the case, to at par with C++ or other compiled language, one would need to perform the outer iteration of the computation within Julia as well.
So, I think, that the way it works is JuliaCall sets up a julia session in the background, you define a function which it complies, and then it lives in the julia environment until called. I think if it was compiling those functions each time it would be much slower.
Edit - for example here is how it looks in R to define and use a julia function:
library(JuliaCall)
#julia_install_package_if_needed("Tullio")
julia_library("Tullio")
n = 20 ; d = 5 ; nx = 1000 ; v = 50
array4d <- array( runif(n*nx*v*d ,-1,0),
dim = c(n, nx, v, d) )
julia_command("
function whichmax4Dtullio(A)
fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
out = similar(A, Int, axes(A)[1:end-1])
@tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z]) init = (0,-Inf) threads=false
end;
")
test <- julia_call("whichmax4Dtullio", array4d)
I think the function is compiled at the julia_command step (takes about 2 seconds on first run), and then the julia function is run in the julia environment by the julia_call function.
That is not really possible, because compilation is specific for each of the types of variables provided. It might be compiled at the first call, and remain available. Can you measure the time of two consecutive calls of the function after the first definition? We would expect the second call to be faster than the first.
I think reduced allocations is part of it, and it’s partly a signal that you’re writing more C-like code, simple loops rather than making and passing complicated objects around.
This is a little bit in the weeds, but right now threading and tiling are coupled, and this is really about the latter.
It is reducing over the index z, so it knows that it would never be safe to divide that loop among different threads, as they would both try to write to out[i,j,k] at the same time. Dividing any of i,j,k among different threads should be safe.
But besides this, it also divides large jobs into tiles in the name of cache-friendliness. And normally this includes dividing up the a reduction index like z, and writing the intermediate total into out[i,j,k], to pick up and work with later. And this is what won’t work here – because the intermediate result is a tuple (Int, Float64) but the final result is just an Int. (Issue #36.) I don’t have a great solution, although perhaps it should always give an error – right now it will work until the array is big enough, I think.
On larger sizes, first.(findmax4Dt(A)) was faster for me, despite allocating more. Which is, I think, because it can run the threading and tiling machinery. I haven’t investigated in detail which is helping. The computation is very simple here, so I would guess that it’s all about memory access. And that it should be possible to beat @tullio’s defaults by tuning things.
julia> A = rand(50, 50, 50, 50);
julia> @btime first.(findmax4Dt($A));
1.903 ms (54 allocations: 2.86 MiB)
julia> @btime first.(findmax4Dt_nothreads($A));
5.652 ms (4 allocations: 2.86 MiB)
julia> @btime whichmax4Dt($A);
5.357 ms (2 allocations: 976.64 KiB)
julia> @btime whichmax4D3($A); # with reduce
30.339 ms (23 allocations: 977.16 KiB)
The translation of the C++ algorithm posted there to Julia would be something like:
function whichmax4D5(A)
last_dim = size(A,4)
diff = length(A)÷last_dim
result = zeros(Int,diff)
for i in 1:diff
maxval = A[i]
maxind = i
for j in i:diff:length(A)
if A[j] > maxval
maxval = A[j]
maxind = j
end
end
result[i] = (maxind-1)÷diff + 1
end
return reshape(result,size(A)[1:3])
end
julia> A = rand(50,50,50,50);
julia> whichmax4D5(A) == whichmax4D2(A) == whichmax4Dt(A)
true
julia> @btime whichmax4D2($A);
37.855 ms (125019 allocations: 61.13 MiB)
julia> @btime whichmax4D5($A);
11.392 ms (4 allocations: 976.75 KiB)
julia> @btime whichmax4Dt($A);
9.826 ms (2 allocations: 976.64 KiB)
For the larger array this seems to parallelize quite well, without any effort:
function whichmax4D5_parallel(A)
last_dim = size(A,4)
diff = length(A)÷last_dim
result = zeros(Int,diff)
Threads.@threads for i in 1:diff
maxval = A[i]
maxind = i
for j in i:diff:length(A)
if A[j] > maxval
maxval = A[j]
maxind = j
end
end
result[i] = (maxind-1)÷diff + 1
end
return reshape(result,size(A)[1:3])
end
julia> function whichmax5Dt(A)
fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
out = similar(A, Int, axes(A)[1:end-1])
outv = vec(out);
Am = reshape(A, (length(outv), size(A,4)))
@tullio (fun) outv[i] = first <| (z, Am[i,z]) init = (0,-Inf) threads=false
out
end
whichmax5Dt (generic function with 1 method)
julia> whichmax5Dt(A) == whichmax4D1(A)
true
julia> @btime whichmax5Dt($A);
2.017 μs (5 allocations: 4.98 KiB)
(I’ll eventually have LoopVectorization do this automatically, but for now you still need to do it manually. Although these benchmarks weren’t using LV anyway. I saw little performance difference w/ and w/out here)
[EDIT: More importantly, I’ll eventually make LV capable of handing this kind of reduction.]
full benchmarks
julia> using Tullio
julia> A = rand(3, 10, 20, 5);
julia> function whichmax4D1(array4d::Array{Float64, 4})
mapslices(argmax, array4d, dims=4)[:,:,:,]
end
whichmax4D1 (generic function with 1 method)
julia> @btime whichmax4D1($A);
426.615 μs (6059 allocations: 190.59 KiB)
julia> findmax4Dt(A) = @tullio (fun) out[i,j,k] := (z, A[i,j,k,z]) init = (0,-Inf);
julia> fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y));
julia> first.(findmax4Dt(A)) == whichmax4D1(A)
true
julia> @btime first.(findmax4Dt($A));
4.733 μs (2 allocations: 14.31 KiB)
julia> function whichmax4Dt(A)
fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
out = similar(A, Int, axes(A)[1:end-1])
@tullio (fun) out[i,j,k] = first <| (z, A[i,j,k,z]) init = (0,-Inf) threads=false
end;
julia> whichmax4Dt(A) == whichmax4D1(A)
true
julia> @btime whichmax4Dt($A);
2.793 μs (1 allocation: 4.81 KiB)
julia> function whichmax5Dt(A)
fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
out = similar(A, Int, axes(A)[1:end-1])
outv = vec(out);
Am = reshape(A, (length(outv), size(A,4)))
@tullio (fun) outv[i] = first <| (z, Am[i,z]) init = (0,-Inf) threads=false
end;
julia> function whichmax5Dt(A)
fun((i,x), (j,y)) = ifelse(x>y, (i,x), (j,y))
out = similar(A, Int, axes(A)[1:end-1])
outv = vec(out);
Am = reshape(A, (length(outv), size(A,4)))
@tullio (fun) outv[i] = first <| (z, Am[i,z]) init = (0,-Inf) threads=false
out
end
whichmax5Dt (generic function with 1 method)
julia> whichmax5Dt(A) == whichmax4D1(A)
true
julia> @btime whichmax5Dt($A);
2.017 μs (5 allocations: 4.98 KiB)
I tried but I don’t think its meaningful. If you take a look at the screenshot - I first timed the julia_command which took 12 seconds (doens’t usually take that long!), and then 3 separate calls to the function - which are all in similar range:
I’ll try to find an answer in JuliaCall documentation, or alternatively I’ll try to get the C code running in Julia for direct comparison.
Good to know Also thank you for comments on threading and tiling. I had to look up tiling, but it makes sense.
So does this mean that the size of the z dimension in some sense dictates whether threading is useful or not ? (As luck has it z is likely to be the smallest dimension in practice <10 - the others can all go wild)
whichmax4D6(A) = reshape(whichmax4D6(reshape(A, :, size(A)[end])), size(A)[1:end-1])
function whichmax4D6(A::AbstractMatrix)
result = zeros(Int, axes(A,1))
@inbounds for i in axes(A,1)
r, m = 1, typemin(eltype(A))
for j in axes(A,2)
v = A[i,j]
r, m = ifelse(v>m, (j,v), (r,m))
end
result[i] = r
end
result
end
julia> A = rand(3, 10, 20, 5); # original size
julia> @btime whichmax4D5($A); # simple loops by @leandromartinez98
5.992 μs (3 allocations: 4.92 KiB)
julia> @btime whichmax4D6($A); # faster than my best above!
1.721 μs (5 allocations: 5.02 KiB)
julia> @btime whichmax5Dt($A); # ... but not as fast as @Elrod's variant
1.454 μs (5 allocations: 4.98 KiB)
julia> A = rand(50, 50, 50, 50); # my large size above
julia> @btime whichmax4D6($A); # here slower than first.(findmax4Dt($A));
5.420 ms (6 allocations: 976.84 KiB)
julia> @btime whichmax4D5_parallel($A);
2.017 ms (25 allocations: 978.38 KiB)
julia> @btime whichmax4D6_threads($A);
2.348 ms (27 allocations: 978.45 KiB)
Times of the order of 500 ms is what it takes here including compilation, so it may be compiling every time.
(here the time for Alarge = rand(20,400,40,3); is 3.360 ms for the “equivalent” implementation as the C++ one (version 4D5 above), which is comparable to the 2.3 ms you reported).
That seems way too long. If you start a fresh session (R and Julia), do you still get those numbers?
I think we have the Julia side well optimized. On the larger 20x400x40x3 array, we’re down to 1 millisecond which is faster than the 4.5 ms of which_max_C when called from R.
I suspect the issue now is with how the array is transferred via JuliaCall. Perhaps @Non-Contradiction or @ChrisRackauckas has some insight about how to most efficiently transfer large arrays using JuliaCall.
Thanks for the information.
The current implementation of JuliaCall is on the “safe” side, so users have little to worry about. I have thought about having something like julia_unsafe_call which does not do things like automatic type conversion/data copying. I will think more about the idea when I have more time on this.
Another thing is that JuliaCall currently uses RCall.rcopy for data copying. I just noticed that the performance here seems to be magnitude slower than copy in julia. And in this case, the rcopy logic should be here: RCall.jl/base.jl at 0b5e9f119008d3a8471833566745311df79101ff · JuliaInterop/RCall.jl · GitHub
I’m wondering whether the performance for rcopy in this case can also be improved.
That all sounds awesome! To give you my perspective on how I’d like to use JuliaCall → I’m fluent enough in R to be effective at getting my work tasks done quickly and dabble with making packages etc (nothing released yet), I expect it would take me a while to get to that level in Julia. But I would like to splash in snippets of Julia here and there were R is particularly slow, as opposed to writing large functions outright. I know many people use Rcpp in this way, but I’ve always found C++ to be utterly impenetrable for me. Learning Julia seems more achievable for me! So from that point of view, achieving blistering speed via a julia_unsafe_call for short functions sounds very appealing, but, I do understand why its something you need to think carefully about also! In any case JuliaCall is already an awesome package - thank you!
I must say, the response to my query here has been great and I’m blown away at all the options presented, and speeds that can be achieved - thank you all for the time you have put into this!
To elaborate on the unsafe aspects, see the comment here:
The asymmetry of the transfers is immediately striking to me. I may be mistaken, but we do not see a lot of overhead of transferring the result from Julia to R. This suggests a strategy of starting with the variable on the Julia side.
Perhaps @randy3k or @simonbyrne could comment from the Rcall.jl perspective. I suspect julia_[unsafe]_call might not be that unsafe if julia_call could protect the data from R’s garbage collection. Some combination of robject, RCall.protect, and unsafe_array might work.