- Don’t benchmark in global scope (performance tips)
- Use BenchmarkTools for benchmarking
- The first run includes compile time
The issue here is likely the use of `Missing`. If it's that slow, then you're not on Julia v0.7-alpha: the optimizations that make this fast only exist on the newest alpha version.
Using BenchmarkTools (as suggested by Stefan) and Julia v0.7-alpha (as suggested by Chris), I see Julia being only slightly slower than R:
Edit: And the very latest Julia nightlies are even faster than R (see below)
```julia
julia> using BenchmarkTools

julia> @btime sum(skipmissing($y))
  1.329 s (6 allocations: 96 bytes)
```
```r
> system.time(sum(x, na.rm = T))
   user  system elapsed
  0.940   0.000   0.941
```
That’s actually pretty good, since presumably R’s `sum()` is just calling a C function. Calling a single (C-implemented) function on a vector of data is pretty much the best possible case for R/NumPy/MATLAB/etc. The important difference is that Julia’s `sum(skipmissing(y))` is just calling Julia code all the way down. So if you want to implement your own kind of summation, you actually can, and it will still be fast. Compare:
```julia
julia> function mysum(x)
           result = zero(eltype(x))
           for element in x
               if !ismissing(element)
                   result += element
               end
           end
           result
       end
mysum (generic function with 1 method)

julia> @btime mysum($y)
  809.128 ms (0 allocations: 0 bytes)
```
vs R:
```r
> mysum <- function(x) {
+     result <- 0
+     for (element in x) {
+         if (!is.na(element)) {
+             result <- result + element
+         }
+     }
+     result
+ }
> system.time(mysum(x))
   user  system elapsed
107.094   0.000 107.093
```
The hand-written R `sum` is over 100 times slower than Julia's.
So yes, if all you ever want to do is compute sums over vectors, then R is fine. But if you want to do something that doesn’t have a fast built-in C implementation, then you really want Julia.
By the way,

> Are sentinel values to represent missing values really out of the question?

No, certainly not. Julia v0.7 stores vectors of things like `Union{Float64, Missing}` efficiently, using a separate type-tag byte for each element. Or you can implement your own, which is what DataArrays does: https://github.com/JuliaStats/DataArrays.jl/blob/49004b5f82bb92ec7790805a0dda10b8c3aeb68b/src/dataarray.jl#L32-L70 and it will be fast, too.
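To see that compact storage concretely, here's a quick sketch (the exact byte counts are an assumption and vary by Julia version and platform; the point is roughly one extra tag byte per element rather than a boxed pointer):

```julia
# Compare memory use of a plain Float64 vector vs. a small-Union vector.
n = 1_000
plain  = rand(n)                                  # Vector{Float64}
tagged = Vector{Union{Float64, Missing}}(plain)   # same values, Union element type

# The Union vector costs roughly one extra type-tag byte per element.
extra = Base.summarysize(tagged) - Base.summarysize(plain)
```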
Why is your `sum` implementation so much faster than the one with `skipmissing`, though? What's going on in the `skipmissing` part that's slowing it down? I assumed it was just an iterator that skipped over the missing values, which should be almost equivalent to what you have?
Yeah, I was surprised by that as well. It also allocates some memory, even when I interpolate the `skipmissing(y)` call itself so its construction isn't part of the benchmark:

```julia
julia> @btime sum($(skipmissing(y)))
  1.323 s (5 allocations: 80 bytes)
```
It looks like the implementation of `SkipMissing` iteration hasn't been updated to use the new `iterate()` protocol as of v0.7-alpha, but it has been updated on `master`. Perhaps this will be faster on `master`.
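For reference, the new protocol drives a loop like this one (a minimal sketch of summing over `skipmissing`, not Base's actual implementation):

```julia
# Minimal sketch of summing via the v0.7+ iterate() protocol.
function sum_iterate(itr)
    total = 0.0
    next = iterate(itr)              # returns (value, state), or nothing when done
    while next !== nothing
        (val, state) = next
        total += val                 # skipmissing has already filtered the missings
        next = iterate(itr, state)
    end
    return total
end

sum_iterate(skipmissing([1.0, missing, 2.5, missing, 3.0]))  # 6.5
```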
Holy cow! Yeah, it’s 4 times faster (and more than 2X faster than R) on the very latest Julia nightly:
```julia
julia> versioninfo()
Julia Version 0.7.0-alpha.214
Commit ffc9182b3a (2018-06-20 19:45 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3920XM CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)

julia> @btime sum(skipmissing($y))
  381.473 ms (1 allocation: 16 bytes)
```
I’m impressed!
Hi,
This does not change anything. I obtain these results in v0.7, even with BenchmarkTools and using local scope. See below.
To be more precise, I obtain 0.5s in Julia vs 0.2s in R
```julia
using BenchmarkTools, Compat

x = rand(200_000_000)
y = replace(x -> x <= 0.01, x, missing)

function f(y)
    @btime sum(skipmissing(y))
end

f(y)
```

```julia
julia> versioninfo()
Julia Version 0.7.0-alpha.0
Commit 22590d529d (2018-05-31 00:07 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
```
See the last two sections of my blog post. We get incredibly high performance on recent git master for a manual sum over `Vector{Int}`, performance similar to R for a manual sum over `Vector{Float64}`, but somewhat poorer performance using `skipmissing`. I filed this issue about it yesterday.
Anyway, as @rdeits has noted, R's `sum` is implemented in C, so being only about twice as slow in pure Julia is already quite good. If you write a sum in pure R, it will be much slower than that! More fundamentally, the fact that writing the loop manually gives better performance means that we should be able to match R relatively easily, and probably beat it at least in some cases.
`y` is still a non-const global variable. See https://docs.julialang.org/en/stable/manual/performance-tips/#Avoid-global-variables-1 and compare to how rdeits is doing the benchmarking in With Missings, Julia is slower than R - #4 by rdeits.
I thought that your use of `zero(eltype(x))` would be problematic, but apparently `zero(Union{Missing, T})` is equivalent to `zero(T)`. What is the general rule governing this? (Edit: `promote_type`, maybe?)
The general rule is that `missing` behaves as if it were some value in the domain `T`, but you don't know which one. So `zero(Union{Missing, T}) = zero(T)` makes sense, since the zero of a missing `T` value is just the zero of `T`.
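A quick REPL check (on v0.7 or later):

```julia
# zero over a small Union delegates to the non-missing type's zero.
zero(Union{Missing, Float64})  # 0.0
zero(Union{Missing, Int})      # 0
```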
Thanks to nalimilan for being on this already!
The same issue happens with sort, though. Julia takes about 1.5× R's time to sort a Float64 vector without missing values, but that jumps to about 3.5× R's time for a Float64 vector with missing values. Do you think there will be a way to make sorting roughly as fast as R?
For @rdeits: I understand that Julia is much more flexible than R, but Julia is also supposed to be a more performant language. Given the benchmarks on the front page of the Julia website, it is reasonable to expect Julia to be roughly on par with R functions, even the ones coded in C. More importantly, making sure that basic library functions such as `sum` or `sort` are efficient for the Float64 case is essential, because these functions will be called very frequently.
Can you post (a) the Julia and R code you're using, (b) the actual results, and (c) the exact versions of both? Without that information it's impossible to say anything at all.
```julia
# Julia Version 0.7.0-beta.9
Y1 = rand(Float64, 10_000_000);
Y2 = ifelse.(rand(length(Y1)) .< 0.9, Y1, missing);

@btime sort($Y1)
#> 764.978 ms (2 allocations: 76.29 MiB)
@btime sort($Y2)
#> 1.797 s (4 allocations: 128.75 MiB)
```

```r
# in R version 3.5.0 (2018-04-23)
Y1 = rnorm(10000000)
Y2 = ifelse(runif(length(Y1)) < 0.9, Y1, NA)

system.time(sort(Y1))
#>   user  system elapsed
#>  0.616   0.043   0.670
system.time(sort(Y2))
#>   user  system elapsed
#>  0.598   0.038   0.640
```
I get essentially no difference between Julia v0.7.0-beta.5 and R 3.4.4 when sorting Float64s on my 2018 i7 laptop:
```julia
julia> @btime sort(y) setup=(y = rand(Float64, 10_000_000))
  906.928 ms (2 allocations: 76.29 MiB)
```
(I’m putting the generation of y into the setup to also randomize the initial sorted-ness of the data, which substantially affects run-time).
```r
> Y1 = rnorm(10000000)
> system.time(sort(Y1))
   user  system elapsed
  0.963   0.060   1.047
```
although I do see the `missing` version being substantially slower in Julia:
```julia
julia> @btime sort(y2) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing))
  2.186 s (4 allocations: 128.75 MiB)
```
I think this actually indicates a real performance bug. Julia has an in-place sort which does not allocate memory:
```julia
julia> @btime sort!(y) setup=(y = rand(100)) evals=1
  2.414 μs (0 allocations: 0 bytes)
```
and the default `sort(x)` just copies `x` and then does `sort!(x_copy)`. However, that in-place sort seems to allocate memory when it shouldn't for arrays with `Missing`s:
```julia
julia> @btime sort!(y2) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing)) evals=1
  4.157 μs (2 allocations: 624 bytes)
```
That could be a real (and fixable) performance bug.
Good catch! Looking at the time profile and at allocations, I couldn’t find anything obvious. It would be worth filing an issue.
Keep up the performance testing! It's hard to think of all the possible cases.
Is it possible that it’s not using the floating-point quicksort code and is using a merge sort instead?
Seems like it.
```julia
julia> Base.Sort.defalg(y2)
Base.Sort.MergeSortAlg()

julia> @btime sort(y2) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.635 s (4 allocations: 128.75 MiB)

julia> @btime sort(y2, alg=MergeSort) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.638 s (5 allocations: 128.75 MiB)

julia> @btime sort(y2, alg=QuickSort) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.432 s (3 allocations: 85.83 MiB)
```
Note that until the fixes are merged, a relatively straightforward workaround is to use `sort!(collect(skipmissing(Y2)), alg=QuickSort)`. It's about as fast as R and twice as fast as `sort(Y2)`. It's also what R does by default (`na.last=NA`), i.e. skipping NAs (keeping NAs doesn't affect performance).
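In miniature, the workaround looks like this (a sketch with toy data; `QuickSort` is exported from Base):

```julia
# Drop the missings first, then quicksort the resulting Float64 vector.
Y2 = [1.0, missing, 0.5, missing, 2.0]
sorted = sort!(collect(skipmissing(Y2)), alg=QuickSort)
# sorted == [0.5, 1.0, 2.0] — missings are dropped, like R's na.last=NA
```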
We should probably define `sort(itr::SkipMissing)` to call `sort!(collect(itr), ...)`, since that's a common need and it's the most efficient possible definition.
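Something like the following (a hypothetical local helper rather than an actual `Base` method, to sketch the idea):

```julia
# Hypothetical helper mirroring the proposed sort(itr::SkipMissing) definition.
# collect() produces a Vector of the non-missing element type, so sort! stays fast.
sort_skipmissing(itr; kws...) = sort!(collect(itr); kws...)

sort_skipmissing(skipmissing([3.0, missing, 1.0, 2.0]))  # [1.0, 2.0, 3.0]
```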