With Missings, Julia is slower than R


#1

As shown by the code below, Julia is twice slower than R (!) to sum elements of a Float64 vector with missing values, even with Julia 0.7.
In R, missing values are encoded with sentinelle values. In Julia, they are encoded with different type.
This difference of design makes it much slower to manipulate vectors of Float64 with missing s n Julia.
Are sentinelle values to represent missing values really out of the question?

# in Julia
using Missings
x = rand(200_000_000)
y = replace(x-> x<= 0.01, x, missing)
@time sum(skipmissing(y))

# in R
x = runif(2 * 1e8)
x[x<=0.01] = NA
system.time(sum(x, na.rm = T))
``

#2

#3

The issue here is likely using Missing. If you have to do that, then you’re not on Julia v0.7-alpha. The optimizations to make this fast only exist on the newest alpha version.


#4

Using BenchmarkTools (as suggested by Stefan) and Julia v0.7-alpha (as suggested by Chris), I see Julia being only slightly slower than R:

Edit: And the very latest Julia nightlies are even faster than R (see below)

julia> using BenchmarkTools

julia> @btime sum(skipmissing($y))
  1.329 s (6 allocations: 96 bytes)
> system.time(sum(x, na.rm = T))
   user  system elapsed 
  0.940   0.000   0.941

That’s actually pretty good, since presumably R’s sum() is just calling a C function. Calling a single (C-implemented) function on a vector of data is pretty much the best possible case for R/NumPy/MATLAB/etc.

The important difference is that Julia’s sum(skipmissing(y)) is just calling Julia code all the way down. So if you want to implement your own kind of summation, you actually can and it will still be fast. Compare:

julia> function mysum(x)
         result = zero(eltype(x))
         for element in x
           if !ismissing(element)
             result += element
           end
         end
         result
       end
mysum (generic function with 1 method)

julia> @btime mysum($y)
  809.128 ms (0 allocations: 0 bytes)

vs R:

> mysum <- function(x) {
+   result <- 0
+   for (element in x) {
+     if (!is.na(element)) {
+       result <- result + element
+     }
+   }
+  result
+ }
> system.time(mysum(x))
   user  system elapsed 
107.094   0.000 107.093 

The hand-written R sum is over 100 times slower than Julia.

So yes, if all you ever want to do is compute sums over vectors, then R is fine. But if you want to do something that doesn’t have a fast built-in C implementation, then you really want Julia.

By the way,

Are sentinelle values to represent missing values really out of the question?

No, certainly not. Julia v0.7 stores vectors of things like Union{Float64, Missing} efficiently by using a sentinel bit for each element. Or you can implement your own, which is what DataArrays does: https://github.com/JuliaStats/DataArrays.jl/blob/49004b5f82bb92ec7790805a0dda10b8c3aeb68b/src/dataarray.jl#L32-L70 and it will be fast, too.


#5

Why is your sum implementation so much faster than the one with skipmissing though? What’s going on in the skipmissing part that’s slowing it down? I assumed it was just an iterator that skipped over the missing values which should be almost equivalent to what you have?


#6

Yeah, I was surprised by that as well. It also allocates some memory, even if I omit the skipmissing() itself:

julia> @btime sum($(skipmissing(y)))
  1.323 s (5 allocations: 80 bytes)

It looks like the implementation of SkipMissing iteration hasn’t been updated to use the new iterate() protocol as of v0.7-alpha, but it has been updated on master. Perhaps this will be faster on master.


#7

Holy cow! Yeah, it’s 4 times faster (and more than 2X faster than R) on the very latest Julia nightly:

julia> versioninfo()
Julia Version 0.7.0-alpha.214
Commit ffc9182b3a (2018-06-20 19:45 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3920XM CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)

julia> @btime sum(skipmissing($y))
  381.473 ms (1 allocation: 16 bytes)

I’m impressed!


Is master converging to v0.7?
#8

Hi,
This does not change anything. I obtain these results in v0.7, even with BenchmarkTools and using local scope. See below.
To be more precise, I obtain 0.5s in Julia vs 0.2s in R

using BenchmarkTools, Compat
x = rand(200_000_000)
y = replace(x-> x<= 0.01, x, missing)
function f(y)
	@btime sum(skipmissing(y))
end
f(y)
versioninfo()
Julia Version 0.7.0-alpha.0
Commit 22590d529d (2018-05-31 00:07 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7567U CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

#9

See the two last sections of my blog post. We get an incredibly high performance on recent git master for a manual sum over Vector{Int}, a similar performance as R for a manual sum over Vector{Float64}, but a somewhat poorer performance using skipmissing. I’ve filed this issue about it yesterday.

Anyway, as @rdeits has noted, R’s sum is implemented in C, so being only twice slower than it in pure Julia is already quite good. If you write a sum in pure R, it will be much slower than that! More fundamentally, the fact that writing the loop manually gives a better performance means that we should be able to match R relatively easily, and probably beat it at least in some cases.


#10

y is still a non-const global variable. See https://docs.julialang.org/en/stable/manual/performance-tips/#Avoid-global-variables-1 and compare to how rdeits is doing the benchmarking in With Missings, Julia is slower than R.


#11

I thought that your use of zero(eltype(x)) would be problematic, but apparently zero(Union{Missing, T}) is equivalent to zero(T).

What is the general rule governing this? (Edit: promote_type, maybe?)


#12

The general rule is that missing behaves as if it were some value in the domain T but you don’t know which one. So zero(Union{Missing, T}) = zero(T) makes sense since the zero of a missing T value is just the zero of T.


#13

Thanks to nalimilan for being on this already!

Same issue happens with sort though. Julia takes 1.5 R to sort a Float64 vector without missing values, but this jumps to 3.5 R to sort a Float64 vector with missing values. Do you think there will be a way to make sorting roughly as fast as R?

For @rdeits, I understand that Julia is much more flexible than R, but Julia is also supposed to be a more performant language. Given the benchmarks that are on the first page on the Julia website, it is reasonable to expect Julia to be roughly on par with R functions, even the ones coded in C. More importantly, making sure that basic library functions such as sum or sort are efficient for the case of Float64 is essential, because these functions will be called very frequently.


#14

Can you post (a) the Julia and R code you’re using, (b) the actual results, and (c) the exact versions of both? Without that information it’s impossible to say anything at all.


#15
# Julia Version 0.7.0-beta.9
Y1 = rand(Float64, 10_000_000);
Y2 = ifelse.(rand(length(Y1)) .< 0.9, Y1, missing);
 @btime sort($Y1)
#>   764.978 ms (2 allocations: 76.29 MiB)
 @btime sort($Y2)
#>  1.797 s (4 allocations: 128.75 MiB)

# in R version 3.5.0 (2018-04-23)
Y1 = rnorm(10000000)
Y2 = ifelse(runif(length(Y1)) < 0.9, Y1, NA)
system.time(sort(Y1))
#>    user  system elapsed 
#>  0.616   0.043   0.670
system.time(sort(Y2))
#>   user  system elapsed 
#>  0.598   0.038   0.640 

#16

I get essentially no difference between Julia v0.7.0-beta.5 and R 3.4.4 when sorting Float64s on my 2018 i7 laptop:

julia> @btime sort(y) setup=(y = rand(Float64, 10_000_000))
  906.928 ms (2 allocations: 76.29 MiB)

(I’m putting the generation of y into the setup to also randomize the initial sorted-ness of the data, which substantially affects run-time).

> Y1 = rnorm(10000000)
^[system.time(sort(Y1))
   user  system elapsed 
  0.963   0.060   1.047 

although I do see the missing version being substantially slower in Julia:

julia> @btime sort(y2) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing))
  2.186 s (4 allocations: 128.75 MiB)

I think this actually indicates a real performance bug. Julia has an in-place sort which does not allocate memory:

julia> @btime sort!(y) setup=(y = rand(100)) evals=1
  2.414 μs (0 allocations: 0 bytes)

and the default sort(x) just copies x and then does sort!(x_copy). However, that in-place sort seems to allocate memory when it shouldn’t for arrays with Missings:

julia> @btime sort!(y2) setup=(y = rand(100); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing)) evals=1
  4.157 μs (2 allocations: 624 bytes)

That could be a real (and fixable) performance bug.


#17

Good catch! Looking at the time profile and at allocations, I couldn’t find anything obvious. It would be worth filing an issue.

Keep up testing performance! It’s hard to think about all possible tests.


#18

Is it possible that it’s not using the floating-point quicksort code and is using a merge sort instead?


#19

x-ref https://github.com/JuliaLang/julia/issues/27781


#20

Seems like it.

julia> Base.Sort.defalg(y2)
Base.Sort.MergeSortAlg()

julia> @btime sort(y2) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.635 s (4 allocations: 128.75 MiB)

julia> @btime sort(y2, alg=MergeSort) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.638 s (5 allocations: 128.75 MiB)

julia> @btime sort(y2, alg=QuickSort) setup=(y = rand(Float64, 10_000_000); y2 = ifelse.(rand(length(y)) .< 0.9, y, missing));
  1.432 s (3 allocations: 85.83 MiB)