R's dplyr and data.table 2x faster than Julia's DataFrames.jl + libraries

I got this data from https://www.kaggle.com/c/home-credit-default-risk and I just downloaded the data and loaded them and run the below.

TLDR: R seems 2x as fast!

using DataFrames, Statistics, DataFramesMeta
using DataConvenience
using CSV

bureau_bal = CSV.read("bureau_balance.csv")
bureau = CSV.read("bureau.csv")

function ok(bureau_bal)
    @> bureau_bal begin
        @where(:STATUS .!= "C")
        @where(:MONTHS_BALANCE .> -23)
        @transform(STATUS = parse.(Int, replace(:STATUS, "X"=>"0")))
        groupby(:SK_ID_BUREAU)
        @based_on(worst_status_l12m = maximum(:STATUS))
        rightjoin(bureau, on = :SK_ID_BUREAU)
    end
end

@time bureau_bal_summ = ok(bureau_bal); # 5s
@time bureau_bal_summ = ok(bureau_bal); # 5s

And running the above timing twice I get about 5s but the same in R (either dplyr or data.table) took only 2.5s

bureau = data.table::fread("c:/data/home-credit-default-risk/bureau.csv")

bureau_bal = data.table::fread("c:/data/home-credit-default-risk/bureau_balance.csv")

library(dplyr)

system.time(bureau_bal_summ <- bureau_bal %>% 
   filter(STATUS != "C", MONTHS_BALANCE > -23) %>% 
   mutate(STATUS = ifelse(STATUS=="X", 0, as.integer(STATUS))) %>% 
   group_by(SK_ID_BUREAU) %>% 
   summarise(worst_status_l12m = max(STATUS)) %>% 
   right_join(bureau, by = "SK_ID_BUREAU"))
   

library(data.table)
setDT(bureau_bal)

system.time(bureau_bal_dt <- {
  bureau_bal[, STATUSn := 0L]
  bureau_bal[!STATUS %chin% c("X", "C"), STATUSn := as.integer(STATUS)]
  
  merge(
    bureau_bal[(STATUS != "C") & (MONTHS_BALANCE > -23), .(worst_status_l12m = max(STATUSn)), SK_ID_BUREAU],
    bureau, 
    by = "SK_ID_BUREAU",
    all.y = TRUE,
    all.x = FALSE
  )
})
2 Likes

What is it without the join at the end? I think that (joins) is a known pain point that they are trying to sort out.

Good question. I tried. Similar story. The right join seems to be quite small cost verse the rest.

As you know - I tend to use DataFrames.jl mainly and this is what I get:

julia> @time @pipe filter([:STATUS, :MONTHS_BALANCE] => (x,y) -> x != "C" && y > -23, bureau_bal) |>
             setindex!(_, (x -> x=="X" ? 0 : parse(Int, x)).(_.STATUS), !, :STATUS) |> # or transform!, which is a bit slower as it does more work, but nothing significant
             groupby(_, :SK_ID_BUREAU) |>
             combine(_, :STATUS => maximum => :worst_status_l12m) |>
             rightjoin(_, bureau, on = :SK_ID_BUREAU);
  2.223288 seconds (7.74 M allocations: 1.383 GiB, 6.63% gc time)

where the most expensive part is rightjoin that takes over 1 second (and as noted above it is known that this where there is much to be improved).

(and on my laptop R codes take ~3 seconds)

So the reason for slow performance is that convenience packages most probably do not generate an efficient low-level DataFrames.jl code.

13 Likes

So need to update DataFramesMeta.jl it sounds like

Yes - @pdeffebach is working on it.

1 Like

The story seems more complicated

If the data comes from reading a CSV then it’s fast but if the data comes from a JDF-saved dataframe then it’s slower.

the types in the array seem to make a big difference.

The new DataFrames backend for @transform, etc. was only merged into master 7 days ago, so any speed improvements won’t be reflected on the release branch

2 Likes

I haven’t tried the example code above but CSV.File is faster than CSV.read. I got 8x speedup on loading 8000 rows and 26000 cols.

that’s not timed though. Also, you need to convert it to DAtaFrames to use it. You need to time that. Also you need to set copycols=true