R's dplyr and data.table 2x faster than Julia's DataFrames.jl + libraries

xiaodai · September 29, 2020, 4:56am

I downloaded the Home Credit Default Risk data from Kaggle, loaded it, and ran the code below.

TLDR: R seems 2x as fast!

using DataFrames, Statistics, DataFramesMeta
using DataConvenience
using CSV

bureau_bal = CSV.read("bureau_balance.csv")
bureau = CSV.read("bureau.csv")

function ok(bureau_bal)
    @> bureau_bal begin
        @where(:STATUS .!= "C")
        @where(:MONTHS_BALANCE .> -23)
        @transform(STATUS = parse.(Int, replace(:STATUS, "X"=>"0")))
        groupby(:SK_ID_BUREAU)
        @based_on(worst_status_l12m = maximum(:STATUS))
        rightjoin(bureau, on = :SK_ID_BUREAU)
    end
end

@time bureau_bal_summ = ok(bureau_bal); # 5s
@time bureau_bal_summ = ok(bureau_bal); # 5s

Running the above timing twice, I get about 5s each time, but the same operation in R (with either dplyr or data.table) took only 2.5s.

bureau = data.table::fread("c:/data/home-credit-default-risk/bureau.csv")

bureau_bal = data.table::fread("c:/data/home-credit-default-risk/bureau_balance.csv")

library(dplyr)

system.time(bureau_bal_summ <- bureau_bal %>% 
   filter(STATUS != "C", MONTHS_BALANCE > -23) %>% 
   mutate(STATUS = ifelse(STATUS=="X", 0, as.integer(STATUS))) %>% 
   group_by(SK_ID_BUREAU) %>% 
   summarise(worst_status_l12m = max(STATUS)) %>% 
   right_join(bureau, by = "SK_ID_BUREAU"))
   

library(data.table)
setDT(bureau_bal)

system.time(bureau_bal_dt <- {
  bureau_bal[, STATUSn := 0L]
  bureau_bal[!STATUS %chin% c("X", "C"), STATUSn := as.integer(STATUS)]
  
  merge(
    bureau_bal[(STATUS != "C") & (MONTHS_BALANCE > -23), .(worst_status_l12m = max(STATUSn)), SK_ID_BUREAU],
    bureau, 
    by = "SK_ID_BUREAU",
    all.y = TRUE,
    all.x = FALSE
  )
})

tbeason · September 29, 2020, 5:35am

What is the timing without the join at the end? I think joins are a known pain point that they are trying to sort out.

xiaodai · September 29, 2020, 5:38am

Good question. I tried. Similar story. The right join seems to be a fairly small cost versus the rest.
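One way to check where the time goes is to time the aggregation and the join as separate steps. The following is only a minimal sketch on synthetic data: the table sizes, the random values, and the names n_ids, n, t_group and t_join are assumptions made for illustration, not the Kaggle files or anyone's measured numbers. It uses plain DataFrames.jl calls rather than the DataFramesMeta macros from the first post.

```julia
using DataFrames

# Synthetic stand-ins with hypothetical sizes and values, not the Kaggle files.
n_ids = 10_000
n = 200_000
bureau_bal = DataFrame(
    SK_ID_BUREAU = rand(1:n_ids, n),
    MONTHS_BALANCE = rand(-96:0, n),
    STATUS = rand(["0", "1", "2", "X", "C"], n),
)
bureau = DataFrame(SK_ID_BUREAU = 1:n_ids)

# Filter and recode outside the timed regions.
filtered = filter([:STATUS, :MONTHS_BALANCE] => (s, m) -> s != "C" && m > -23, bureau_bal)
filtered.STATUS = [s == "X" ? 0 : parse(Int, s) for s in filtered.STATUS]

# Time the aggregation and the join separately.
t_group = @elapsed summ = combine(groupby(filtered, :SK_ID_BUREAU),
                                  :STATUS => maximum => :worst_status_l12m)
t_join = @elapsed joined = rightjoin(summ, bureau, on = :SK_ID_BUREAU)

println("groupby + combine: ", t_group, " s; rightjoin: ", t_join, " s")
```

The first call of each step includes compilation time, so the two timed lines should be run a second time before comparing them.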

bkamins · September 29, 2020, 7:19am

As you know - I tend to use DataFrames.jl mainly and this is what I get:

julia> @time @pipe filter([:STATUS, :MONTHS_BALANCE] => (x,y) -> x != "C" && y > -23, bureau_bal) |>
             setindex!(_, (x -> x=="X" ? 0 : parse(Int, x)).(_.STATUS), !, :STATUS) |> # or transform!, which is a bit slower as it does more work, but nothing significant
             groupby(_, :SK_ID_BUREAU) |>
             combine(_, :STATUS => maximum => :worst_status_l12m) |>
             rightjoin(_, bureau, on = :SK_ID_BUREAU);
  2.223288 seconds (7.74 M allocations: 1.383 GiB, 6.63% gc time)

where the most expensive part is rightjoin, which takes over 1 second (and as noted above, it is known that this is where there is much to be improved).

(and on my laptop the R code takes ~3 seconds)

So the reason for the slow performance is most probably that the convenience packages do not generate efficient low-level DataFrames.jl code.

xiaodai · September 29, 2020, 7:22am

So it sounds like DataFramesMeta.jl needs to be updated.

bkamins · September 29, 2020, 7:25am

Yes - @pdeffebach is working on it.

xiaodai · September 29, 2020, 9:49am

The story seems more complicated.

If the data comes from reading a CSV then it’s fast, but if the data comes from a JDF-saved dataframe then it’s slower.

The element types of the column arrays seem to make a big difference.
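A quick way to look into this is to compare the element type of the same column under both loading paths. The sketch below assumes, purely for illustration, that the slower path produced a column with the abstract element type Any; the thread does not say which types JDF actually returned, and status_any, status_str and count_not_c are hypothetical names.

```julia
# Hypothetical stand-in for a STATUS column that was loaded with eltype Any.
status_any = Any["0", "1", "X", "C", "0"]

# Broadcasting identity rebuilds the vector with the narrowest element type.
status_str = identity.(status_any)

println(eltype(status_any), " -> ", eltype(status_str))

# The same function gives the same answer on both vectors, but only the
# concretely typed one lets the compiler specialize the comparison.
count_not_c(v) = count(x -> x != "C", v)
println(count_not_c(status_any), " ", count_not_c(status_str))
```

On a real data frame, checking eltype on each column of both versions would show whether the two loading paths disagree on types.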

pdeffebach · September 29, 2020, 12:43pm

The new DataFrames backend for @transform, etc. was only merged into master 7 days ago, so any speed improvements won’t be reflected on the release branch yet.

vtomar · September 30, 2020, 3:11am

I haven’t tried the example code above, but CSV.File is faster than CSV.read. I got an 8x speedup on loading 8000 rows and 26000 columns.

xiaodai · September 30, 2020, 3:17am

That’s not timed, though. Also, you need to convert it to a DataFrame to use it, and you need to time that conversion. You also need to set copycols=true.
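To make that comparison fair, parsing and conversion can be timed as two separate steps. This is a rough sketch that assumes a small throwaway CSV written to a temporary path; the file contents and the names path, t_parse and t_convert are made up for illustration and say nothing about the timings on the Kaggle files.

```julia
using CSV, DataFrames

# Write a small throwaway file (hypothetical contents) to a temporary path.
path = tempname() * ".csv"
CSV.write(path, DataFrame(a = 1:1000, b = rand(1000)))

# Step 1: parse the file into a CSV.File.
t_parse = @elapsed f = CSV.File(path)

# Step 2: convert to a DataFrame, copying the columns so the result
# owns its own mutable vectors.
t_convert = @elapsed df = DataFrame(f; copycols = true)

println("parse: ", t_parse, " s; convert: ", t_convert, " s")
```

Both numbers include compilation on the first run, so they are only meaningful on a second call.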
