Utilizing Julia's Speed in R

HeavyBulb · November 26, 2021, 10:57am

I have a data frame with 40,000 rows and a function written in R. Iterating over the whole data set takes 20 minutes with mapply, which applies the function to each row of the data frame.

Would it bring any benefit in speed to call julia from R and call the R function from julia?

jules · November 26, 2021, 11:04am

It always depends.

20 minutes for 40,000 rows means that one row takes about 30 ms. So it depends on whether the function itself can be sped up sufficiently, since R's for-loop overhead should not factor in much compared to 30 ms per call. Loop overhead matters more for millions or billions of iterations.
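For reference, the per-row figure is just the total runtime divided by the row count. A quick check, using only the numbers from the original post:

```julia
total_seconds = 20 * 60          # 20 minutes of wall-clock time
rows = 40_000                    # rows in the data frame
ms_per_row = 1000 * total_seconds / rows   # milliseconds spent per row
```

At that cost per call, the work inside the function dominates, not the loop around it.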

Certainly it would bring no speed benefit at all to loop in Julia and call the R function in each iteration.

Can you describe in more detail what the R function is actually doing? That would make it easier to guess whether Julia can bring a speed benefit. One common case where Julia can’t help is if you’re running large matrix multiplications that are handed off to BLAS routines anyway. But if you’re doing a lot of looping and possibly allocating lots of unnecessary memory in your R function, then Julia could probably cut that down substantially.

Elrod · November 26, 2021, 11:28am

Depends on how R was installed.
If installed from a Linux distro, you’re probably right.
If downloaded from the web, e.g. from CRAN, it probably has an archaic reference BLAS.
CRAN has put approximately zero effort into performance.

HeavyBulb · November 26, 2021, 11:38am

The R function actually has a pretty simple purpose: to calculate the time difference between two times, but only counting working hours.

So, the function makes a list of all dates between the two times and puts the two times at both ends of this list. Then it checks for every day whether it is a weekend or a workday and sets the working hours accordingly as a time period. What happens next is beyond me tbh. Some weird comparing of values, finding minima and maxima, and calculating basic differences from those, which ends in some intervals that get summed up to give the final value. Hence, I can’t identify anything more complex than basic arithmetic, and there are no loops.

nilshg · November 26, 2021, 11:45am

Hard to tell for certain without more exact information on what exactly the function is doing, but tbh this doesn’t feel like something which should take 20 mins for 40k date pairs. Here’s a simple example calculating business days for two date columns in a length 40k DataFrame in Julia:

julia> using DataFrames, Dates, BusinessDays

julia> df = (date1 = rand(Date(2000):Day(1):Date(2010), 40_000), date2 = rand(Date(2011):Day(1):Date(2021), 40_000));

julia> cal = BusinessDays.BRSettlement()
BusinessDays.BRSettlement()

julia> BusinessDays.initcache(cal)

julia> using BenchmarkTools

julia> @btime bdays.(cal, df.date1, df.date2);
  8.141 ms (160007 allocations: 2.75 MiB)

I guess you might have to multiply the number of business days by 8 to get working hours, and there might be some other adjustments based on what you say, but again 20 minutes feels long.

HeavyBulb · November 26, 2021, 12:14pm

I think you misinterpreted the goal: it’s not calculating days but a period in seconds, counting only those seconds that lie within changing working hours. That’s what makes this task so difficult.

For instance business hours are:
Mo-Fr 08:00-20:00
Sa, Su 10:00-18:00

Example times:
time1: 2021-11-25 19:30:00
time2: 2021-11-26 08:30:00

In this example, the function should return 60mins.
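To spell out where the 60 minutes come from, here is a minimal sketch using only the Dates standard library. The helper name `overlap` is made up for illustration: it clamps the interval to one day's opening window, so Thursday contributes 19:30 to 20:00 and Friday contributes 08:00 to 08:30.

```julia
using Dates

t1 = DateTime(2021, 11, 25, 19, 30)  # Thursday evening
t2 = DateTime(2021, 11, 26, 8, 30)   # Friday morning

# Overlap of [t1, t2] with one day's opening window, never negative.
# `overlap` is a hypothetical helper, not part of any package.
overlap(t1, t2, opens, closes) = max(min(t2, closes) - max(t1, opens), Millisecond(0))

thursday = overlap(t1, t2, DateTime(2021, 11, 25, 8), DateTime(2021, 11, 25, 20))
friday   = overlap(t1, t2, DateTime(2021, 11, 26, 8), DateTime(2021, 11, 26, 20))

# DateTime differences are in milliseconds; convert the sum to minutes.
total_minutes = Dates.value(thursday + friday) ÷ 60_000
```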

nilshg · November 26, 2021, 12:20pm

But that’s not hugely different, is it? It’s basically business days times 12, plus non-business days (although not clear how holidays should be treated?) times 8, potentially with some adjustment for the partial nature of the first and last day.

That’s more cumbersome to write than something I can knock out in two minutes, but it would surprise me if this additional logic made the function more than a hundred times slower than what I provided above. Even then it would take about 0.8 seconds, still far less than twenty minutes.

In any case the basic logic from my post applies: just write your function

function hours_count(t1, t2)
    ...
end

test that it does what you want with a few examples and then call hours_count.(df.datecol1, df.datecol2)
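As a rough illustration of that workflow (check a few cases by hand, then broadcast with the dot syntax), here is a sketch with a deliberately trivial stand-in. `elapsed_hours` is a hypothetical function that ignores business hours entirely; it only shows the calling pattern, and plain vectors stand in for the data frame columns.

```julia
using Dates

# Hypothetical stand-in: raw elapsed hours, with no business-hours logic at all.
elapsed_hours(t1::DateTime, t2::DateTime) = Dates.value(t2 - t1) / 3_600_000

starts = [DateTime(2021, 11, 25, 19, 30), DateTime(2021, 11, 26, 9, 0)]
stops  = [DateTime(2021, 11, 26, 8, 30),  DateTime(2021, 11, 26, 17, 0)]

# Check a few hand-computed cases first...
@assert elapsed_hours(starts[1], stops[1]) == 13.0
@assert elapsed_hours(starts[2], stops[2]) == 8.0

# ...then apply the function to whole columns with the dot syntax.
hours = elapsed_hours.(starts, stops)
```

The dot call is an ordinary compiled loop over the elements, so the function body can use plain if statements without any vectorization tricks.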

HeavyBulb · November 26, 2021, 6:52pm

You’re right, but the “some adjustment” part is stressing me out a lot. I finally found a working function for R and am a little hesitant to change anything. Considering how much time this project has taken me so far (6 weeks full time, during which I tried many different things) and how close I am to finishing it, I’m also hesitant to deepen my a-couple-of-minutes experience with Julia. That’s why I’m only considering calling the function from Julia and not writing a new one.

lungben · November 26, 2021, 7:18pm

Your use case does look like something that could be much faster in Julia than in R, possibly by up to a factor of 100. (At least I have observed similar differences between Python and Julia. I have no experience with R but think it is similar to Python in this regard.)
To get this speedup, however, you have to implement your hours_count function in Julia. This should not be too difficult, because you already have the R version to test against.

nilshg · November 26, 2021, 8:05pm

Ah, okay, so when you said “would it bring any benefit in speed to call julia from R and call the R function from julia” in your OP, you actually meant just using Julia to call the function you already have in R to speed it up. The answer to that question is no. I answered a similar question here a while ago with a link to a nice summary by Stefan of why Julia can’t magically speed up code written in other languages.

Now I don’t mean to sound condescending in any way so don’t get me wrong, but I think the “some adjustment” really wouldn’t be very hard to write in Julia (trust me I’m one of the more mediocre programmers on this forum!)

If you can get together five test cases and knock up an initial version of the function I’m sure people on here would be happy to help you get this over the line quite quickly.

Whether it makes sense to spend any effort on this of course really depends on how often you’ll need this function - when you say you’re quite close to finishing this project it might not be worth spending any time on this, but if you have to call it a few dozen more times before you finish your project then reducing the time from 20 minutes to 20 seconds might be well worth it!

dlakelan · November 26, 2021, 8:06pm

Unless this R function is proprietary or something, I suggest you post the R function here and someone will probably give you hints on how to do it in Julia. My guess is when we’re done the process will take less than 1s, probably much less.

jules · November 26, 2021, 10:45pm

It’s not calculating days but a period in seconds and only those that lie within changing working hours. That’s what makes this task so difficult.

For instance business hours are:
Mo-Fr 08:00-20:00
Sa, Su 10:00-18:00

Here’s a simple implementation of something like that. It looks like you always have business hours on a given day, so this is the most straightforward thing I could come up with:

using Dates
using Intervals
using DataFrames


function business_hours(date::Date)
    weekday = dayofweek(date)
    if 1 <= weekday <= 5
        DateTime(date, Time(8, 00)) .. DateTime(date, Time(20, 00))
    else
        DateTime(date, Time(10, 00)) .. DateTime(date, Time(18, 00))
    end
end

function business_seconds(datetime1, datetime2)
    interval = datetime1 .. datetime2
    days_to_iterate = Date(datetime1):Day(1):Date(datetime2)
    sum(days_to_iterate) do day
        span(intersect(interval, business_hours(day)))
    end |> Second
end


df = DataFrame(
    date1 = rand(DateTime(2018):Day(1):DateTime(2019), 40_000) .+ Second.(rand.(Ref(1:86400))),
    date2 = rand(DateTime(2020):Day(1):DateTime(2021), 40_000) .+ Second.(rand.(Ref(1:86400)))
)

For those test data it takes 3.5 seconds, but it depends on how far apart your dates are.

@time transform(df, [:date1, :date2] => ByRow(business_seconds))

3.499010 seconds (153 allocations: 945.469 KiB)
40000×3 DataFrame
   Row │ date1                date2                date1_date2_business_seconds 
       │ DateTime             DateTime             Second                       
───────┼────────────────────────────────────────────────────────────────────────
     1 │ 2018-04-22T20:45:00  2020-10-02T10:36:47  34929407 seconds
     2 │ 2018-11-16T13:51:33  2020-08-12T08:29:24  24791871 seconds
     3 │ 2018-08-31T22:46:41  2020-04-02T18:18:28  22659508 seconds
     4 │ 2018-07-26T00:07:57  2020-06-23T16:16:38  27303398 seconds
     5 │ 2018-10-06T06:00:43  2020-08-06T11:34:42  26192082 seconds

stevengj · November 27, 2021, 4:25am

Note that the R algorithm described above (building a list of every date between the two times) has runtime proportional to the number of days in the interval, when it should be possible to implement this with O(1) complexity.

Note that the Julia implementation above, which sums over every day in the interval, also has O(#days) complexity (although it avoids allocating a list of days as in the R implementation).

It really doesn’t seem that hard to adjust for the endpoints. Things are a lot easier when you don’t have to tie yourself into knots to use “vectorized” functions and built-ins, and you can just write if statements and they are fast.

From the sound of it, it seems highly likely that an O(1) Julia implementation (following the performance tips, e.g. type-stable and non-allocating) will be orders of magnitude faster than your R version, so it’s probably worth the effort if this is performance-critical.
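One way to get O(1) per pair, sketched here under the schedule from the thread (Mon-Fri 08:00-20:00, Sat/Sun 10:00-18:00) and assuming no holidays: count the business seconds from a fixed reference Monday up to each timestamp using whole weeks, whole days within the week, and a clamped time of day, then take the difference. All function names and the reference date are my own choices for illustration, and sub-second precision is ignored.

```julia
using Dates

# Assumed schedule, in seconds after midnight; dow 0 = Monday, ..., 6 = Sunday.
opening_window(dow) = dow < 5 ? (8 * 3600, 20 * 3600) : (10 * 3600, 18 * 3600)

# Business seconds between midnight of a reference Monday and `t` (hypothetical helper).
function business_seconds_since_reference(t::DateTime)
    reference_monday = Date(2001, 1, 1)            # any Monday works as the origin
    days = Dates.value(Date(t) - reference_monday)
    weeks, dow = fldmod(days, 7)                   # floored, so earlier dates work too
    weekday_len = 12 * 3600
    weekend_len = 8 * 3600
    total = weeks * (5 * weekday_len + 2 * weekend_len)
    # full days already completed in the current week
    total += min(dow, 5) * weekday_len + max(dow - 5, 0) * weekend_len
    # partial current day: clamp the time of day into the opening window
    opens, closes = opening_window(dow)
    second_of_day = 3600 * hour(t) + 60 * minute(t) + second(t)
    return total + clamp(second_of_day, opens, closes) - opens
end

# O(1) per pair: two evaluations and a subtraction (negative if t2 < t1).
business_seconds_o1(t1::DateTime, t2::DateTime) =
    Second(business_seconds_since_reference(t2) - business_seconds_since_reference(t1))
```

For the example from earlier in the thread, `business_seconds_o1(DateTime(2021, 11, 25, 19, 30), DateTime(2021, 11, 26, 8, 30))` gives 3600 seconds. The function uses only integer arithmetic and allocates nothing, so it can be broadcast over two columns directly.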

dlakelan · November 27, 2021, 5:39am

It seems likely that the biggest issues are to do with holidays. It should be possible to calculate O(1) assuming no holidays, and then do a perturbation after checking the holiday rules.
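A minimal sketch of what such a perturbation could look like, assuming holidays are fully closed days and are given as a sorted vector of dates (both are my assumptions, the thread does not specify holiday rules). The hypothetical `holiday_seconds` returns the business seconds that fall on holidays inside the interval; subtracting it from a no-holiday result gives the corrected value. The cost is proportional to the number of holidays in the interval, not the number of days.

```julia
using Dates

# Opening window on a given date under the schedule from the thread.
function opening_window(day::Date)
    if dayofweek(day) <= 5
        (DateTime(day, Time(8)), DateTime(day, Time(20)))
    else
        (DateTime(day, Time(10)), DateTime(day, Time(18)))
    end
end

# Business seconds that fall on holidays within [t1, t2].
# Assumes `holidays` is sorted and that holidays are closed all day.
function holiday_seconds(t1::DateTime, t2::DateTime, holidays::Vector{Date})
    first_idx = searchsortedfirst(holidays, Date(t1))
    last_idx = searchsortedlast(holidays, Date(t2))
    total = Millisecond(0)
    for i in first_idx:last_idx
        opens, closes = opening_window(holidays[i])
        # clamp to the interval so partial first and last days are handled
        total += max(min(t2, closes) - max(t1, opens), Millisecond(0))
    end
    return Second(Dates.value(total) ÷ 1000)
end

holidays = [Date(2021, 12, 24), Date(2021, 12, 25)]   # made-up example data
correction = holiday_seconds(DateTime(2021, 12, 24, 12), DateTime(2021, 12, 25, 12), holidays)
```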

jules · November 27, 2021, 3:36pm

Ok I think this one is better, I got nerdsniped:

Now I calculate one vector with a cumsum of the full business hours per day in seconds. This means each business hours interval has to be looked up only once. Then later, I can just look up differences of full-day intervals in that vector and only have to compute the fractional parts separately. I didn’t spend much time checking this except with the one example from above, but it runs in about 2.5ms for 40,000 items. One could also add more complicated logic for the business hours, I went with the simple example from above without holidays etc.

using Dates
using Intervals
using DataFrames


function business_intervals(date::Date)
    weekday = dayofweek(date)
    if 1 <= weekday <= 5
        DateTime(date, Time(8, 00)) .. DateTime(date, Time(20, 00))
    else
        DateTime(date, Time(10, 00)) .. DateTime(date, Time(18, 00))
    end
end

function business_seconds(d1s, d2s)
    # find first and last dates
    mi, ma = extrema([extrema(d1s)..., extrema(d2s)...])
    dmi = Date(mi)
    dma = Date(ma)
    all_days = dmi:Day(1):dma
    # query each day's business hours once
    all_time_intervals = business_intervals.(all_days)
    # accumulate durations in seconds over all days
    # durations between two full days can then be computed with two lookups and a difference
    cumulative_business_seconds = cumsum(Second(span(int)) for int in all_time_intervals)
    map(d1s, d2s) do d1, d2
        interval = d1 .. d2
        # compute day indices for lookup
        i1 = Dates.days(Date(d1) - dmi) + 1
        i2 = Dates.days(Date(d2) - dmi) + 1
        # time on the first day (this is the whole result if both times fall on the same day)
        total = Second(span(intersect(all_time_intervals[i1], interval)))
        # only add full days and the last day if the times are on different days;
        # this also avoids indexing with i2 - 1 < i1 in the same-day case
        if i2 > i1
            # full days strictly between the two dates, by direct lookup
            total += cumulative_business_seconds[i2 - 1] - cumulative_business_seconds[i1]
            # time on the last day
            total += Second(span(intersect(all_time_intervals[i2], interval)))
        end
        total
    end
end

julia> df = DataFrame(
           date1 = rand(DateTime(2018):Day(1):DateTime(2019), 40_000) .+ Second.(rand.(Ref(1:86400))),
           date2 = rand(DateTime(2020):Day(1):DateTime(2021), 40_000) .+ Second.(rand.(Ref(1:86400)))
       );

julia> business_seconds(
           [DateTime(2021, 11, 25, 19, 30, 00)],
           [DateTime(2021, 11, 26, 08, 30, 00)],
       )
1-element Vector{Second}:
 3600 seconds

julia> @time business_seconds(df.date1, df.date2);
  0.002505 seconds (6 allocations: 338.703 KiB)

HeavyBulb · November 29, 2021, 1:44pm

Thanks so much to all of you and specifically to @jules. That’s more than I ever expected. Maybe this function will be a great start into Julia for me. It looks pretty similar to the R version, if I’m seeing this correctly.
