# Utilizing Julia's Speed in R

I have a data frame with 40,000 rows and a function programmed in R. It takes 20 minutes to iterate through the whole data set with `mapply`, which applies a function to each row of a data frame.

Would it bring any benefit in speed to call julia from R and call the R function from julia?

It always depends.

20 minutes for 40,000 rows means that one row takes about 30 ms. So it depends on whether the function itself can be sufficiently sped up, as R's for-loop overhead should barely register against 30 ms per iteration. Loop overhead matters more for millions or billions of iterations.
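For reference, that per-row estimate is just arithmetic:

```julia
# 20 minutes spread over 40_000 rows, expressed in milliseconds per row
per_row_ms = 20 * 60 * 1000 / 40_000
# 30.0 ms per row, so microsecond-scale loop overhead is negligible here
```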

Certainly it would bring no speed benefit at all to loop in Julia and call the R function in each iteration.

Can you describe in more detail what the R function is actually doing? That would make it easier to guess whether Julia can bring a speed benefit. One common case where Julia can't help is large matrix multiplications, which are handed off to BLAS routines anyway. But if you're doing a lot of looping and possibly allocating lots of unnecessary memory in your R function, then Julia could probably cut that down substantially.

Depends on how R was installed.
If installed from a Linux distro, you're probably right.
If downloaded from the web, e.g. from CRAN, it probably has an archaic reference BLAS.
CRAN has put approximately zero effort into performance.

3 Likes

The R function actually has a pretty simple purpose: to calculate the time difference between two timestamps, but counting only working hours.

So, the function makes a list of all dates between the two times, with the two timestamps at either end of this list. Then it checks for every day whether it is a weekend or a workday and sets the working hours accordingly as a time period. What happens next is beyond me, tbh: some weird comparing of values, finding minima and maxima, and calculating basic differences from those, which ends in some intervals that get summed up into the final value. Hence, I don't see anything more complex than basic arithmetic, and no loops.

Hard to tell for certain without more exact information on what exactly the function is doing, but tbh this doesn't feel like something that should take 20 minutes for 40k date pairs. Here's a simple example calculating business days for two date columns of length 40k in Julia:

```julia
julia> using DataFrames, Dates, BusinessDays

julia> cal = BusinessDays.WeekendsOnly(); # `cal` was missing from the original snippet; this calendar counts Mon-Fri as business days

julia> df = (date1 = rand(Date(2000):Day(1):Date(2010), 40_000), date2 = rand(Date(2011):Day(1):Date(2021), 40_000));

julia> using BenchmarkTools

julia> @btime bdays.(cal, df.date1, df.date2);
  8.141 ms (160007 allocations: 2.75 MiB)
```

I guess you might have to multiply the number of business days by 8 to get working hours, and there might be some other adjustments based on what you say, but again 20 minutes feels long.

6 Likes

I think you misinterpreted the goal: it's not calculating days but a period in seconds, and only those seconds that fall within changing working hours. That's what makes this task so difficult.

Mo-Fr 08:00-20:00
Sa, Su 10:00-18:00

Example times:
time1: 2021-11-25 19:30:00
time2: 2021-11-26 08:30:00

In this example, the function should return 60 minutes.

But that's not hugely different, is it? It's basically business days times 12 hours, plus non-business days (although it's not clear how holidays should be treated?) times 8 hours, potentially with some adjustment for the partial nature of the first and last day.

That's more cumbersome to write than something I can knock out in two minutes, but it would surprise me if this additional logic made the function more than a hundred times slower than what I provided above - which would then take 8 seconds, still quite a bit less than twenty minutes.

In any case the basic logic from my post applies: just write your function

```julia
function hours_count(t1, t2)
    ...
end
```

test that it does what you want with a few examples, and then call `hours_count.(df.datecol1, df.datecol2)`.
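A naive day-by-day version of such an `hours_count` (a sketch with hypothetical names, ignoring holidays) reproduces the 60-minute example from earlier in the thread:

```julia
using Dates

# Working hours from the example: Mon-Fri 08:00-20:00, Sat-Sun 10:00-18:00
window(day) = dayofweek(day) <= 5 ? (Time(8), Time(20)) : (Time(10), Time(18))

# Naive O(#days) reference implementation: clamp the query interval into each
# day's working-hours window and add up the overlaps.
function working_seconds(t1::DateTime, t2::DateTime)
    total = Second(0)
    for day in Date(t1):Day(1):Date(t2)
        lo, hi = window(day)
        a = max(t1, DateTime(day, lo))
        b = min(t2, DateTime(day, hi))
        b > a && (total += Second(b - a))
    end
    total
end

working_seconds(DateTime(2021, 11, 25, 19, 30), DateTime(2021, 11, 26, 8, 30))
# 30 min Thursday evening + 30 min Friday morning = Second(3600)
```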

1 Like

You're right, but the "some adjustment" part is stressing me out a lot. I finally found a working function for R and am a little hesitant to change anything. Considering how much time this project has taken me so far (6 weeks full time, where I tried many different things) and how close I am to finishing it, I'm also hesitant to deepen my a-couple-of-minutes experience with Julia. That's why I'm only considering calling the function from Julia and not writing a new one.

Your use case indeed looks like something that could be much faster in Julia than in R - possibly up to a factor of 100 (at least I have observed similar differences between Python and Julia; I have no experience with R but think it is similar to Python in this regard).
To get this speedup, however, you have to implement your `hours_count` function in Julia. If you already have the R version, this should not be too difficult, because you can test the Julia version against it.

Ah, okay, so when you said "would it bring any benefit in speed to call julia from R and call the R function from julia" in your OP, you actually meant just using Julia to call the function you already have in R to speed it up. To that question the answer is no; I answered a similar question here a while ago with a link to a nice summary by Stefan of why Julia can't magically speed up code written in other languages.

Now I don't mean to sound condescending in any way, so don't get me wrong, but I think the "some adjustment" really wouldn't be very hard to write in Julia (trust me, I'm one of the more mediocre programmers on this forum!).

If you can get together five test cases and knock up an initial version of the function Iâ€™m sure people on here would be happy to help you get this over the line quite quickly.

Whether it makes sense to spend any effort on this of course really depends on how often you'll need this function - when you say you're quite close to finishing this project, it might not be worth spending any time on this, but if you have to call it a few dozen more times before you finish, then reducing the runtime from 20 minutes to 20 seconds might be well worth it!

1 Like

Unless this R function is proprietary or something, I suggest you post it here and someone will probably give you hints on how to do it in Julia. My guess is that when we're done, the process will take less than 1 s, probably much less.

3 Likes

> It's not calculating days but a period in seconds, and only those seconds that fall within changing working hours. That's what makes this task so difficult.
>
> Mo-Fr 08:00-20:00
> Sa, Su 10:00-18:00

Here's a simple implementation of something like that. It looks like you always have business hours on a given day, so this is the most straightforward thing I could come up with:

```julia
using Dates
using Intervals
using DataFrames

function business_hours(date)
    weekday = dayofweek(date)
    if 1 <= weekday <= 5
        DateTime(date, Time(8, 00)) .. DateTime(date, Time(20, 00))
    else
        DateTime(date, Time(10, 00)) .. DateTime(date, Time(18, 00))
    end
end

function business_seconds(datetime1, datetime2)
    interval = datetime1 .. datetime2
    days_to_iterate = Date(datetime1):Day(1):Date(datetime2)
    sum(days_to_iterate) do day
        # overlap of the query interval with that day's business hours
        span(intersect(business_hours(day), interval))
    end |> Second
end

df = DataFrame(
    date1 = rand(DateTime(2018):Day(1):DateTime(2019), 40_000) .+ Second.(rand.(Ref(1:86400))),
    date2 = rand(DateTime(2020):Day(1):DateTime(2021), 40_000) .+ Second.(rand.(Ref(1:86400)))
)
```

For those test data it takes 3.5 seconds, but it depends on how far apart your dates are.

```julia
julia> @time transform(df, [:date1, :date2] => ByRow(business_seconds))
  3.499010 seconds (153 allocations: 945.469 KiB)
40000×3 DataFrame
   Row │ date1                date2                date1_date2_business_seconds
       │ DateTime             DateTime             Second
───────┼───────────────────────────────────────────────────────────────────────
     1 │ 2018-04-22T20:45:00  2020-10-02T10:36:47  34929407 seconds
     2 │ 2018-11-16T13:51:33  2020-08-12T08:29:24  24791871 seconds
     3 │ 2018-08-31T22:46:41  2020-04-02T18:18:28  22659508 seconds
     4 │ 2018-07-26T00:07:57  2020-06-23T16:16:38  27303398 seconds
     5 │ 2018-10-06T06:00:43  2020-08-06T11:34:42  26192082 seconds
```

Note that this algorithm has runtime proportional to the number of days in the interval, when it should be possible to implement this with O(1) complexity.

Note that this implementation also has O(#days) complexity (although it avoids allocating a list of days as in the R implementation).

It really doesn't seem that hard to adjust for the endpoints. Things are a lot easier when you don't have to tie yourself into knots to use "vectorized" functions and built-ins, and you can just write `if` statements and they are fast.

From the sound of it, it seems highly likely that an O(1) Julia implementation (following the performance tips, e.g. type-stable and non-allocating) will be orders of magnitude faster than your R version, so it's probably worth the effort if this is performance-critical.

3 Likes

It seems likely that the biggest issues have to do with holidays. It should be possible to do the calculation in O(1) assuming no holidays, and then apply a perturbation after checking the holiday rules.
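A minimal sketch of that O(1)-ignoring-holidays idea (all names here are hypothetical, not from the code elsewhere in this thread): precompute the working seconds per weekday, get the full days in closed form with one `fldmod`, and add the fractional first and last days.

```julia
using Dates

# Schedule from the thread: Mon-Fri 08:00-20:00 (12 h), Sat-Sun 10:00-18:00 (8 h)
const DAILY = [12, 12, 12, 12, 12, 8, 8] .* 3600   # working seconds, Mon..Sun
const WEEK = sum(DAILY)                            # 76 h of working time per week

window(day::Date) = dayofweek(day) <= 5 ? (Time(8), Time(20)) : (Time(10), Time(18))

# Working seconds from a fixed epoch Monday up to (not including) `day`:
# one fldmod instead of a loop over days, hence O(1)
function prefix_seconds(day::Date)
    weeks, rem = fldmod(Dates.days(day - Date(2000, 1, 3)), 7)  # 2000-01-03 was a Monday
    weeks * WEEK + sum(view(DAILY, 1:rem))
end

# Working seconds elapsed on `t`'s own day before time `t`
function partial_seconds(t::DateTime)
    lo, hi = window(Date(t))
    a, b = DateTime(Date(t), lo), DateTime(Date(t), hi)
    Second(clamp(t, a, b) - a)
end

# O(1) total: full days in closed form, plus/minus the two fractional end days
function working_seconds_o1(t1::DateTime, t2::DateTime)
    full = Second(prefix_seconds(Date(t2)) - prefix_seconds(Date(t1)))
    full + partial_seconds(t2) - partial_seconds(t1)
end

working_seconds_o1(DateTime(2021, 11, 25, 19, 30), DateTime(2021, 11, 26, 8, 30))
# reproduces the 60-minute example from above: Second(3600)
```

Holidays could then be handled as a correction pass over just the holidays falling inside the interval, which is typically a far shorter list than the days themselves.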

1 Like

Ok, I think this one is better - I got nerdsniped:

Now I calculate one vector with a cumsum of each day's full business hours in seconds. This means each business-hours interval has to be looked up only once. Then later, I can just look up differences of full-day intervals in that vector and only have to compute the fractional parts separately. I didn't spend much time checking this except with the one example from above, but it runs in about 2.5 ms for 40,000 items. One could also add more complicated logic for the business hours; I went with the simple example from above, without holidays etc.

```julia
using Dates
using Intervals
using DataFrames

function business_hours(date)
    weekday = dayofweek(date)
    if 1 <= weekday <= 5
        DateTime(date, Time(8, 00)) .. DateTime(date, Time(20, 00))
    else
        DateTime(date, Time(10, 00)) .. DateTime(date, Time(18, 00))
    end
end

function business_seconds_lookup(d1s, d2s)
    # find first and last dates
    mi, ma = extrema([extrema(d1s)..., extrema(d2s)...])
    dmi = Date(mi)
    dma = Date(ma)
    all_days = dmi:Day(1):dma
    # query each day's business hours once
    all_time_intervals = business_hours.(all_days)
    # accumulate durations in seconds over all days
    # durations between two full days can then be computed with two lookups and a difference
    cumulative_business_seconds = cumsum(Second(span(int)) for int in all_time_intervals)
    map(d1s, d2s) do d1, d2
        interval = d1 .. d2
        # compute day indices for lookup
        i1 = Dates.days(Date(d1) - dmi) + 1
        i2 = Dates.days(Date(d2) - dmi) + 1
        # compute durations on full days (strictly between the endpoint days) by direct lookup
        full_days_seconds = i2 > i1 + 1 ?
            cumulative_business_seconds[i2 - 1] - cumulative_business_seconds[i1] :
            Second(0)
        # fractional first day computed via intersection
        first_day_seconds = Second(span(intersect(all_time_intervals[i1], interval)))
        total = first_day_seconds + full_days_seconds
        # avoid double dipping if both times are on the same day
        if i2 > i1
            last_day_seconds = Second(span(intersect(all_time_intervals[i2], interval)))
            total += last_day_seconds
        end
        total
    end
end
```
```julia
julia> df = DataFrame(
           date1 = rand(DateTime(2018):Day(1):DateTime(2019), 40_000) .+ Second.(rand.(Ref(1:86400))),
           date2 = rand(DateTime(2020):Day(1):DateTime(2021), 40_000) .+ Second.(rand.(Ref(1:86400)))
       );
```