In my use case below, Dates.format is rather slow when applied to a vector. I was wondering if there is a better way to broadcast this.
I am well aware that my custom function uses ‘additional knowledge’ (few unique values), so the comparison may not be entirely fair. Still, I am surprised by how slowly the broadcasted version runs.
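using Dates

function datesformatcustom(col, fmt)
    uqv = unique(col)                   # distinct dates only
    uqvDate = Dates.format.(uqv, fmt)   # format each distinct date once
    dateDict = Dict(uqv .=> uqvDate)    # lookup table: date => formatted string
    return map(x -> dateDict[x], col)
end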
sze = 1_000_000; # also timed with sze = 10_000_000
dts = Date(2000,1,1) .+ Dates.Day.(trunc.(Int,1500 * rand(sze)));
@time Dates.format.(dts,"yyyymm"); # 20 seconds for 1 million rows / 260 seconds for 10 million rows
@time datesformatcustom(dts,"yyyymm"); # 0.3 seconds
Your implementation is very clever and will be efficient when the ratio of unique dates to total length is very small, which is the case in your example.
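To make the point about the ratio concrete, here is a minimal sketch of a single-pass variant that caches each formatted value the first time it is seen. The name `format_memoized` is made up for this illustration, and the sample size is kept small on purpose; it is not a benchmark.

```julia
using Dates

# Hypothetical helper: format each distinct value once, in a single pass.
function format_memoized(col, fmt)
    cache = Dict{eltype(col), String}()
    # get! only calls Dates.format when the date is not in the cache yet
    return [get!(() -> Dates.format(d, fmt), cache, d) for d in col]
end

dts = Date(2000, 1, 1) .+ Dates.Day.(rand(0:1499, 10_000))
ratio = length(unique(dts)) / length(dts)   # the smaller this is, the more caching helps
res = format_memoized(dts, "yyyymm")
```

When almost every date is distinct, the cache saves little and only adds the cost of the dictionary lookups.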
Date formatting does actually seem kind of slow, to be honest. I’m sure that could be optimized and sped up. But I wouldn’t expect broadcasting to do anything as clever as what your function does.
Well, I know nothing about date parsing or manipulation.
But in the case above, the data is already parsed as a date (thus the information is “readily available in a practical format”). Even this line is considerably faster than the broadcasted Dates.format:
@time alternative = string.(year.(dts) * 100 .+ month.(dts)) # 2.7 seconds, about 6-7 times faster than the broadcasted Dates.format
EDIT: I am well aware that this will fail for years < 1000 (and maybe years > 9999 if that exists)
I guess Dates.format is slow because it can handle a wide variety of formats, while the format I request here is actually very simple.
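As a rough sketch of what a formatter specialised to this one layout could look like, the helper below zero-pads the year and month, which avoids the failure for years below 1000 mentioned in the edit above. The name `yyyymm` is hypothetical, and years outside 0 to 9999 are not treated specially.

```julia
using Dates

# Hypothetical helper: fixed "yyyymm" layout with zero padding.
function yyyymm(d::Date)
    y, m = Dates.yearmonth(d)                        # (year, month) as integers
    return string(y; pad = 4) * string(m; pad = 2)   # pad to 4 and 2 digits
end

formatted = yyyymm.([Date(2000, 1, 5), Date(987, 12, 1)])
```

This trades generality for speed: it can only ever produce this one format.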
Pass a DateFormat object (for example dateformat"yyyymm") instead of a format string. See the docs:
Creating a DateFormat object is expensive. Whenever possible, create it once and use it many times or try the dateformat"" string macro. Using this macro creates the DateFormat object once at macro expansion time and reuses it later. See @dateformat_str.
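A minimal sketch of what the quoted docs describe, assuming the format is fixed and known up front. The helper name `format_all` is made up for this illustration; `dateformat"..."` and `Dates.format(date, ::DateFormat)` are part of the Dates standard library.

```julia
using Dates

# Build the DateFormat once instead of re-parsing the string "yyyymm"
# for every element of the vector.
fmt = dateformat"yyyymm"

# Hypothetical helper: format every date with the prebuilt DateFormat.
format_all(dates, fmt) = [Dates.format(d, fmt) for d in dates]

out = format_all([Date(2000, 1, 1), Date(2003, 11, 30)], fmt)
```

A comprehension is used here rather than broadcasting so the sketch does not depend on how a DateFormat argument is treated by broadcast in a given Julia version.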
Thank you, Kristoffer.
I actually briefly thought about that, but did not try it out.
Indeed, this results in 4 seconds for 1 million entries, which is fast enough.