DateFrame.jl slow, convert one string-column to UTC time (Int)

package

#1

I have one DataFrame with a column of Time in the format of String. I want to convert this column to UTC time in second (Int). The code is as follows. However, it is quite slow. Could anyone give some advice to improve the speed? Thanks!

function string2UTC(x::String)
    t = -1
    try
        td = Dates.DateTime(x, "yyyy-mm-dd HH:MM:SS")
        t = convert(Int, datetime2unix(td) )
    catch 
        return Missing
    end
    return t 
end

@time df_ais[:t] = string2UTC.( df_ais[:vmim_timestamp] )

The table has only 3,233,003 rows. The operation takes 207.216866 seconds (556.08 M allocations: 22.040 GiB, 4.91% gc time). The similar operation using Pandas + Python seems much faster.

%E6%8D%95%E8%8E%B7


#2

Since you are using the same format multiple times, it is perhaps better to use a DateFormat object, eg

dateformat = Dates.DateFormat("yyyy-mm-dd HH:MM:SS")

then parse with

Dates.DateTime(x, dateformat)

Also note that tryparse has a method for parsing dates (via DateFormat), but that will still give a Nullable in v0.6.3, so you have to convert to missing manually.

Finally, perhaps you want to return missing (the value), not the type Missing.


#3

Hi, @Tamas_Papp Thank you so much for your advice. I am using Julia v0.6.2 and the converted time column type is Unin{Missing, Int64} .
I have modified the code, however, there is no speed increase.

function string2UTC2(x::String) :: Union{Missing,Int}
    dateformat = Dates.DateFormat("yyyy-mm-dd HH:MM:SS")
    t = Missing
    try
        td = Dates.DateTime(x, dateformat)
        t = convert(Int, datetime2unix(td) )
    catch 
        return Missing
    end
    return t 
end

Running time: 227.854824 seconds (562.55 M allocations: 22.137 GiB, 5.23% gc time)


#4

Possibly because you did not do what I suggested? Eg (to make an MWE)

using Base.Dates
using Missings

N = 3233003
dform = DateFormat("yyyy-mm-dd HH:MM:SS")

dates = [Dates.format(DateTime(rand(1900:2100), rand(1:12), rand(1:28),
                               rand(0:23), rand(0:59), rand(0:59)), dform)
         for _ in 1:N];

function datestring2UTC(str, dform)
    maybedate = tryparse(DateTime, str, dform)
    isnull(maybedate) ? missing : convert(Int, datetime2unix(get(maybedate)))
end

datestring2UTC(dates[1], dform) # to compile

then

julia> @time datestring2UTC.(dates, dform);
  0.836423 seconds (12.93 M allocations: 320.658 MiB, 32.44% gc time)

#5

Also, in v0.7,

using Dates
N = 3233003
dform = DateFormat("yyyy-mm-dd HH:MM:SS")
dates = [Dates.format(DateTime(rand(1900:2100), rand(1:12), rand(1:28),
                               rand(0:23), rand(0:59), rand(0:59)), dform)
         for _ in 1:N];

function datestring2UTC(str, dform)
    date = tryparse(DateTime, str, dform)
    date == nothing ? missing : convert(Int, datetime2unix(date))
end

then

julia> @time datestring2UTC.(dates, dform);
  0.498187 seconds (14 allocations: 24.666 MiB, 1.38% gc time)

#6

Does it make any difference to change dform to a const?


#7

I would be surprised if it made any difference, since I am passing it as an argument to the broadcasted function.


#8

@Tamas_Papp Thanks for your help! It runs quite fast now! The reason of the slow speed is caused by the definition of the local variable “dform” within the function. The code is as follows.

dform = DateFormat("yyyy-mm-dd HH:MM:SS")
function string2UTC(str::String, dform)    
    date = tryparse(DateTime, str, dform)
    #date == nothing ? Missing : convert(Int, datetime2unix( date ) )
    date == nothing ? Missing : convert(Int, datetime2unix( get(date) ) )    
end
@time df_ais[:t] = string2UTC.( df_ais[:vmim_timestamp], dform )

The running time is 2.617471 seconds (9.71 M allocations: 271.882 MiB, 69.95% gc time)


#9

@ScottPJones There is no significant speed increase by change dform to a const. Thank you for your reply! Defining a local variable slows the speed by hundreds of times. Feel surprised!

The const variable version is as follows.

function string2UTC2(str::String)    
    const dform = DateFormat("yyyy-mm-dd HH:MM:SS")
    date = tryparse(DateTime, str, dform)
    #date == nothing ? Missing : convert(Int, datetime2unix( date ) )
    date == nothing ? Missing : convert(Int, datetime2unix( get(date) ) )    
end
@show string2UTC2( "2017-09-01 03:48:46" )

@time df_ais[:t] = string2UTC2.( df_ais[:vmim_timestamp] )

The running time is 202.683401 seconds (552.85 M allocations: 22.089 GiB, 4.97% gc time)


#10

I hadn’t looked to see how Julia would optimize that, if you made dform a const, and didn’t pass it do datestring2UTC.


#11

Not quite, it is caused by DateFormat constructing a parser object. The idea is to do this once.


#12

@Tamas_Papp Thank you much! I never think this DateFormat object construction operation is soon time-consuming. With this operation it costs about 200 seconds, while only costs about 2 seconds without this operation. That is to say, “DateFormat()” construction function cost about 198 seconds, while “tryparse” and “convert” functions cost only 2 seconds. Also feel surprised!


#13

Yes, I understand that that is by far the largest part of it, I was just curious, if making it a const (outside of the function, and then not passing it, might make a difference).

(edit) OK, I just got out of bed and tested it, and no measurable difference.


#14

Hi @zhangliye, you still have not got all of it…
It must be:

const dform = DateFormat("yyyy-mm-dd HH:MM:SS") #### !!!!!!!!!!!!! const outside
function string2UTC2(str::String) 
    date = tryparse(DateTime, str, dform)
    date == nothing ? #=!!!!=# missing #=!!!!=# : convert(Int, datetime2unix( get(date) ) ) 
   ##### !!!!!!!!!!!!!! `Missing` is the type, `missing` is the instance of that ype (like `true` is an instance of `Bool`)
end
@show string2UTC2( "2017-09-01 03:48:46" )

@time df_ais[:t] = string2UTC2.( df_ais[:vmim_timestamp] )

#15

Thanks! It runs fast when the DateFormat object is defined out side the function no matter whether DateFormat is const.