Yes, I try to stick to released versions, so my numbers were for DataFrames 0.14.1. May I ask what the public release policy is? (Does master become stable relatively quickly?)
These are the numbers from the other thread that were so incredibly slow (300 seconds) when all columns were Union{Missing,*}.
I do not believe that this helps any longer, given the other thread, but here is what I wrote to replicate a CRSP-like data set (before I realized that it was a more basic vector problem with Missing):
using DataFrames, Serialization, Missings
import Serialization.serialize
serialize( filename::AbstractString, d::DataFrame )= open(filename, "w") do ofile; serialize(ofile, d); end;
import Serialization.deserialize
deserialize( filename::AbstractString )= open( deserialize, filename, "r" );  # open(f, ...) returns the result of f
work= (  # each entry: ( column name, integer range or Float64 mean [, number of values to set missing] )
( "permno", 10000:99999 ),
( "yyyymmdd", 19260101:20161230 ),
( "prc", 22.0 ),
( "vol", 0:1897900032, 6350829 ),
( "ret", 0.0008 ),
( "shrout", 0:29206400 ),
( "openprc", 35.19, 38619189 ),
( "numtrd", 0:1030000, 60751337 ),
( "retx", 0.008 ),
( "vwretd", 0.0004 ),
( "ewretd", 0.0008 ),
( "eom", 0:1 )
);
const N= 88915607
df= DataFrame()
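for s in work
    # integer ranges are sampled uniformly; a Float64 is used as the mean of standard normal draws
    b= (s[2] isa UnitRange{Int64}) ? rand(s[2], N) : ( randn( N ) .+ s[2] )
    b= allowmissing( b )  ## degrades deserialize performance: alloc will go from 27GB to 85GB, time from 60s to 300s
    if (length(s) == 3)
        b= allowmissing( b )  # redundant here; needed only when the line above is commented out
        b[ rand( 1:N, s[3] ) ] .= missing  # set s[3] randomly drawn positions (with replacement) to missing
    end
    df[ Symbol(s[1]) ]= b  # DataFrames 0.14 syntax; newer releases use df[!, Symbol(s[1])]
end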
serialize( "crsplike-allallowmissing.jls", df )
println("written jls file")
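
For anyone who wants to check the effect without building the full 89-million-row data set, a minimal sketch along these lines might help. It uses only the Serialization standard library and an in-memory buffer instead of a file. The names `roundtrip`, `n`, `plain`, and `withmissing` are made up for illustration, and `convert(Vector{Union{Missing,Float64}}, ...)` stands in for `allowmissing`. Timings will vary with Julia version and machine, so I am not quoting any here.

```julia
using Serialization

# Hypothetical helper (not from the script above): serialize a value into an
# in-memory buffer, then time only the deserialize step.
function roundtrip(x)
    io = IOBuffer()
    serialize(io, x)
    seekstart(io)                      # rewind so deserialize reads from the start
    t0 = time_ns()
    y = deserialize(io)
    seconds = (time_ns() - t0) / 1e9
    return y, seconds
end

n = 1_000_000                          # assumed small size; the script above uses N = 88915607
plain = randn(n)                       # Vector{Float64}
# Same element type that allowmissing(plain) would give, using Base only
withmissing = convert(Vector{Union{Missing,Float64}}, plain)

# Warm-up calls so that compilation time is not counted below
roundtrip(plain)
roundtrip(withmissing)

plain_copy, t_plain = roundtrip(plain)
missing_copy, t_missing = roundtrip(withmissing)

println("Vector{Float64}:                deserialize took ", t_plain, " s")
println("Vector{Union{Missing,Float64}}: deserialize took ", t_missing, " s")
```

Neither vector contains an actual `missing` value, so any difference between the two timings comes from the element type alone, which is the "more basic vector problem" mentioned above.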