Reading Data Is Still Too Slow

[re: davidanthoff] Thanks, David, for the Feather correction. Feather.materialize did not work for me either, but FeatherFiles did. It took about 130 seconds, which is better than the 300-second deserialization time and the 600-second CSV.read time, but still seems poor: R takes about 30 seconds.
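For concreteness, this is roughly how I timed the three paths; a minimal sketch, with hypothetical file names standing in for my local CRSP extract:

using CSV, DataFrames, Serialization, FeatherFiles

@time df1 = open(deserialize, "crsp.jls")         ## Julia-native deserialization: ~300s for me
@time df2 = CSV.read("crsp.csv")                  ## ~600s for me
@time df3 = load("crsp.feather") |> DataFrame     ## FeatherFiles.jl: ~130s for me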

I also just posted my serialization experience in De-Serialization Performance.

1 Like

@iwelch was that with the master branch of DataFrames.jl, or the last released version?

Could you maybe post the output of running

eltypes(your_dataframe)

here? And how many rows the DataFrame has? I think that would be enough for us/me to start trying to replicate things with fake data.
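Something like this would do it (with df standing in for your actual DataFrame):

using DataFrames

eltypes(df)   ## element type of each column, e.g. Union{Missing, Float64}
nrow(df)      ## number of rows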

@iwelch earlier linked to the Julia Cookbook, which includes a snippet of the first 10 lines of the file. I don't know how helpful that is, but I wanted to point it out.

Ah, perfect, I think that is all I needed! Let's dig in :slight_smile:

Actually, the output of eltypes would still be helpful, just to be sure I get the right column types.

I tried to load a feather file now that seems roughly similar to the CRSP file described here: 90 million rows, a mix of Int and Float64 columns. The file is about 8.5 GB on disc.

With the current released version of DataFrames.jl, it takes about 80 seconds to load on my system (I do have a very fast system :slight_smile: ). When I use the master branch of DataFrames.jl, it takes somewhere between 7 and 15 seconds to load. All of these numbers are for FeatherFiles.jl.

So I suspect (or hope :wink: ) that @iwelch's numbers from above are with the released DataFrames.jl, in which case we might actually have something very competitive once we get a new DataFrames.jl release out.

The caveat is that I haven't tried a column with missing values yet.

2 Likes

Yes, I try to stick to released versions, so my numbers were for DataFrames 0.14.1. May I ask what the public release policy is? (Does master become stable relatively quickly?)

The numbers that were so incredibly slow (300 seconds) were for the case where all columns were Union{Missing,*}, from the other thread.

I do not believe this helps any longer, given the other thread, but here is what I wrote to replicate a CRSP-like data set (before I realized that the problem was a more basic vector issue with Missing):


using DataFrames, Serialization, Missings

## convenience methods: serialize/deserialize a DataFrame directly to/from a file name
import Serialization: serialize, deserialize
serialize( filename::AbstractString, d::DataFrame )= open( ofile->serialize(ofile, d), filename, "w" );
deserialize( filename::AbstractString )= open( deserialize, filename, "r" );


## each entry: ( column name, Int range to draw from OR Float64 level to add noise around [, number of missing observations] )
work= (
        ( "permno", 10000:99999 ),
        ( "yyyymmdd", 19260101:20161230 ),
        ( "prc", 22.0 ),
        ( "vol", 0:1897900032, 6350829 ),
        ( "ret", 0.0008 ),
        ( "shrout", 0:29206400 ),
        ( "openprc", 35.19, 38619189 ),
        ( "numtrd", 0:1030000, 60751337 ),
        ( "retx", 0.008 ),
        ( "vwretd", 0.0004 ),
        ( "ewretd", 0.0008 ),
        ( "eom", 0:1 )
       );

const N= 88915607

df= DataFrame()
for s in work
    ## Int ranges become uniform integer draws; Float64 levels become normal noise around that level
    b= (typeof(s[2])==UnitRange{Int64}) ? rand(s[2], N) : ( randn( N ) .+ s[2] )

    b= allowmissing( b )   ## degrades deserialize performance: alloc will go from 27GB to 85GB, time from 60s to 300s

    if (length(s) == 3)    ## third entry gives the number of observations to blank out
        b= allowmissing( b )
        b[ rand( 1:N, s[3] ) ] .= missing
    end

    df[ Symbol(s[1]) ]= b
end

serialize( "crsplike-allallowmissing.jls", df )

println("written jls and csv files")

@iwelch That is very helpful! I took your code to generate a DataFrame. I then saved it as a feather file with FeatherFiles.jl: df |> save("bar.feather"). I then quit Julia and restarted, just to make sure nothing was hanging around.

I then loaded the file again with:

using FeatherFiles, DataFrames

@time df = load("bar.feather") |> DataFrame;

The first time this takes 17 seconds. I then made a copy of the file on disc and read that copy in, which takes 9 seconds. So I think the difference between the 17 and the 9 seconds is probably the compile time for the two packages, not actual read time, and I assume one only has to pay it once per Julia session. So we get roughly 10 seconds to read this 8 GB file.
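(One way to separate compile time from read time is simply to time the same load twice in one session; a sketch:)

using FeatherFiles, DataFrames

@time df = load("bar.feather") |> DataFrame;   ## first call: includes package compile time
@time df = load("bar.feather") |> DataFrame;   ## later calls: (mostly) pure read time

Reading an on-disc copy instead, as above, additionally avoids the second timing being served from the operating system's file cache.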

I wanted to compare the speed with R, but I can't even get feather installed on R…

I did compare the load speed for the same file with pandas, with this script:

import pandas as pd
import time

t0 = time.time()
pd.read_feather("bar.feather")
t1 = time.time()

total = t1-t0
print(total)

t0 = time.time()
pd.read_feather("bar2.feather")
t1 = time.time()

total = t1-t0
print(total)

I consistently get 10 seconds for that, for both load attempts.

So essentially Julia seems to have the same performance as pandas, once the code in DataFrames.jl and FeatherFiles.jl is actually compiled.

I don't know :slight_smile: I actually asked whether we could get a release of DataFrames.jl soon; the answer I got is this. I also don't know what kind of policy they have for the stability of master…

3 Likes

Perhaps not directly relevant, but when doing storage benchmarking work with packages such as iozone https://www.iozone.org and ior https://github.com/llnsl/ior, it is common to select the flag that says "fsync to disk", and, in these days of servers with a lot of RAM, to use a file size greater than the RAM size to avoid caching.

Duuuh - lightbulb moment. If we use a huge file here we will end up benchmarking the storage performance, not the Julia writing logic.

When it's 10 seconds for a 1 GB file, we no longer need to worry about whether the IO speed is due to memory or disk speed. We are far beyond disk/memory constraints, and far into Julia plumbing issues.

If this had been between 0.2 and 1.0 seconds, then we would have had to worry about whether we were benchmarking memory, disk, or Julia plumbing, and then we could have gotten clever about it.

Just wondering if this slow data importing issue will be solved when Julia 1.2 is released.

Pretty sure it's less of a Julia issue than a CSV.jl / DataFrames.jl issue.

I thought that, to add multi-threading functionality to data import, the package developers need to have PARTR ready first.

Note that since the 0.5 release of CSV.jl (announced here), many of the previous performance issues have been resolved (and in some cases, CSV.jl is faster than any other language's library). Multi-threading support in Base will be released with 1.3, and I'm planning on working on supporting that in CSV.jl as well.
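For anyone re-running the comparison, the basic read is just this (a sketch; the file name is a placeholder):

using CSV, DataFrames

@time df = CSV.read("crsp.csv")                  ## reads straight into a DataFrame
@time df = CSV.File("crsp.csv") |> DataFrame     ## equivalent: materialize a CSV.File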

20 Likes

Please feel free to reuse the BlockIO logic that's being used as part of this JuliaDB PR to get a head start on that implementation. It breaks CSV files into approximately evenly sized "blocks" which could then be read in parallel.
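Not the actual BlockIO code, but a minimal sketch of the idea, assuming the simple case of no quoted newlines: pick approximately even byte offsets, then snap each block boundary forward to the next newline so every block holds whole rows and can be parsed by a separate task:

## sketch: split a file into nblocks byte ranges that each end on a newline boundary
function csv_blocks(path::AbstractString, nblocks::Integer)
    sz = filesize(path)
    ranges = UnitRange{Int}[]
    open(path, "r") do io
        start = 1
        for i in 1:nblocks
            stop = i == nblocks ? sz : (sz ÷ nblocks) * i
            if stop < sz
                seek(io, stop)          ## jump to the candidate boundary (byte count consumed so far)
                readline(io)            ## advance past the rest of the current row
                stop = position(io)     ## the block now ends exactly after a newline
            end
            if stop >= start            ## skip blocks that a previous boundary already swallowed
                push!(ranges, start:stop)
                start = stop + 1
            end
        end
    end
    return ranges
end

blocks = csv_blocks("crsp.csv", Threads.nthreads())
## each range could then be handed to its own task, e.g. via Threads.@threads;
## a real reader also has to deal with the header row and quoted fields containing newlines.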

You are a legend!

3 Likes