Reading Data Is Still Too Slow

iwelch · November 24, 2018, 7:49pm

[re: davidanthoff] Thanks, David, for the Feather correction. Feather.materialize did not work for me, either, but FeatherFiles did. It took about 130 seconds to load. This is better than the 300-second deserialization time and the 600-second CSV.read time. However, this still seems poor. R takes about 30 seconds.

I also just posted my serialization experience in De-Serialization Performance.

davidanthoff · November 24, 2018, 9:28pm

@iwelch was that with the master branch of DataFrames.jl, or the last released version?

Could you maybe post the output of running

eltypes(your_dataframe)

here? And how many rows the DataFrame has? I think that would be enough for us/me to start trying to replicate things with fake data.

Jordan_Cluts · November 24, 2018, 9:45pm

@iwelch earlier linked to the Julia Cookbook which includes a snippet of the first 10 lines of the file. I don’t know how helpful that is but wanted to point it out.

davidanthoff · November 24, 2018, 9:50pm

Ah, perfect, I think that is all I needed! Let’s dig in.

davidanthoff · November 24, 2018, 10:41pm

Actually, the output of eltypes would still be helpful, just to be sure I get the right column types.

davidanthoff · November 24, 2018, 11:47pm

I tried to load a feather file now that seems roughly similar to the CRSP file described here: 90 million rows, a mix of Int and Float64 columns. The file is about 8.5 GB on disc.

With the current released version of DataFrames.jl, it takes about 80 seconds to load on my system (I do have a very fast system). When I use the master branch of DataFrames.jl, it takes somewhere between 7 and 15 seconds to load. All of these numbers are for FeatherFiles.jl.

So I suspect (or hope) that @iwelch’s numbers from above are with the released DataFrames.jl, in which case we might actually have something very competitive once we get a new DataFrames.jl release out.

Caveat is that I haven’t tried a column with missing values yet.

iwelch · November 25, 2018, 2:57am

yes, I try to stick to released versions, so my numbers were for DataFrames 0.14.1. May I ask what the public release policy is? (does master become stable relatively quickly?)

my numbers that were so incredibly slow (300 seconds) were for the case where all columns were Union{Missing,*}, as described in the other thread.

I do not believe that this helps any longer, given the other thread, but here is what I wrote to replicate a CRSP-like data set (before I realized that it was a more basic vector problem with Missing):


using DataFrames, Serialization, Missings

## convenience methods that (de)serialize directly to/from a file name
import Serialization: serialize, deserialize
serialize( filename::AbstractString, d::DataFrame )= open(filename, "w") do ofile; serialize(ofile, d); end;
deserialize( filename::AbstractString )= open(deserialize, filename, "r");


work= (
        ( "permno", 10000:99999 ),
        ( "yyyymmdd", 19260101:20161230 ),
        ( "prc", 22.0 ),
        ( "vol", 0:1897900032, 6350829 ),
        ( "ret", 0.0008 ),
        ( "shrout", 0:29206400 ),
        ( "openprc", 35.19, 38619189 ),
        ( "numtrd", 0:1030000, 60751337 ),
        ( "retx", 0.008 ),
        ( "vwretd", 0.0004 ),
        ( "ewretd", 0.0008 ),
        ( "eom", 0:1 )
       );

const N= 88915607

df= DataFrame()
for s in work
    ## integer ranges are sampled uniformly; float columns are normal noise around s[2]
    b= (s[2] isa UnitRange) ? rand(s[2], N) : ( randn( N ) .+ s[2] )

    b= allowmissing( b )   ## degrades deserialize performance: alloc will go from 27GB to 85GB, time from 60s to 300s

    if (length(s) == 3)
        b= allowmissing( b )
        ## s[3] is the number of entries to blank out (random indices may repeat)
        b[ rand( 1:N, s[3] ) ] .= missing
    end

    df[ Symbol(s[1]) ]= b
end

serialize( "crsplike-allallowmissing.jls", df )

println("written jls file")

davidanthoff · November 25, 2018, 3:58am

@iwelch That is very helpful! I took your code to generate a DataFrame. I then saved it as a feather file with FeatherFiles.jl: df |> save("bar.feather"). I then quit julia and restarted, just to make sure nothing is hanging around.

I then loaded the file again with:

using FeatherFiles, DataFrames

@time df = load("bar.feather") |> DataFrame;

The first time this takes 17 seconds. I then made a copy of the file on disc, and read that copy in, and that takes 9 seconds. So I think the difference between the 17 and 9 seconds is probably the compile time for the two packages, not actually read time, and I assume one only ever has to pay it once per julia session. So we get roughly 10 seconds to read this 8 GB file.

I wanted to compare the speed with R, but I can’t even get feather installed on R…

I did compare the load speed for the same file with pandas, with this script:

import pandas as pd
import time

t0 = time.time()
pd.read_feather("bar.feather")
t1 = time.time()

total = t1-t0
print(total)

t0 = time.time()
pd.read_feather("bar2.feather")
t1 = time.time()

total = t1-t0
print(total)

I consistently get 10 seconds for that, for both load attempts.

So essentially julia seems to have the same performance as pandas once the code in DataFrames.jl and FeatherFiles.jl is actually compiled.
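
The first-load versus second-load comparison above can be made explicit with a small timing helper. This is only a sketch in Python: `time_calls`, `summarize` and `fake_load` are hypothetical names, and `fake_load` is a stand-in workload so that the snippet runs without a data file.

```python
import time

def time_calls(load, repeats=3):
    """Call `load` `repeats` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        load()
        durations.append(time.perf_counter() - t0)
    return durations

def summarize(durations):
    """Separate the first call from the best of the later calls."""
    first = durations[0]
    steady = min(durations[1:]) if len(durations) > 1 else first
    return {"first": first, "steady": steady, "overhead": first - steady}

# Stand-in workload (an assumption, not a real file read).
def fake_load():
    return sum(range(100_000))

timings = time_calls(fake_load, repeats=3)
print(summarize(timings))
```

In practice one would pass something like `lambda: pd.read_feather("bar.feather")` instead of `fake_load`. Note that repeated reads of the same file can also be served from the operating system's file cache, so the "overhead" figure mixes one-time start-up cost with caching effects unless a fresh copy of the file is read each time.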

I don’t know. I actually asked whether we could get a release of DataFrames.jl soon; the answer I got is this. I also don’t know what kind of policy they have for the stability of master…

johnh · November 25, 2018, 6:48am

Perhaps not directly relevant, but when doing storage benchmarking work with packages such as iozone https://www.iozone.org and ior https://github.com/llnsl/ior it is common to select the flag which says “fsync to disk” and, in these days of servers with a lot of RAM, to use a file size greater than the RAM size to avoid caching.

Duuuh - lightbulb moment. If we use a huge file here we will end up benchmarking the storage performance, not the Julia writing logic.
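
For anyone who wants to try the “fsync to disk” idea outside of iozone or ior, here is a rough sketch in Python. The function names and the tiny file size are my own assumptions for illustration; a real benchmark would write far more data than this.

```python
import os
import tempfile

def write_and_sync(path, payload, chunks):
    """Write `payload` `chunks` times, then force the data out to the device."""
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(payload)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its cached pages to storage
    return os.path.getsize(path)

def exceeds_ram(file_bytes, ram_bytes):
    """True if the file is too large to be held entirely in the page cache."""
    return file_bytes > ram_bytes

with tempfile.TemporaryDirectory() as tmp:
    size = write_and_sync(os.path.join(tmp, "bench.bin"), b"\0" * 1024, chunks=4)
    print(size)
```

Without the `os.fsync` call, a timed write may only measure how fast data lands in the OS cache, and a file smaller than RAM may be read back without touching the disk at all.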

iwelch · November 25, 2018, 4:35pm

when it’s 10 seconds for a 1GB file, we no longer need to worry about whether IO speed is due to memory or disk speed. we are far beyond disk/memory constraints, and far into julia plumbing issues.

if this had been between 0.2 and 1.0 seconds, then we would have had to worry about whether we are benchmarking memory, disk, or julia plumbing. and then we can get clever about it.

Yifan_Liu · July 10, 2019, 2:20pm

Just wondering if this slow data importing issue will be solved when Julia 1.2 is released.

jling · July 10, 2019, 3:12pm

pretty sure it’s less of a Julia issue than a CSV, DataFrame issue.

Yifan_Liu · July 10, 2019, 3:29pm

I thought that, to add multi-threading to data import, the package developers need to have PARTR ready first.

quinnj · August 2, 2019, 1:57pm

Note that since the 0.5 release of CSV.jl (announced here), many of the previous performance issues have been resolved (and in some cases, CSV.jl is faster than any other language library). Multi-threading support in Base will be released with 1.3, and I’m planning on working on supporting that in CSV.jl as well.

jpsamaroo · August 2, 2019, 3:16pm

Please feel free to reuse the BlockIO logic that’s being used as part of this JuliaDB PR to get a headstart on that implementation. It breaks CSV files into approximately evenly-sized “blocks” which could then be read in parallel.
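
To illustrate the general idea of block-wise reading (this is not the BlockIO code from that PR, just a sketch in Python under my own assumptions), one can cut a file into byte ranges that end on line boundaries and hand each range to a worker. The names `block_ranges`, `count_lines` and `parallel_line_count` are hypothetical, and the sketch assumes no quoted field contains an embedded newline.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def block_ranges(path, nblocks):
    """Return (start, end) byte ranges of roughly equal size that end on line boundaries."""
    size = os.path.getsize(path)
    target = max(1, size // nblocks)
    ranges, start = [], 0
    with open(path, "rb") as f:
        while start < size:
            f.seek(min(start + target, size))
            f.readline()                 # advance to the end of the current line
            end = min(f.tell(), size)
            ranges.append((start, end))
            start = end
    return ranges

def count_lines(path, byte_range):
    """Stand-in for parsing: count the newline characters inside one block."""
    start, end = byte_range
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")

def parallel_line_count(path, nblocks=4):
    """Process every block in a thread pool and combine the per-block results."""
    ranges = block_ranges(path, nblocks)
    with ThreadPoolExecutor(max_workers=nblocks) as pool:
        return sum(pool.map(lambda r: count_lines(path, r), ranges))

# Small demo file: one header line plus 1000 data rows.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.csv")
    with open(demo, "w", newline="") as f:
        f.write("permno,prc\n")
        for i in range(1000):
            f.write(f"{10000 + i},{i * 0.5}\n")
    demo_ranges = block_ranges(demo, 4)
    demo_total = parallel_line_count(demo, 4)
print(len(demo_ranges), demo_total)
```

A real reader would parse each block into columns instead of counting newlines, treat the header row in the first block separately, and then concatenate the per-block results in order.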

xiaodai · August 2, 2019, 10:49pm

You are a legend!
