Reading Data Is Still Too Slow

Converting the object that readsas returns to a DataFrame is very fast and efficient: less than half a second for that particular table.

To be clear, I don’t have a comparison handy at the moment. The 30 seconds on my machine could be much slower on yours, although I doubt it’d be that much slower. (I did just get a nice new NVMe SSD :grin:).

In general I’ve seen good performance from SASLib and that makes me happy because it makes the transition between SAS and Julia much easier.

Isn’t SASLib.jl a read-only package?

(It would also be pretty annoying for Julia to have to resort to the ancient SAS format as its main storage format… but I guess it is better than no fast storage format.)

If loading speed is a concern, CSV is probably suboptimal; Feather should be much faster to load. Have you tried using it?
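For instance, a minimal sketch (assuming Feather.jl is installed; the file name and columns here are made up):

```julia
using DataFrames, Feather

# Write a small DataFrame once, then time the read back.
df = DataFrame(a = rand(10^6), b = rand(1:100, 10^6))
Feather.write("demo.feather", df)
@time df2 = Feather.read("demo.feather")
```

The gap versus CSV only really shows up on much larger files, of course.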

3 Likes

yes, csv is suboptimal. I had hoped that serialization would have been faster, given some past tests.

alas, piever, you are completely correct. Feather is the way to go:

julia> @time x=Feather.read("crspdaily-clean.feather")
  1.500487 seconds (4.97 M allocations: 251.940 MiB, 5.56% gc time)
88915607×12 DataFrame. Omitted printing of 1 columns

I will change my recommendation in http://julia.cookbook.tips/doku.php?id=fileformats . So, this is great news.

(PS: I will mark this post as the answer, although piever provided it, because it has the timing numbers.)

My research back then still seems somewhat relevant: Benchmarking ways to write/load DataFrames IndexedTables to disk

2 Likes

It’s quite unexpected that deserialization is slower than CSV… Can you file an issue against Julia with a minimal reproducer? Ideally you’d provide a command to generate a data frame with random values. Or even better, if you can reproduce the problem without DataFrames, just serializing/deserializing the column vectors directly.

1 Like

hi milan, I am happy to help, but this file is pretty large and experimentation is slow. Plus, the data is proprietary, so it is difficult to share. If someone from Julia Computing wants to try it on my machine (it sits on the internet), I can arrange access. regards, /iaw

The timing from Feather.read is a bit misleading, though, because it doesn’t actually read any of the data from disc. All it does is read the metadata (i.e. which columns are in the file and what types they have), and then it creates a DataFrame that loads things from disc on demand when you actually access the values. So you only pay the price of loading from disc once you start to access the data (e.g. by indexing into the DataFrame). Of course, at that point these kinds of data access operations will be much, much slower than with a normal DataFrame (because they actually include a trip to the disc).

That can all be nice if one knows about it, but it means you want to write very different algorithms and use special access patterns when you have one of these special DataFrames. I also think manipulating such a DataFrame probably won’t work properly.
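A sketch of what this behavior looks like in practice (assuming the Feather.jl and DataFrames.jl APIs as of this thread; the file and column names are made up):

```julia
using Feather, DataFrames

# Write a demo file so the example is self-contained.
Feather.write("lazy-demo.feather", DataFrame(prc = rand(10^6)))

@time df = Feather.read("lazy-demo.feather")  # near-instant: metadata only
@time s = sum(df[:prc])                       # first access pays the trip to disc
```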

I think that, in theory, Feather.materialize should give you a normal DataFrame, i.e. load everything from disc into main memory. But at least on my system that does not seem to work right now. In any case, an apples-to-apples comparison of read performance would use Feather.materialize, not Feather.read.

FeatherFiles.jl is another option. You would read a file via:

using FeatherFiles, DataFrames

df = load("foo.feather") |> DataFrame

So it is the same API pattern you use for CSVFiles.jl. It will read everything fully from disc, i.e. give you a normal DataFrame.

The same caveat about overhead that I made with respect to CSVFiles.jl applies here as well: if you use the master branch of DataFrames.jl you’ll get much better performance than with the released version of DataFrames.jl.

You can find comprehensive documentation of the file IO story in Queryverse.jl over here.

3 Likes

@iwelch maybe you can also try this? 6 GB isn’t too big, and I am assuming that you are reading everything into memory.

@tk3369 We should add CSVReader.jl to https://github.com/davidanthoff/csv-comparison! Any chance you might register it in METADATA.jl? Then I could add it into the mix for the comparison.

2 Likes

What I meant is that we probably don’t care about the data, any set of columns of the same type containing random data should give similar timings. So you can write a few lines of code to create such a data frame, and if it reproduces the problem that will make it much easier to debug. If it doesn’t reproduce the problem, then that will be an interesting piece of information (meaning particular values matter).
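For what it is worth, a reproducer along those lines, without DataFrames, could be as small as this sketch (sizes and file names made up; uses only the Serialization stdlib):

```julia
using Serialization, Random

N = 10^6
plain    = randn(N)                                        # Vector{Float64}
withmiss = convert(Vector{Union{Missing,Float64}}, plain)  # same data, Union eltype

for (name, v) in (("plain", plain), ("union", withmiss))
    open(f -> serialize(f, v), "col-$name.jls", "w")       # write the column vector
    t = @elapsed out = open(deserialize, "col-$name.jls", "r")
    println("$name: $(round(t; digits=3))s, roundtrip ok = $(isequal(out, v))")
end
```

If the Union{Missing,Float64} column deserializes much more slowly than the plain one, the problem reproduces without DataFrames at all.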

3 Likes

[re: davidanthoff] Thanks, David, for the Feather correction. Feather.materialize did not work for me, either, but FeatherFiles did. It took about 130 seconds. This is better than the 300-second deserialization time and the 600-second CSV.read time. However, it still seems poor: R takes about 30 seconds.

I also just posted my serialization experience in De-Serialization Performance .

1 Like

@iwelch was that with the master branch of DataFrames.jl, or the last released version?

Could you maybe post the output of running

eltypes(your_dataframe)

here? And how many rows the DataFrame has? I think that would be enough for us/me to start trying to replicate things with fake data.

@iwelch earlier linked to the Julia Cookbook which includes a snippet of the first 10 lines of the file. I don’t know how helpful that is but wanted to point it out.

Ah, perfect, I think that is all I needed! Let’s dig in :slight_smile:

Actually, the output of eltypes would still be helpful, just to be sure I get the right column types.

I tried to load a feather file now that seems roughly similar to the CRSP file described here: 90 million rows, a mix of Int and Float64 columns. The file is about 8.5 GB on disc.

With the current released version of DataFrames.jl, it takes about 80 seconds to load on my system (I do have a very fast system :slight_smile: ). When I use the master branch of DataFrames.jl, it takes somewhere between 7 and 15 seconds to load. All of these numbers are for FeatherFiles.jl.

So I suspect (or hope :wink: ) that @iwelch’s numbers from above are with the released DataFrames.jl, in which case we might actually have something very competitive once we get a new DataFrames.jl release out.

Caveat is that I haven’t tried a column with missing values yet.

2 Likes

yes, I try to stick to released versions, so my numbers were for DataFrames 0.14.1. May I ask what the public release policy is? (Does master become stable relatively quickly?)

my numbers that were so incredibly slow (300 seconds) were from when all columns were Union{Missing,*}, as in the other thread.

I do not believe that this helps any longer, given the other thread, but here is what I wrote to replicate a CRSP like data set (before I realized that it was a more basic vector problem with Missing):


using DataFrames, Serialization, Missings

## convenience wrappers for (de)serializing a DataFrame to/from a file
import Serialization: serialize, deserialize
serialize( filename::AbstractString, d::DataFrame )= open(ofile->serialize(ofile, d), filename, "w")
deserialize( filename::AbstractString )= open(deserialize, filename, "r")


work= (
        ( "permno", 10000:99999 ),
        ( "yyyymmdd", 19260101:20161230 ),
        ( "prc", 22.0 ),
        ( "vol", 0:1897900032, 6350829 ),
        ( "ret", 0.0008 ),
        ( "shrout", 0:29206400 ),
        ( "openprc", 35.19, 38619189 ),
        ( "numtrd", 0:1030000, 60751337 ),
        ( "retx", 0.008 ),
        ( "vwretd", 0.0004 ),
        ( "ewretd", 0.0008 ),
        ( "eom", 0:1 )
       );

const N= 88915607

df= DataFrame()
for s in work
    b= (typeof(s[2])==UnitRange{Int64}) ? rand(s[2], N) : ( randn( N ) .+ s[2] )

    b= allowmissing( b )   ## degrades deserialize performance: alloc will go from 27GB to 85GB, time from 60s to 300s

    if (length(s) == 3)
        b= allowmissing( b )
        b[ rand( 1:N, s[3] ) ] .= missing
    end

    df[ Symbol(s[1]) ]= b
end

serialize( "crsplike-allallowmissing.jls", df )

println("written jls file")

@iwelch That is very helpful! I took your code to generate a DataFrame. I then saved it as a feather file with FeatherFiles.jl: df |> save("bar.feather"). I then quit julia and restarted, just to make sure nothing is hanging around.

I then loaded the file again with:

using FeatherFiles, DataFrames

@time df = load("bar.feather") |> DataFrame;

The first time this takes 17 seconds. I then made a copy of the file on disc and read that copy in, which takes 9 seconds. So I think the difference between the 17 and the 9 seconds is probably compile time for the two packages, not actual read time, and I assume one only ever has to pay it once per Julia session. So we get roughly 10 seconds to read this 8 GB file.

I wanted to compare the speed with R, but I can’t even get feather installed on R…

I did compare the load speed for the same file with pandas, with this script:

import pandas as pd
import time

t0 = time.time()
pd.read_feather("bar.feather")
t1 = time.time()

total = t1-t0
print(total)

t0 = time.time()
pd.read_feather("bar2.feather")
t1 = time.time()

total = t1-t0
print(total)

I consistently get 10 seconds for that, for both load attempts.

So essentially Julia seems to have the same performance as pandas, once the code in DataFrames.jl and FeatherFiles.jl is actually compiled.

I don’t know :slight_smile: I actually asked whether we could get a release of DataFrames.jl soon; the answer I got is this. I also don’t know what kind of policy they have for the stability of master.

3 Likes