Reading Data Is Still Too Slow

The timing from Feather.read is a bit misleading, though, because it actually doesn’t read any of the data from disc. All it does is read the metadata from disc (i.e. what columns are there in the file, what types do they have), and then it creates a DataFrame that will load things from disc on-demand when you actually access the values. So you will only pay the price of loading stuff from disc once you start to access the data (e.g. by indexing into the DataFrame). Of course, at that point these kind of data access operations on DataFrame will be much, much slower than with a normal DataFrame (because they actually include a trip to the disc). That all can be nice if one knows about it, but it of course means that you want to write very different algorithms and use special access patterns when you have one of these special DataFrames. I also think manipulating such a DataFrame probably also won’t work properly.

I think in theory Feather.materialize should give you a normal DataFrame, i.e. load everything from disc and load stuff into main memory, in a DataFrame. But at least on my system that seems not to work right now. But in any case, an apples-to-apples comparison of read performance would use Feather.materialize, not Feather.read.

FeatherFiles.jl is another option. You would read a file via:

using FeatherFiles, DataFrames

df = load("foo.feather") |> DataFrame

So the same API pattern you use for CSVFiles.jl. It will read everything fully from disc, i.e. give you a normal DataFrame.

The same caveat about overhead that I made with respect to CSVFiles.jl applies here as well: if you use the master branch of DataFrames.jl you’ll get much better performance than with the released version of DataFrames.jl.

You can get a comprehensive documentation about the file IO story in Queryverse.jl over here.

3 Likes

@iwelch maybe you can also try this? 6G isn’t too big and I am assuming that you are reading everything into memory.

@tk3369 We should add CSVReader.jl to https://github.com/davidanthoff/csv-comparison! Any chance you might register it in METADATA.jl, then I could add it into the mix for the comparison.

2 Likes

What I meant is that we probably don’t care about the data, any set of columns of the same type containing random data should give similar timings. So you can write a few lines of code to create such a data frame, and if it reproduces the problem that will make it much easier to debug. If it doesn’t reproduce the problem, then that will be an interesting piece of information (meaning particular values matter).

3 Likes

[re: davidanthoff] Thanks, David, for the Feather correction. Feather.materialize did not work for me, either, but FeatherFiles did. Its performance was about 130 seconds. This is better than the 300 second deserialization time and 600 second CSV.read time. However, this still seems poor. R takes about 30 seconds.

I also just posted my serialization experience in De-Serialization Performance .

1 Like

@iwelch was that with the master branch of DataFrames.jl, or the last released version?

Could you maybe post the output of running

eltypes(your_dataframe)

here? And how many rows the DataFrame has? I think that would be enough for us/me to start trying to replicate things with fake data.

@iwelch earlier linked to the Julia Cookbook which includes a snippet of the first 10 lines of the file. I don’t know how helpful that is but wanted to point it out.

Ah, perfect, I think that is all I needed! Lets dig in :slight_smile:

Actually, the output of eltypes would still be helpful, just to be sure I get the right column types.

I tried to load a feather file now that seems roughly similar to the CRSP file described here: 90 million rows, a mix of Int and Float64 columns. The file is about 8.5 GB on disc.

With the current released version of DataFrames.jl, it takes about 80 seconds to load on my system (I do have a very fast system :slight_smile: ). When I use the master branch of DataFrames.j, it takes somewhere between 7-15 seconds to load. All of these numbers are for FeatherFiles.jl.

So I suspect (or hope :wink: ) that @iwelch’s numbers from above are with the released DataFrames.jl, in which case we might actually have something very competitive once we get a new DataFrames.jl release out.

Caveat is that I haven’t tried a column with missing values yet.

2 Likes

yes, I try to stick to released versions, so my numbers were for DataFrames 0.14.1. May I ask what the public release policy is? (does master become stable relatively quickly?)

my numbers that were so incredibly slow (300 seconds) when all columns were Union{Missing,*}…from the other thread.

I do not believe that this helps any longer, given the other thread, but here is what I wrote to replicate a CRSP like data set (before I realized that it was a more basic vector problem with Missing):


using DataFrames, Serialization, Missings

using Serialization, DataFrames
import Serialization.serialize
serialize( filename::AbstractString, d::DataFrame )= open(filename, "w") do ofile; serialize(ofile, d); end;
import Serialization.deserialize
deserialize( filename::AbstractString )= ( o=DataFrame(); open(filename, "r") do ofile; o= deserialize(ofile); end; o );


work= (
        ( "permno", 10000:99999 ),
        ( "yyyymmdd", 19260101:20161230 ),
        ( "prc", 22.0 ),
        ( "vol", 0:1897900032, 6350829 ),
        ( "ret", 0.0008 ),
        ( "shrout", 0:29206400 ),
        ( "openprc", 35.19, 38619189 ),
        ( "numtrd", 0:1030000, 60751337 ),
        ( "retx", 0.008 ),
        ( "vwretd", 0.0004 ),
        ( "ewretd", 0.0008 ),
        ( "eom", 0:1 )
       );

const N= 88915607

df= DataFrame()
for s in work
    b= (typeof(s[2])==UnitRange{Int64}) ? rand(s[2], N) : ( randn( N ) .+ s[2] )

    b= allowmissing( b )   ## degrades deserialize performance: alloc will go from 27GB to 85GB, time from 60s to 300s

    if (length(s) == 3)
        b= allowmissing( b )
        [ b[r]= missing for r in rand( 1:N, s[3] ) ]
    end

    df[ Symbol(s[1]) ]= b
end

serialize( "crsplike-allallowmissing.jls", df )

println("written jls and csv files")

@iwelch That is very helpful! I took your code to generate a DataFrame. I then saved it as a feather file with FeatherFiles.jl: df |> save("bar.feather"). I then quit julia and restarted, just to make sure nothing is hanging around.

I then loaded the file again with:

using FeatherFiles, DataFrames

@time df = load("bar.feather") |> DataFrame;

The first time this takes 17 seconds. I then made a copy of the file on disc, and read that copy in, and that takes 9 seconds. So I think the difference between the 17 and 9 seconds is probably the compile time for the two packages, not actually read time, and I assume one only ever has to pay it once per julia session. So we get roughly 10 seconds to read this 8 GB file.

I wanted to compare the speed with R, but I can’t even get feather installed on R…

I did compare the load speed for the same file with pandas, with this script:

import pandas as pd
import time

t0 = time.time()
pd.read_feather("bar.feather")
t1 = time.time()

total = t1-t0
print(total)

t0 = time.time()
pd.read_feather("bar2.feather")
t1 = time.time()

total = t1-t0
print(total)

I consistently get 10 seconds for that, for both load attempts.

So essentially julia seems to have the same performance as pandas once the code in DataFrames.jl and FeatherFiles.jl is actually compiled.

I don’t know :slight_smile: I actually asked whether we could get a release of DataFrames.jl soon, the answer I got is this. I also don’t know what kind of policy they have for the stability of master

3 Likes

Perhaps not directly relevant, but when doing storage benchmarkign worj with packages such as iozone https://www.iozone.org and ior https://github.com/llnsl/ior it is common to select the flag which says “fsync to disk” and in these days of servers with a lot of rAM to use a file size greater than the RAM size to avoid cacheing.

Duuuh - lightbulb moment. If we use a huge file here we will end up benchmarking the storage performance, not the Julia writing logic.

when it’s 10 seconds for a 1GB file, we no longer need to worry about whether IO speed is due to memory or disk speed. we are far beyond disk/memory constraints, and far into julia plumbing issues.

if this had been between 0.2 and 1.0 seconds, then we would have had to worry about whether we are benchmarking memory, disk, or julia plumbing. and then we can get clever about it.

Just wondering if this slow data importing issue will be solved when Julia 1.2 is released.

pretty sure it’s less of a Julia issue than a CSV, DataFrame issue.

I thought to add multi-thread functionality to data import, the package developers need to have PARTR ready first.

Note that since the 0.5 release of CSV.jl (announced here), many of the previous performance issues have been resolved (and in some cases, CSV.jl is faster than any other language library). Multi-threading support in Base will be released with 1.3, and I’m planning on working on supporting that in CSV.jl as well.

20 Likes

Please feel free to reuse the BlockIO logic that’s being used as part of this JuliaDB PR to get a headstart on that implementation. It breaks CSV files into approximately evenly-sized “blocks” which could then be read in parallel.

You are a legend!

3 Likes