My experiences reading CSVs from the Fannie Mae datasets

aaowens · March 1, 2019, 11:37pm

I saw these new docs on out of core. Maybe it will help you: https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/

I guess you need to increase the number of chunks enough so that it will fit in memory.

Yifan_Liu · March 2, 2019, 12:49am

Actually, the only JuliaDB example, which uses TrueFX data, is not that useful. TrueFX has a very good API in R and Python, so I can pull the data directly into R or Python with custom filtering conditions; there is no need to download so many CSV files and then load them into JuliaDB.

I hope JuliaDB could use a more useful example.

xiaodai · March 19, 2019, 9:52am

CSV.jl can read the data now without much help! But it’s still slower than using R’s data.table and then reading the data into Julia using @rget. So performance is still an issue.

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")

run(`unzip ok.zip`)

using DataFrames, TableReader

@time a = readcsv("Performance_2016Q1.csv", delim = '|', header = false, chunksize = 0)


using FileIO, TextParse, CSV, DataFrames
@time b = CSV.read("Performance_2016Q1.csv", delim = '|', header=0)
@time adf = DataFrame(b) # 22 seconds


using RCall

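# Named fread_rget so it does not clash with the variable `a` assigned above.
function fread_rget()
    R"""
    adf1 = data.table::fread('Performance_2016Q1.csv')
    """
    @rget adf1      # copy the R data frame into Julia
    R"rm(adf1)"     # free the copy held by R
    adf1
end

@time fread_rget() # 15 seconds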

sdanisch · March 19, 2019, 10:23am

I see you have TableReader in there, but don’t report any timings?
I tried it, and it seems like the fastest of the three!
Timings:
R.fread: 20s
CSV: 12s
TableReader: 8s

using TableReader, CSV, RCall

path = "/home/sd/Downloads/Performance_2016Q1.csv"

function fread()
    R"""
    adf1 = data.table::fread($path)
    """
    @rget adf1
    R"rm(adf1)"
    adf1
end

@time readcsv(path, delim = '|', chunksize = 0, header = [Symbol("var_$i") for i in 1:31])
@time CSV.read(path, delim = '|', header = 0)
@time fread()

xiaodai · March 19, 2019, 10:36am

Now I will try it! It ran into a bug, so I reported it, but the magic seems to be header = ...

Now I am getting an error, which I have reported here.

Doing data.table::fread in R should still be faster, though.

js135005 · March 19, 2019, 12:17pm

I’m hitting the same TableReader problem (on Windows) with a fairly small TSV file when specifying the delim='\t' keyword.

Yifan_Liu · March 19, 2019, 2:25pm

Just a quick question: is there any function to read fst files in Julia 1.1?

xiaodai · March 19, 2019, 8:36pm

FstFiles.jl or FstFormatFiles.jl, part of the Queryverse. But it doesn’t load them natively: it uses RCall, so data transfer is slow.

xiaodai · March 19, 2019, 8:55pm

Do you think you can run the above on Windows and see if you run into issues?

sdanisch · March 20, 2019, 10:51am

On my slower Windows PC I get:

TableReader: ReadOnlyMemoryError()
CSV: 25s
fread: 29s

xiaodai · March 20, 2019, 10:57am

Ok, so we are seeing the same issues.

xiaodai · March 20, 2019, 11:51pm

Oh wow. Using R’s data.table via RCall and then writing the file to disk using feather and then reading it in using Feather.jl is faster than all the CSV parsers in the Julia-verse I’ve tried.

See for yourself:

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")

run(`unzip ok.zip`)
using FileIO, CSVFiles, DataFrames

using RCall, DataFrames, Feather
path = "Performance_2016Q1.csv"

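# Parse the CSV in R, hand it over through a Feather file, and read that in Julia.
# Note: memory.limit is a Windows-only setting in R.
function fread(path)
    R"""
    memory.limit(4095*2)
    feather::write_feather(data.table::fread($path), "x.feather")
    gc()
    """
    Feather.read("x.feather")
end

@time a = fread(path) # 20 seconds, vs 50 seconds for the Julia CSV readers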

matthieu · March 21, 2019, 1:12pm

I have a hard time understanding why this is faster than rget! In any case, maybe you could write a small package with this function?

sdanisch · March 21, 2019, 1:43pm

Because Feather was specifically designed for performance, while CSV was born out of convenience…

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

    Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible

    Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.

    High read and write performance. When possible, Feather operations should be bound by local disk performance.
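
To make that last design goal concrete, here is a toy sketch in plain Julia with no packages. It is not the Feather format: it only contrasts parsing one numeric column from text with bulk-reading the same values from raw bytes. The file names and the helpers `read_text` and `read_binary` are made up for the illustration.

```julia
# Toy comparison: text parsing vs raw binary read for one Float64 column.
# This is NOT Feather, only an illustration of why a binary layout reads faster.
n = 1_000_000
col = rand(n)

dir = mktempdir()
txt_path = joinpath(dir, "col.txt")
bin_path = joinpath(dir, "col.bin")

# Text: one number per line, like a single-column CSV.
open(txt_path, "w") do io
    for x in col
        println(io, x)
    end
end

# Binary: the raw 8-byte values written back to back.
open(bin_path, "w") do io
    write(io, col)
end

# Reading text means parsing every field.
read_text(path) = [parse(Float64, line) for line in eachline(path)]

# Reading binary is a bulk copy into a preallocated vector.
function read_binary(path, n)
    out = Vector{Float64}(undef, n)
    open(io -> read!(io, out), path)
    return out
end

# First calls include compilation; the timed calls below do not.
from_text = read_text(txt_path)
from_bin = read_binary(bin_path, n)

@time read_text(txt_path)
@time read_binary(bin_path, n)
```

Both readers return the same values; only the amount of work per value differs. Real formats such as Feather add metadata, types, and missing-value handling on top of this idea.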

matthieu · March 21, 2019, 1:50pm

The weird thing is that, to communicate between R and Julia, rget is slower than writing/reading a feather file.

sdanisch · March 21, 2019, 2:39pm

Why is that weird?
Communication between R & Julia should be very fast, and reading/writing Feather should be substantially faster than reading a CSV file, even if you use the absolute best CSV reader.

xiaodai · March 21, 2019, 2:44pm

The issue isn’t that rget is slower. It’s that reading the CSV in R, writing it out to Feather, and then reading that in is faster than reading the CSV directly in Julia.

Maybe a faster CSV reader in Julia will take time to emerge or it’s not possible at all. It’s not certain either way for me.

sdanisch · March 21, 2019, 2:57pm

Ah, guess I should have looked at what rget actually does… I didn’t realize that the R benchmark never measured pure R performance: it also included converting the result to Julia, etc.!
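
One way to see where the time goes is to time the two stages separately. A minimal sketch, assuming R with the data.table package is installed and RCall is configured; the tiny pipe-delimited stand-in file and the names `t_parse` and `t_copy` are only for illustration:

```julia
using RCall

# Tiny stand-in file so the sketch is self-contained; swap in the real CSV path.
path = joinpath(mktempdir(), "tiny.csv")
write(path, "1|a\n2|b\n3|c\n")

# Stage 1: parse inside R only. The result stays in the R session.
t_parse = @elapsed R"adf1 <- data.table::fread($path, header = FALSE)"

# Stage 2: copy the R object into Julia (roughly the work @rget does).
t_copy = @elapsed adf1 = rcopy(R"adf1")

# Free the copy held by R.
R"rm(adf1)"

println("fread in R: ", t_parse, " s; R -> Julia copy: ", t_copy, " s")
```

Run each stage once beforehand if you want to exclude Julia's compilation time from the numbers.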

matthieu · March 21, 2019, 3:45pm

Oh I see. So rget is still faster than reading/writing with feather? In any case, could you create a fread package that does this kind of stuff automatically? I think it would be useful.

xiaodai · March 21, 2019, 3:49pm

I guess so… this is so weird though. All Julia CSV readers feel lethargic atm. Grouping operations are improving though.
