My experiences reading CSVs from the Fannie Mae datasets

I saw these new docs on out of core. Maybe it will help you: https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/

I guess you need to increase the number of chunks enough so that it will fit in memory.
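
To illustrate the chunking idea in plain Julia, here is a minimal sketch that streams a '|'-delimited file in fixed-size batches of lines, so only one batch is held in memory at a time. This is not JuliaDB's API: `chunked_column_sum`, `chunk_rows` and the choice of summing a single column are hypothetical, picked only for illustration.

```julia
# Sum one numeric column of a delimited file while holding at most
# `chunk_rows` lines in memory at any time.
# NOTE: hypothetical helper for illustration, not part of JuliaDB.
function chunked_column_sum(path::AbstractString, col::Integer;
                            chunk_rows::Integer = 100_000, delim::Char = '|')
    nrows = 0
    total = 0.0
    # eachline(path) opens the file lazily; partition groups the lines
    # into vectors of at most `chunk_rows` elements.
    for chunk in Iterators.partition(eachline(path), chunk_rows)
        for line in chunk
            fields = split(line, delim)
            if col <= length(fields)
                value = tryparse(Float64, fields[col])  # `nothing` for empty or non-numeric fields
                value === nothing || (total += value)
            end
            nrows += 1
        end
    end
    return (nrows = nrows, total = total)
end
```

A smaller `chunk_rows` lowers peak memory at the cost of more batches, which is the same trade-off as raising the number of chunks.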

Actually, the only JuliaDB example, the one that uses TrueFX data, is not that useful. TrueFX has a very good API in R and Python, so I can pull the data directly into R or Python with customized filtering conditions; there is no need to download so many CSV files and then load them into JuliaDB.

I hope JuliaDB could use a more useful example.

CSV.jl can read the data now without much help! But it’s still slower than using R’s data.table and then reading the data into Julia using @rget. So performance is still an issue.

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")

run(`unzip ok.zip`)

using DataFrames, TableReader

@time a = readcsv("Performance_2016Q1.csv", delim = '|', header = false, chunksize = 0)


using FileIO, TextParse, CSV, DataFrames
@time b = CSV.read("Performance_2016Q1.csv", delim = '|', header=0)
@time adf = DataFrame(b) # 22 seconds


using RCall

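# Renamed from `a`, which is already bound to the readcsv result above.
function fread_rget()
    # Parse the file in R with data.table, then copy the result into Julia.
    R"""
    adf1 = data.table::fread('Performance_2016Q1.csv')
    """
    @rget adf1
    R"rm(adf1)"  # drop the copy still held by R
    adf1
end

@time fread_rget() # 15 seconds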

I see you have TableReader in there, but don’t report any timings?
I tried it, and it seems like the fastest of the three!
Timings:
R.fread: 20s
CSV: 12s
TableReader: 8s

using TableReader, CSV, RCall
function fread()
    R"""
    adf1 = data.table::fread('/home/sd/Downloads/Performance_2016Q1.csv')
    """
    @rget adf1
    R"rm(adf1)"
    adf1
end

path = "/home/sd/Downloads/Performance_2016Q1.csv"  # same file that fread() reads above

@time readcsv(path, delim = '|', chunksize = 0, header = [Symbol("var_$i") for i in 1:31])
@time CSV.read(path, delim = '|', header=0)
@time fread()
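
A caveat on single `@time` calls like these: the first call of a Julia function also pays for JIT compilation, so one measurement can overstate the steady-state cost. Below is a minimal sketch of a helper that separates the first call from later ones. The name `warm_timings` and the `runs` keyword are made up for this example.

```julia
# Time the first call (which includes compilation) separately from
# subsequent calls. NOTE: hypothetical helper, not from any package.
function warm_timings(f; runs::Integer = 3)
    first_call = @elapsed f()                  # includes JIT compilation
    later = [@elapsed(f()) for _ in 1:runs]    # already-compiled calls
    return (first_call = first_call, best = minimum(later))
end
```

It could be used as `warm_timings(() -> CSV.read(path, delim = '|', header = 0))`. For a multi-second file read the compilation share may well be small, but reporting both numbers makes the comparison between readers easier to trust.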

Now I will try it! It ran with a bug so I reported it, but the magic seems to be header = ...

Now I am getting an error, which I have reported here.

doing data.table::fread in R should still be faster than.

I’m hitting the same TableReader problem (on Windows) with a fairly small TSV file when specifying the delim='\t' keyword.


Just a quick question: is there any function to read fst files in Julia 1.1?

FstFiles.jl or FstFormatFiles.jl, part of the Queryverse. But it doesn’t load them natively; it uses RCall, so data transfer is slow.

Do you think you can run the above on Windows and see if you run into issues?

On my slower Windows PC I get:

TableReader: ReadOnlyMemoryError()
CSV: 25s
fread: 29s

Ok, so we are seeing the same issues.

Oh wow. Using R’s data.table via RCall and then writing the file to disk using feather and then reading it in using Feather.jl is faster than all the CSV parsers in the Julia-verse I’ve tried.

See it for yourself

download("https://github.com/xiaodaigh/testing/raw/master/Performance_2016Q1.zip", "ok.zip")

run(`unzip ok.zip`)
using FileIO, CSVFiles, DataFrames

using RCall, DataFrames, Feather
path = "Performance_2016Q1.csv"

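function fread(path)
    # Parse the CSV in R and write it straight to a Feather file.
    # memory.limit() only has an effect on Windows builds of R.
    R"""
    memory.limit(4095*2)
    feather::write_feather(data.table::fread($path), "x.feather")
    gc()
    """

    # Read the Feather file back into Julia.
    Feather.read("x.feather")
end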

@time a = fread(path) # 20 seconds, vs 50 seconds for the Julia CSV readers

I have a hard time understanding why this is faster than rget! In any case, maybe you could write a small package with this function?

Because Feather was specifically designed for performance, while csv was born out of convenience…

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

    Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible

    Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.

    High read and write performance. When possible, Feather operations should be bound by local disk performance.

The weird thing is that, to communicate between R and Julia, rget is slower than writing/reading a feather file.

Why is that weird?
Communication between R & Julia should be very fast, and reading/writing Feather should be substantially faster than reading a CSV file, even if you use the absolute best CSV reader.
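
To make the text-versus-binary point concrete, here is a rough sketch that writes the same Float64 column once as text and once as raw bytes, then times reading each back. A raw byte dump is not the Feather format (Feather adds columnar metadata and types); it only shares the property that nothing has to be parsed from text. The function name `text_vs_binary` and the file names are assumptions made for this example.

```julia
# Compare reading a numeric column from text (needs parsing) with reading
# it from raw binary (a plain byte copy).
# NOTE: illustration only; this is not how Feather lays out its files.
function text_vs_binary(n::Integer = 1_000_000)
    values = rand(n)
    dir = mktempdir()
    txt = joinpath(dir, "col.csv")
    bin = joinpath(dir, "col.bin")

    open(txt, "w") do io
        for v in values
            println(io, v)       # one decimal string per line
        end
    end
    write(bin, values)           # 8 raw bytes per value

    t0 = time_ns()
    from_text = [parse(Float64, line) for line in eachline(txt)]
    t_text = (time_ns() - t0) / 1e9

    t0 = time_ns()
    from_bin = read!(bin, Vector{Float64}(undef, n))
    t_bin = (time_ns() - t0) / 1e9

    return (t_text = t_text, t_bin = t_bin,
            same = from_text == from_bin == values)
end
```

Calling `text_vs_binary()` returns both timings plus a check that the two reads agree. The exact ratio depends on the machine and disk, so treat the numbers as indicative only.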

The issue isn’t that rget is slower; it’s that reading the CSV in R, writing it out to Feather, and then reading that file in is faster than reading the CSV directly in Julia.

Maybe a faster CSV reader in Julia will take time to emerge, or maybe it’s not possible at all. I’m not certain either way.

Ah, I guess I should have looked at what rget actually does… I didn’t realize that the R benchmark never measured pure R performance; it also included converting the result to Julia, etc.!
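
One way to see where the time goes is to time each stage of the pipeline separately instead of the whole function. Here is a minimal sketch using a made-up helper, `time_stages`, which feeds each stage's result into the next and records the seconds spent in each. The stages shown in the comment (parsing in R, then copying into Julia via `rcopy`) are an untested assumption about how this would look with RCall.

```julia
# Run named stages in order, passing each result to the next stage,
# and record how long each stage took in seconds.
# NOTE: hypothetical helper for illustration.
function time_stages(input, stages::Pair{String}...)
    timings = Pair{String,Float64}[]
    value = input
    for (name, stage) in stages
        t0 = time_ns()
        value = stage(value)
        push!(timings, name => (time_ns() - t0) / 1e9)
    end
    return value, timings
end

# With RCall loaded, the two stages from this thread might look like
# (assumption, not run here):
#   time_stages(path,
#       "fread in R"    => p -> R"adf1 = data.table::fread($p)",
#       "copy to Julia" => _ -> rcopy(R"adf1"))
```

If most of the time lands in the copy stage, the bottleneck is the R-to-Julia transfer and not fread itself.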

Oh I see. So rget is still faster than reading/writing with feather? In any case, could you create a fread package that does this kind of stuff automatically? I think it would be useful.


I guess so… this is so weird though. All Julia CSV readers feel lethargic atm. Grouping operations are improving though.