Suggestions for a package to read tabular data

zhangliye · February 13, 2017, 3:30am

I have thousands of *.csv files to process, all containing tabular data. I usually use Python + Pandas. Hoping for better performance, I tried switching from Python to Julia for some of the data processing work. I tested DataFrames.jl against Pandas and found that Pandas is much faster. Is there any other Julia package that can run a bit faster than Pandas?

I found there are several other packages to handle table data. Can anyone give me some suggestions on the proper package?
DataFrames.jl
TypedTables.jl
DataTables.jl

stevengj · February 13, 2017, 3:36am

CSV.jl?

ChrisRackauckas · February 13, 2017, 3:37am

What about Pandas.jl?

ChrisRackauckas · February 13, 2017, 3:37am

Out of curiosity, is this with the NullableArray backend?

zhangliye · February 13, 2017, 3:40am

Thanks! I have also tested CSV.jl, and it ~~is slower than~~ has similar speed to DataFrames.jl for my data (a little bit faster). I have thousands of files, so file-loading speed is what I care about.

zhangliye · February 13, 2017, 3:45am

Thanks for your suggestion. I will try this package.
For DataFrames.jl, I am running Windows 7 + Julia 0.5 + DataFrames 0.8.5.

ChrisRackauckas · February 13, 2017, 3:48am

Okay, you’re not using master. You may want to read this:

There should be an update very soon with a lot of changes, and one big change is increased performance. It may be worth holding out for.

zhangliye · February 13, 2017, 3:55am

Sounds exciting! I will wait for the new package.
Once the CSV file is loaded, row iteration with DataFrames.jl + Julia is much faster than with Pandas + Python.

quinnj · February 13, 2017, 4:06am

Can you share your experience with CSV.jl? The package aims to be the fastest CSV reader for Julia, and in most cases that has been the experience. I'd love to hear how it's falling short for you. (I'm interested because I'm the primary package author.)

zhangliye · February 13, 2017, 4:44am

I tested CSV.jl. It is a bit faster than DataFrames.jl for my data, but still much slower than Pandas + Python. It is very convenient to use. Thanks for your work.

andyferris · February 13, 2017, 5:14am

> Once the CSV file is loaded, row iteration with DataFrames.jl + Julia is much faster than with Pandas + Python.

In case you are interested, TypedTables is a package specifically designed for fast row iteration. Unfortunately, it requires the author to do a few small contortions to get the data into the right format/type, extract the rows, use a functional workflow, etc., and it is definitely less flexible than DataFrames, but I thought I might mention it in case speed is important and you are writing your own code to process rows.

(Note: DataFrames used with care, or with one of the querying frameworks, can be similarly fast. When using either package, the thing that needs to be done right is to give the compiler the information of the concrete types stored in the table when it is processing the data. If you are iterating over rows yourself with DataFrames, you should get a large speedup from using a strongly-typed approach).

zhangliye · February 13, 2017, 5:22am

Thanks for your suggestion. I will take a look. The bottleneck for my code is the time of loading the data. There are thousands of files to process. The row-iteration operation of packages in Julia is very fast compared with Python+Pandas.
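For anyone who wants to check where the time goes across many files, here is a minimal sketch that times loading and row iteration separately. It uses only the Python standard library `csv` module rather than Pandas or any Julia package, and the helper names `csv_files` and `time_load_and_iterate` are made up for illustration. It also assumes every file is UTF-8 and has one header row.

```python
import csv
import time
from pathlib import Path


def csv_files(folder):
    """Return all *.csv files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.csv"))


def time_load_and_iterate(paths):
    """Time file loading and row iteration separately over many CSV files.

    Returns (load_seconds, iterate_seconds, data_row_count).
    Assumes UTF-8 files with a single header row (an assumption, not
    something stated in the thread).
    """
    load_s = 0.0
    iter_s = 0.0
    n_rows = 0
    for path in paths:
        t0 = time.perf_counter()
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))  # parse the whole file into memory
        t1 = time.perf_counter()
        for _row in rows[1:]:  # skip the header row
            n_rows += 1  # placeholder for real per-row work
        t2 = time.perf_counter()
        load_s += t1 - t0
        iter_s += t2 - t1
    return load_s, iter_s, n_rows
```

Calling `time_load_and_iterate(csv_files("some_folder"))` gives two totals to compare; swapping the `csv.reader` line for another reader lets the same harness compare loaders on identical files.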

ScottPJones · February 13, 2017, 8:55am

What is(/are) the source(s) for the CSV files?
Do you know what format they are in? (MS Excel outputs RFC-4180 compliant files, MySQL by default does not.)
It will be important to know what the character set encoding of the files is also (frequently Windows CP-1252 or ANSI Latin 1, or UTF-8) (unless you just have numerical data, in which case you won’t have to worry about different CSV variants or character set encoding).
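To make the encoding question concrete, here is a rough sketch of one way to cope when the encoding is unknown: try a few candidate encodings in order and parse with the first one that decodes cleanly. This is an illustration in Python using only the standard library, not something from any of the packages discussed above. The function name `decode_csv` and the candidate order are assumptions. The default `csv.reader` dialect ("excel") handles quoted fields in the RFC-4180 style; other variants would need a different dialect.

```python
import csv
import io


def decode_csv(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw CSV bytes with the first encoding that works, then parse.

    Returns (encoding_used, rows). The candidate order is an assumption:
    UTF-8 first because it rejects most non-UTF-8 input, then CP-1252,
    then Latin-1, which accepts any byte sequence and so acts as a
    last resort.
    """
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        # newline="" leaves line endings untouched so the csv module
        # can handle \r\n and embedded newlines itself.
        rows = list(csv.reader(io.StringIO(text, newline="")))
        return enc, rows
    raise ValueError("none of the candidate encodings could decode the data")
```

A clean decode does not prove the guess was right (CP-1252 and Latin-1 accept almost anything), so for text columns it is still worth knowing the actual source of the files. Purely numeric data is unaffected, as noted above.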
