Suggestions for a package to read tabular data

question

#1

I have thousands of *.csv files to process, each containing tabular data. I usually use Python + Pandas. To get higher performance, I tried switching from Python to Julia for some of my data processing work. I tested DataFrames.jl against Pandas and found that Pandas is much faster. Is there any other package in Julia that runs faster than Pandas?

I found that there are several packages for handling tabular data. Can anyone suggest which one is most suitable?
DataFrames.jl
TypedTables.jl
DataTables.jl


#2

CSV.jl?


#3

What about Pandas.jl?


#4

Out of curiosity, is this with the NullableArray backend?


#5

Thanks! I have also tested CSV.jl, and it has similar speed to DataFrames.jl for my data (a little bit faster). I have thousands of files, so what I care about is the speed of loading each file.


#6

Thanks for your suggestion. I will try this package.
As for DataFrames.jl, I am running Windows 7 + Julia 0.5 + DataFrames 0.8.5.


#7

Okay, you’re not using master. You may want to read this:

There should be an update very soon with a lot of changes, and one big change is increased performance. It may be worth holding out for.


#8

Sounds exciting! I will wait for the new package.
Once the CSV file is loaded, row iteration with DataFrames.jl + Julia is much faster than with Pandas + Python.


#9

Can you share your experience with CSV.jl? The package aims to be the fastest csv reader for Julia and in most cases, that has been the experience. I’d love to hear how it’s falling short for you. (Interested because I’m the primary package author :smiley:)


#10

I tested CSV.jl. It is a bit faster than DataFrames.jl for my data, but still much slower than Pandas + Python. It is very convenient to use. Thanks for your work.


#11

Once the CSV file is loaded, row iteration with DataFrames.jl + Julia is much faster than with Pandas + Python.

In case you are interested, TypedTables is a package specifically designed for fast row iteration. Unfortunately, it requires the user to do a few small contortions (getting the data into the right format/type, extracting the rows, using a functional workflow, etc.), and it is definitely less flexible than DataFrames, but I thought I would mention it in case speed is important and you are writing your own code to process rows.

(Note: DataFrames used with care, or with one of the querying frameworks, can be similarly fast. When using either package, the thing that needs to be done right is to give the compiler the information of the concrete types stored in the table when it is processing the data. If you are iterating over rows yourself with DataFrames, you should get a large speedup from using a strongly-typed approach).


#12

Thanks for your suggestion. I will take a look. The bottleneck in my code is the time spent loading the data, since there are thousands of files to process. Row iteration with the Julia packages is very fast compared with Python + Pandas.
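
To check that loading really is the bottleneck, the read step can be timed on its own. The following is only a rough sketch using the Python standard library as a baseline (not Pandas, DataFrames.jl, or CSV.jl); the function name `time_csv_loading`, the glob pattern, and the UTF-8 encoding are assumptions made for illustration.

```python
import csv
import glob
import time


def time_csv_loading(pattern):
    """Parse every file matching `pattern`; return (files, rows, seconds)."""
    n_files = 0
    n_rows = 0
    start = time.perf_counter()
    for path in sorted(glob.glob(pattern)):
        # newline="" is what the csv module expects for file objects;
        # the UTF-8 encoding is an assumption about the input files.
        with open(path, newline="", encoding="utf-8") as handle:
            n_rows += sum(1 for _ in csv.reader(handle))
        n_files += 1
    elapsed = time.perf_counter() - start
    return n_files, n_rows, elapsed
```

Calling something like `time_csv_loading("data/*.csv")` (a hypothetical path) before and after swapping in another reader gives a like-for-like number for the load step alone, separate from any row processing.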


#13

What is(/are) the source(s) for the CSV files?
Do you know what format they are in? (MS Excel outputs RFC-4180 compliant files, MySQL by default does not.)
It will be important to know what the character set encoding of the files is also (frequently Windows CP-1252 or ANSI Latin 1, or UTF-8) (unless you just have numerical data, in which case you won’t have to worry about different CSV variants or character set encoding).
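
To illustrate why the variant and the encoding matter, here is a minimal sketch using only the Python standard library. The helper name `read_csv_rows` and its parameters are made up for this example. The non-strict branch assumes a backslash-escaped dialect, which is how MySQL exports quotes by default; adjust it to whatever the actual source produces.

```python
import csv
import io


def read_csv_rows(raw, encoding="utf-8", strict_rfc4180=True):
    """Decode raw CSV bytes and return the rows as lists of strings."""
    # A wrong encoding either raises UnicodeDecodeError or silently
    # produces garbled text, so it has to be known up front.
    text = raw.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    stream = io.StringIO(text, newline="")
    if strict_rfc4180:
        # RFC 4180 style: a quote inside a quoted field is doubled ("").
        reader = csv.reader(stream, delimiter=",", quotechar='"', doublequote=True)
    else:
        # Assumed backslash-escaped variant: a quote is written as \".
        reader = csv.reader(
            stream, delimiter=",", quotechar='"', doublequote=False, escapechar="\\"
        )
    return list(reader)
```

For example, a file saved as CP-1252 that contains `"Müller, A",Zürich` parses cleanly with `encoding="cp1252"`, but raises `UnicodeDecodeError` if it is read as UTF-8. Purely numerical files avoid both problems, as noted above.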