Alternative to DataFrame Readtable to read large data files with headers

astrowanders · September 17, 2018, 9:08pm

Hello!

I’ve been using Julia to work with output from a stellar evolution code. My data files have several columns (~60 or more) and hundreds of rows. In previous versions of Julia, I used readtable in the DataFrame package to work with my data. I really appreciated that readtable not only read the numerical data, but also the headers, which I could then call by their names. For example: if I read file X as

data = readtable(X, skipstart=5, separator = ’ ')

and one of the column headers is “star_age”, then I could operate on, plot with, etc. the that column with

data[:star_age]

Since readtable has been depreciated, it is not clear to me what the best package to use is. It’s been previously noted that read in the CSV package is very slow. I’ve also found that CSV.read and DelimitedFiles.readdlm require specifying the header names? Since my files have ~60 columns, I’d rather not do that.

bkamins · September 17, 2018, 9:23pm

You have many options here.

CSV.jl package is actually very fast.
readdlm from DelimitedFiles module can read column names with header option.

Another alternative is https://github.com/queryverse/CSVFiles.jl (which is a bit slower than CSV.jl but does a bit more validations by default than CSV.jl).

In general for the size of data you describe any of those options should have sub-second run time.

astrowanders · September 17, 2018, 9:58pm

Thanks @bkamins!

The previous thread on CSV.read has others who experience it taking nearly an hour to read files - this has been my experience, although I haven’t timed it.

With the header option as true in DelimitedFiles.readdlm, I get the headers as an array separate from the numbers. However DataFrames.readtable seemed to make the headers into structures, so I didn’t need to manually specify the header names.

Unless there are ways to make CSV faster and either language recognize the headers so they’re easy to work with, I’m not sure what the most efficient way to proceed is.

bkamins · September 17, 2018, 10:17pm

If you have problems with CSV.jl performance please report an issue - this for sure can be fixed.

If the options I have given really do not work for you for some reasons try GitHub - bkamins/Nanocsv.jl: A minimal implementation of CSV reader/writer for Julia. It is not as fancy as other packages but was designed to load standard CSVs directly into a DataFrame.

aaowens · September 17, 2018, 11:21pm

The thread you linked is somewhat old and seems to have been using Julia 0.6. There might no longer be problems if you use Julia 1.0. Have you tried this?

If you like R and data.table, you could use RCall.jl to read the data in R and pull it into Julia as a DataFrame.

quinnj · September 18, 2018, 5:53am

As others have mentioned, please file an issue if you notice performance issues. I’m always eager to improve things.

tk3369 · September 18, 2018, 7:18am

Here’s another alternative:

I happened to have cooked up a CSV reader this past weekend. The goal is to read small/medium sized files into DataFrame more quickly. It’s very WIP but somewhat functional.

nalimilan · September 18, 2018, 7:47am

To be clear, neither CSV.jl nor DelimitedFiles require specifying the column names manually.

astrowanders · September 18, 2018, 4:51pm

Hello!

Thank you all for your responses and suggestions. I’m following up with more details about my issues with CSV.read.

Here is a link to an example output file from the stellar evolution code that I mentioned. It has 61 columns and 177 rows of data, not considering all of the header data.

With readtable in DataFrame, which is now depreciated, I was able to read all of the data, and the entire file was imported as a data frame. Here is a screenshot

In contrast, with read in CSV, it takes a noticeably longer time to read the same file, and it doesn’t do so correctly in either Julia 0.7 or Julia 1.0 Here is a screenshot, where the file is read as 177 rows and 141 columns rather than the 61 actual columns. There are also a lot of missing values, which are not actually missing in the file, nor were noted with DataFrame.readtable.

I’ll note that I made datarow here 7 rather than skipstart=5 in DataFrame.readtable above because they are defined slightly differently.

I used validate in CSV to get detailed information about why CSV.read failed, and was met with the following error concerning too many columns.

So going back to why I started this thread: How should I efficiently import a file with a large number of columns now that DataFrames.readtable has been depreciated? Is there a way to make CSV.read work for my data files? Is there a different package I should use?

aaowens · September 18, 2018, 5:34pm

That’s interesting. It works well in R with data.table, and it reads the file instantly. You might try this with RCall.jl.

using RCall
R> library(data.table)
R> out = fread("history.data", header = TRUE, skip = "model_number")
julia> out = @rget out
177×61 DataFrames.DataFrame. Omitted printing of 56 columns
│ Row │ model_number │ num_zones │ star_age   │ log_dt    │ star_mass │
├─────┼──────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1   │ 1            │ 1446      │ 0.00861171 │ -2.06491  │ 1.29999   │
│ 2   │ 5            │ 1437      │ 0.0640849  │ -1.74819  │ 1.29999   │
│ 3   │ 10           │ 1442      │ 0.223549   │ -1.35228  │ 1.29999   │
│ 4   │ 15           │ 1424      │ 0.620346   │ -0.956373 │ 1.29999   │
│ 5   │ 20           │ 1406      │ 1.53212    │ -0.676248 │ 1.29999   │
│ 6   │ 25           │ 1356      │ 3.27152    │ -0.3956   │ 1.29999   │
│ 7   │ 30           │ 1326      │ 6.4055     │ -0.114577 │ 1.29999   │
│ 8   │ 35           │ 1283      │ 12.2081    │ 0.143662  │ 1.29999   │

evanfields · September 18, 2018, 5:39pm

I think your problem is that your history.data file is nastily formatted. There’s an irregular header of 5 rows, as you note. More importantly, the separator character is not a comma but instead whitespace of variable length! Yikes!

I replaced all horizontal whitespace (regex /h+, though I don’t know regex and that was just from a quick search, perhaps there’s a better way) with the , character. Then doing CSV.read("/path/to/history.data"; header = 6) works fine and takes 0.05 seconds.

Perhaps CSV.jl could auto-detect the separator or easily support variable-length whitespace separators, though I assume that’d come at a speed penalty.

pdeffebach · September 18, 2018, 5:42pm

However R works flawlessly! so thats the standard we have to use.

OP you should file an issue because I can also replicate. I could imagine some sort of delim = '\s' option for data that looks like yours.

nalimilan · September 18, 2018, 7:44pm

CSV.read has gained a new skiprepeated keyword argument recently, but it’s not yet in a released version (see this PR).

Also note that even if readtable is deprecated, it won’t be removed from DataFrames until all of these issues are fixed, so you can keep using it for now.

astrowanders · September 19, 2018, 5:48pm

Thanks everyone for your time!

@evanfields I realize that the file has a unideal format I wouldn’t want to change the formatting because:

Since I have several files formatted the same way, it would be time intensive to add commas like you did.
More importantly: The stellar evolution code both writes and reads files with this formatting, so if I did change the file formatting, I couldn’t use them in the code again

I would say that astrophysicists are clumsy coders, but one of the writers of the stellar evolution code is a real computer scientist who developed the PDF with Adobe.

@nalimilan When I try to use readtable with Julia 1.0, not only do I get a depreciation warning, but also a Method Error. With Julia 0.6.4, I see the depreciation warning, but not the Method Error warning, so it does actually work. Is this a version difference? I have DataFrames 0.11.6 with Julia 0.6.4 and DataFrames 0.13.1 with Julia 1.0

@aaowens & @pdeffebach Thanks for the notes about using R. I haven’t used RCall in Julia before. When I tried to call data.table, I get this error - I feel like I must be missing something basic about RCall

I’ll add that these are the current versions of the packages that I’m using with Julia 1.0

aaowens · September 19, 2018, 5:59pm

You need to install the data.table package in R. Try install.packages("data.table") in R

astrowanders · September 19, 2018, 6:19pm

Thank you @aaowens!!!

nalimilan · September 19, 2018, 6:50pm

Good catch, we missed a deprecation. I’ve filed a PR.

FWIW, better use Julia 0.7 for some time as it prints deprecation warnings in many cases where Julia 1.0 just errors. Packages haven’t all been completely ported yet.

davidanthoff · November 12, 2018, 7:30pm

CSVFiles.jl reads the file properly:

using CSVFiles

load("history.data.csv", spacedelim=true, skiplines_begin=5)

Note that I renamed the file extension here, so that load automatically picks the right format. If you don’t want to do that, you can also do this:

using CSVFiles, FileIO

load(File(format"CSV", "history.data"), spacedelim=true, skiplines_begin=5)

If you want the data in a DataFrame, just pipe it into one (or any other tabular data structure):

using CSVFiles, DataFrames

df = load("history.data.csv", spacedelim=true, skiplines_begin=5) |> DataFrame

Topic		Replies	Views
Suggestions for a package to read tabular data Data question	12	2724	February 13, 2017
[ANN] Fread.jl - read CSVs faster with the help of R's {data.table} Package Announcements performance , data , csv	6	2053	October 9, 2019
Best practice for reading large matrix with repeated spaces as text General Usage dataframes , csv	8	516	April 20, 2023
Access columns by variable name with `readdlm` from `DelimitedFiles` General Usage	4	1129	October 15, 2019
CSV Reader / Writer Choices Data	1	735	August 28, 2018

Alternative to DataFrame Readtable to read large data files with headers

Related topics