Alternative to DataFrame Readtable to read large data files with headers


#1

Hello!

I’ve been using Julia to work with output from a stellar evolution code. My data files have several columns (~60 or more) and hundreds of rows. In previous versions of Julia, I used readtable in the DataFrame package to work with my data. I really appreciated that readtable not only read the numerical data, but also the headers, which I could then call by their names. For example: if I read file X as

data = readtable(X, skipstart=5, separator = ’ ')

and one of the column headers is “star_age”, then I could operate on, plot with, etc. the that column with

data[:star_age]

Since readtable has been depreciated, it is not clear to me what the best package to use is. It’s been previously noted that read in the CSV package is very slow. I’ve also found that CSV.read and DelimitedFiles.readdlm require specifying the header names? Since my files have ~60 columns, I’d rather not do that.


#2

You have many options here.

CSV.jl package is actually very fast.
readdlm from DelimitedFiles module can read column names with header option.

Another alternative is https://github.com/queryverse/CSVFiles.jl (which is a bit slower than CSV.jl but does a bit more validations by default than CSV.jl).

In general for the size of data you describe any of those options should have sub-second run time.


#3

Thanks @bkamins!

The previous thread on CSV.read has others who experience it taking nearly an hour to read files - this has been my experience, although I haven’t timed it.

With the header option as true in DelimitedFiles.readdlm, I get the headers as an array separate from the numbers. However DataFrames.readtable seemed to make the headers into structures, so I didn’t need to manually specify the header names.

Unless there are ways to make CSV faster and either language recognize the headers so they’re easy to work with, I’m not sure what the most efficient way to proceed is.


#4

If you have problems with CSV.jl performance please report an issue - this for sure can be fixed.

If the options I have given really do not work for you for some reasons try https://github.com/bkamins/Nanocsv.jl. It is not as fancy as other packages but was designed to load standard CSVs directly into a DataFrame.


#5

The thread you linked is somewhat old and seems to have been using Julia 0.6. There might no longer be problems if you use Julia 1.0. Have you tried this?

If you like R and data.table, you could use RCall.jl to read the data in R and pull it into Julia as a DataFrame.


#6

As others have mentioned, please file an issue if you notice performance issues. I’m always eager to improve things.


#7

Here’s another alternative:

I happened to have cooked up a CSV reader this past weekend. The goal is to read small/medium sized files into DataFrame more quickly. It’s very WIP but somewhat functional.


#8

To be clear, neither CSV.jl nor DelimitedFiles require specifying the column names manually.


#9

Hello!

Thank you all for your responses and suggestions. I’m following up with more details about my issues with CSV.read.

Here is a link to an example output file from the stellar evolution code that I mentioned. It has 61 columns and 177 rows of data, not considering all of the header data.

With readtable in DataFrame, which is now depreciated, I was able to read all of the data, and the entire file was imported as a data frame. Here is a screenshot

In contrast, with read in CSV, it takes a noticeably longer time to read the same file, and it doesn’t do so correctly in either Julia 0.7 or Julia 1.0 Here is a screenshot, where the file is read as 177 rows and 141 columns rather than the 61 actual columns. There are also a lot of missing values, which are not actually missing in the file, nor were noted with DataFrame.readtable.

I’ll note that I made datarow here 7 rather than skipstart=5 in DataFrame.readtable above because they are defined slightly differently.

I used validate in CSV to get detailed information about why CSV.read failed, and was met with the following error concerning too many columns.

So going back to why I started this thread: How should I efficiently import a file with a large number of columns now that DataFrames.readtable has been depreciated? Is there a way to make CSV.read work for my data files? Is there a different package I should use?


#10

That’s interesting. It works well in R with data.table, and it reads the file instantly. You might try this with RCall.jl.

using RCall
R> library(data.table)
R> out = fread("history.data", header = TRUE, skip = "model_number")
julia> out = @rget out
177×61 DataFrames.DataFrame. Omitted printing of 56 columns
│ Row │ model_number │ num_zones │ star_age   │ log_dt    │ star_mass │
├─────┼──────────────┼───────────┼────────────┼───────────┼───────────┤
│ 1   │ 1            │ 1446      │ 0.00861171 │ -2.06491  │ 1.29999   │
│ 2   │ 5            │ 1437      │ 0.0640849  │ -1.74819  │ 1.29999   │
│ 3   │ 10           │ 1442      │ 0.223549   │ -1.35228  │ 1.29999   │
│ 4   │ 15           │ 1424      │ 0.620346   │ -0.956373 │ 1.29999   │
│ 5   │ 20           │ 1406      │ 1.53212    │ -0.676248 │ 1.29999   │
│ 6   │ 25           │ 1356      │ 3.27152    │ -0.3956   │ 1.29999   │
│ 7   │ 30           │ 1326      │ 6.4055     │ -0.114577 │ 1.29999   │
│ 8   │ 35           │ 1283      │ 12.2081    │ 0.143662  │ 1.29999   │


#11

I think your problem is that your history.data file is nastily formatted. There’s an irregular header of 5 rows, as you note. More importantly, the separator character is not a comma but instead whitespace of variable length! Yikes!

I replaced all horizontal whitespace (regex /h+, though I don’t know regex and that was just from a quick search, perhaps there’s a better way) with the , character. Then doing CSV.read("/path/to/history.data"; header = 6) works fine and takes 0.05 seconds.

Perhaps CSV.jl could auto-detect the separator or easily support variable-length whitespace separators, though I assume that’d come at a speed penalty.


#12

However R works flawlessly! so thats the standard we have to use.

OP you should file an issue because I can also replicate. I could imagine some sort of delim = '\s' option for data that looks like yours.


#13

CSV.read has gained a new skiprepeated keyword argument recently, but it’s not yet in a released version (see this PR).

Also note that even if readtable is deprecated, it won’t be removed from DataFrames until all of these issues are fixed, so you can keep using it for now.


#14

Thanks everyone for your time!

@evanfields I realize that the file has a unideal format :frowning: I wouldn’t want to change the formatting because:

  • Since I have several files formatted the same way, it would be time intensive to add commas like you did.
  • More importantly: The stellar evolution code both writes and reads files with this formatting, so if I did change the file formatting, I couldn’t use them in the code again

I would say that astrophysicists are clumsy coders, but one of the writers of the stellar evolution code is a real computer scientist who developed the PDF with Adobe.

@nalimilan When I try to use readtable with Julia 1.0, not only do I get a depreciation warning, but also a Method Error. With Julia 0.6.4, I see the depreciation warning, but not the Method Error warning, so it does actually work. Is this a version difference? I have DataFrames 0.11.6 with Julia 0.6.4 and DataFrames 0.13.1 with Julia 1.0

@aaowens & @genauguy Thanks for the notes about using R. I haven’t used RCall in Julia before. When I tried to call data.table, I get this error - I feel like I must be missing something basic about RCall


I’ll add that these are the current versions of the packages that I’m using with Julia 1.0


#15

You need to install the data.table package in R. Try install.packages("data.table") in R


#16

Thank you @aaowens!!!


#17

Good catch, we missed a deprecation. I’ve filed a PR.

FWIW, better use Julia 0.7 for some time as it prints deprecation warnings in many cases where Julia 1.0 just errors. Packages haven’t all been completely ported yet.


#18

CSVFiles.jl reads the file properly:

using CSVFiles

load("history.data.csv", spacedelim=true, skiplines_begin=5)

Note that I renamed the file extension here, so that load automatically picks the right format. If you don’t want to do that, you can also do this:

using CSVFiles, FileIO

load(File(format"CSV", "history.data"), spacedelim=true, skiplines_begin=5)

If you want the data in a DataFrame, just pipe it into one (or any other tabular data structure):

using CSVFiles, DataFrames

df = load("history.data.csv", spacedelim=true, skiplines_begin=5) |> DataFrame