I’ve been using Julia to work with output from a stellar evolution code. My data files have many columns (~60 or more) and hundreds of rows. In previous versions of Julia, I used readtable in the DataFrames package to work with my data. I really appreciated that readtable read not only the numerical data but also the headers, which I could then refer to by name. For example, if I read file X as
data = readtable(X, skipstart=5, separator=' ')
and one of the column headers is “star_age”, then I could operate on, plot with, etc. that column with
data[:star_age]
Since readtable has been deprecated, it is not clear to me what the best package to use is. It’s been previously noted that read in the CSV package is very slow. It also seems that CSV.read and DelimitedFiles.readdlm require specifying the header names. Since my files have ~60 columns, I’d rather not do that.
The previous thread on CSV.read has others who experience it taking nearly an hour to read files - this has been my experience, although I haven’t timed it.
With the header option as true in DelimitedFiles.readdlm, I get the headers as an array separate from the numbers. However DataFrames.readtable seemed to make the headers into structures, so I didn’t need to manually specify the header names.
Unless there are ways to make CSV faster and to have either package recognize the headers so they’re easy to work with, I’m not sure what the most efficient way to proceed is.
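On the point about readdlm returning the headers separately: a minimal sketch of one way to get name-based access back using only the standard library. The sample file, its column names, and the variable names (cells, header, columns) are made up for illustration; the assumption is that the real file has 5 preamble lines followed by a header row and whitespace-separated numbers.

```julia
using DelimitedFiles

# Hypothetical miniature of the layout: 5 preamble lines, a header row, then data
# separated by a variable amount of whitespace.
path = tempname()
write(path, """
preamble line 1
preamble line 2
preamble line 3
preamble line 4
preamble line 5
model_number    star_age     log_L
1      1.0e5    0.12
2      2.5e5    0.15
3      4.0e5    0.19
""")

# With no delimiter argument, readdlm splits on runs of whitespace.
# header=true returns the data and the header row as two separate arrays.
cells, header = readdlm(path; skipstart=5, header=true)

# Map each header name (as a Symbol) to its column, so nothing is typed by hand.
columns = Dict(Symbol(name) => cells[:, j] for (j, name) in enumerate(vec(header)))

columns[:star_age]
```

If a DataFrame is preferred over a Dict, the same header array can supply the column names, so the ~60 names never need to be written out manually.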
The thread you linked is somewhat old and seems to have been using Julia 0.6. There might no longer be problems if you use Julia 1.0. Have you tried this?
If you like R and data.table, you could use RCall.jl to read the data in R and pull it into Julia as a DataFrame.
I happened to have cooked up a CSV reader this past weekend. The goal is to read small/medium sized files into DataFrame more quickly. It’s very WIP but somewhat functional.
Thank you all for your responses and suggestions. I’m following up with more details about my issues with CSV.read.
Here is a link to an example output file from the stellar evolution code that I mentioned. It has 61 columns and 177 rows of data, not considering all of the header data.
With readtable in DataFrames, which is now deprecated, I was able to read all of the data, and the entire file was imported as a data frame. Here is a screenshot.
In contrast, with read in CSV, it takes a noticeably longer time to read the same file, and it doesn’t do so correctly in either Julia 0.7 or Julia 1.0. Here is a screenshot, where the file is read as 177 rows and 141 columns rather than the 61 actual columns. There are also a lot of missing values, which are not actually missing in the file and which did not appear with DataFrames.readtable.
So going back to why I started this thread: How should I efficiently import a file with a large number of columns now that DataFrames.readtable has been deprecated? Is there a way to make CSV.read work for my data files? Is there a different package I should use?
I think your problem is that your history.data file is nastily formatted. There’s an irregular header of 5 rows, as you note. More importantly, the separator character is not a comma but instead whitespace of variable length! Yikes!
I replaced all horizontal whitespace (regex \h+, though I don’t know regex and that was just from a quick search, perhaps there’s a better way) with the , character. Then doing CSV.read("/path/to/history.data"; header = 6) works fine and takes 0.05 seconds.
Perhaps CSV.jl could auto-detect the separator or easily support variable-length whitespace separators, though I assume that’d come at a speed penalty.
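The whitespace-to-comma replacement described above can also be done in memory from Julia, which leaves the file on disk untouched. This is only a sketch: the function name whitespace_to_commas, its skip keyword, and the sample file are hypothetical, and it assumes no field contains embedded spaces or commas.

```julia
# Collapse each run of horizontal whitespace into a single comma, in memory.
# `skip` is the number of preamble lines to drop (5 for the layout discussed here).
function whitespace_to_commas(path::AbstractString; skip::Integer=5)
    out = String[]
    for line in readlines(path)[skip+1:end]
        trimmed = String(strip(line))      # drop leading/trailing whitespace
        isempty(trimmed) && continue       # ignore blank lines
        push!(out, replace(trimmed, r"\h+" => ","))
    end
    return join(out, "\n")
end

# Hypothetical miniature of the file: 5 preamble lines, header row, two data rows.
path = tempname()
write(path, "a\nb\nc\nd\ne\n  model_number   star_age\n  1     1.0e5\n  2     2.5e5\n")

csv_text = whitespace_to_commas(path)
println(csv_text)
```

The resulting string can be wrapped in an IOBuffer and handed to a reader that accepts an IO source. Because the preamble is already dropped, the header is then on the first line rather than the sixth.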
CSV.read has gained a new skiprepeated keyword argument recently, but it’s not yet in a released version (see this PR).
Also note that even if readtable is deprecated, it won’t be removed from DataFrames until all of these issues are fixed, so you can keep using it for now.
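For the repeated-delimiter option mentioned above, here is a hedged sketch of what the call might look like once the feature is available. The keyword is referred to above as skiprepeated; in the CSV.jl releases I am familiar with it is spelled ignorerepeated, so treat the exact name, and the two-argument CSV.read(path, DataFrame) form, as assumptions to check against your installed version. The sample file is made up.

```julia
using CSV, DataFrames

# Hypothetical miniature of the layout: 5 preamble lines, a header on line 6,
# then data separated by a variable number of spaces.
path = tempname()
write(path, """
preamble 1
preamble 2
preamble 3
preamble 4
preamble 5
model_number    star_age     log_L
1      1.0e5    0.12
2      2.5e5    0.15
""")

# delim=' ' with ignorerepeated=true treats a run of spaces as one separator
# (keyword name assumed; see the note above). header=6 takes names from line 6.
df = CSV.read(path, DataFrame; delim=' ', ignorerepeated=true, header=6)

df.star_age
```

This reads the file as-is, so it stays usable by the code that produced it.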
@evanfields I realize that the file has a non-ideal format, but I wouldn’t want to change the formatting because:
Since I have several files formatted the same way, it would be time intensive to add commas like you did.
More importantly: The stellar evolution code both writes and reads files with this formatting, so if I did change the file formatting, I couldn’t use them in the code again
I would say that astrophysicists are clumsy coders, but one of the writers of the stellar evolution code is a real computer scientist who developed the PDF with Adobe.
@nalimilan When I try to use readtable with Julia 1.0, I get not only a deprecation warning but also a MethodError. With Julia 0.6.4, I see the deprecation warning but not the MethodError, so it does actually work. Is this a version difference? I have DataFrames 0.11.6 with Julia 0.6.4 and DataFrames 0.13.1 with Julia 1.0.
@aaowens & @pdeffebach Thanks for the notes about using R. I haven’t used RCall in Julia before. When I tried to call data.table, I get this error - I feel like I must be missing something basic about RCall
Good catch, we missed a deprecation. I’ve filed a PR.
FWIW, better use Julia 0.7 for some time as it prints deprecation warnings in many cases where Julia 1.0 just errors. Packages haven’t all been completely ported yet.