I’ve been using Julia to work with output from a stellar evolution code. My data files have several columns (~60 or more) and hundreds of rows. In previous versions of Julia, I used readtable in the DataFrames package to work with my data. I really appreciated that readtable read not only the numerical data but also the headers, which I could then refer to by name. For example: if I read file X as
data = readtable(X, skipstart=5, separator=' ')
and one of the column headers is “star_age”, then I could operate on, plot with, etc. that column with
Since readtable has been deprecated, it is not clear to me what the best package to use is. It’s been previously noted that read in the CSV package is very slow. I’ve also found that CSV.read and DelimitedFiles.readdlm seem to require specifying the header names? Since my files have ~60 columns, I’d rather not do that.
The previous thread on CSV.read has others who experience it taking nearly an hour to read files - this has been my experience, although I haven’t timed it.
With the header option as true in DelimitedFiles.readdlm, I get the headers as an array separate from the numbers. However DataFrames.readtable seemed to make the headers into structures, so I didn’t need to manually specify the header names.
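To illustrate the readdlm behaviour described above, here is a minimal sketch of pairing the separate header array with the numeric columns so they can be looked up by name. It assumes a whitespace-separated file with a few preamble lines before the header row. The sample contents, the two-line preamble, and the names `vals`, `hdr`, and `columns` are made up for illustration and are not taken from the actual history.data file.

```julia
using DelimitedFiles

# Hypothetical sample: two preamble lines, a header row, then numeric rows
# separated by runs of spaces.
sample = """
preamble line 1
preamble line 2
model_number   star_age      log_L
1              1.0e5         0.12
2              2.5e5         0.34
"""

path, io = mktemp()
write(io, sample)
close(io)

# With no delimiter argument, readdlm splits on one or more whitespace characters.
# header=true returns the header row separately from the data.
vals, hdr = readdlm(path; skipstart=2, header=true)

# Map each header name to its column of values.
columns = Dict(Symbol(name) => vals[:, i] for (i, name) in enumerate(vec(hdr)))

star_age = columns[:star_age]
```

With this, `columns[:star_age]` plays roughly the role that the named column did with readtable, without typing out the ~60 header names by hand.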
Unless there are ways to make CSV faster and to have either package recognize the headers so they’re easy to work with, I’m not sure what the most efficient way to proceed is.
If you have problems with CSV.jl performance please report an issue - this for sure can be fixed.
If the options I have given really do not work for you for some reason, try https://github.com/bkamins/Nanocsv.jl. It is not as fancy as other packages but was designed to load standard CSVs directly into a DataFrame.
In contrast, with CSV.read, it takes noticeably longer to read the same file, and it doesn’t do so correctly in either Julia 0.7 or Julia 1.0. Here is a screenshot, where the file is read as 177 rows and 141 columns rather than the actual 61 columns. There are also a lot of missing values, which are not actually missing in the file and were not reported by DataFrames.readtable.
So going back to why I started this thread: How should I efficiently import a file with a large number of columns now that DataFrames.readtable has been deprecated? Is there a way to make CSV.read work for my data files? Is there a different package I should use?
I think your problem is that your history.data file is nastily formatted. There’s an irregular header of 5 rows, as you note. More importantly, the separator character is not a comma but instead whitespace of variable length! Yikes!
I replaced all horizontal whitespace (regex \h+, though I don’t know regex and that was just from a quick search; perhaps there’s a better way) with the , character. Then doing CSV.read("/path/to/history.data"; header = 6) works fine and takes 0.05 seconds.
Perhaps CSV.jl could auto-detect the separator or easily support variable-length whitespace separators, though I assume that’d come at a speed penalty.
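For reference, the whitespace-to-comma replacement could also be done from Julia itself, writing to a separate copy so the original file stays readable by the code that produced it. This is only a sketch: `to_csv_line` and `write_csv_copy` are hypothetical helper names, and the pattern `[ \t]+` is used here as a plain stand-in for "runs of spaces or tabs".

```julia
# Turn one whitespace-delimited line into a comma-delimited line.
# strip removes leading/trailing whitespace so no empty first or last field appears.
to_csv_line(line) = replace(strip(line), r"[ \t]+" => ",")

# Hypothetical helper: write a comma-separated copy of `src` to `dest`,
# dropping the first `skip` lines (for example an irregular preamble).
# The original file is never modified.
function write_csv_copy(src::AbstractString, dest::AbstractString; skip::Integer=0)
    open(dest, "w") do out
        for (i, line) in enumerate(eachline(src))
            i <= skip && continue
            println(out, to_csv_line(line))
        end
    end
    return dest
end
```

Looping `write_csv_copy` over a list of files would convert several at once. Note that if the preamble is dropped via `skip`, the header row becomes the first line of the copy rather than the sixth.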
@evanfields I realize that the file has a non-ideal format. I wouldn’t want to change the formatting because:
Since I have several files formatted the same way, it would be time-intensive to add commas like you did.
More importantly: The stellar evolution code both writes and reads files with this formatting, so if I did change the file formatting, I couldn’t use them in the code again.
I would say that astrophysicists are clumsy coders, but one of the writers of the stellar evolution code is a real computer scientist who developed the PDF with Adobe.
@nalimilan When I try to use readtable with Julia 1.0, not only do I get a deprecation warning, but also a MethodError. With Julia 0.6.4, I see the deprecation warning, but not the MethodError, so it does actually work. Is this a version difference? I have DataFrames 0.11.6 with Julia 0.6.4 and DataFrames 0.13.1 with Julia 1.0.
@aaowens & @pdeffebach Thanks for the notes about using R. I haven’t used RCall in Julia before. When I tried to call data.table, I get this error. I feel like I must be missing something basic about RCall.
I’ll add that these are the current versions of the packages that I’m using with Julia 1.0