I’m pleased to announce a new release of the CSV.jl package.
This release brings improvements in several areas, including performance, new features, and greater flexibility.
Notable improvements:
“Perfect” column typing: gone are the days of rows_for_type_detect and parsing getting messed up after 10K rows. CSV.jl now gets column types right every time, and without needing to restart parsing.
Auto delimiter detection: don’t worry about keeping track of which file has which delimiter; CSV.jl will figure it out for you!
Better automatic handling of invalid files: invalid values? Wrong number of values on a row? CSV.jl will handle such files gracefully, printing helpful messages about anything unexpected it runs into.
Improved performance: great care has been taken to improve performance on several levels: faster underlying type parsers (provided by Parsers.jl), better data locality and cache friendliness, and greater use of custom Julia structures for efficiency.
Enhanced APIs for the CSV.File type: in addition to allowing iteration over rows directly, it now provides getproperty to access efficient read-only columns of the underlying data. If mutable columns are needed, you can copy(col) or use CSV.read(file; copycols=true).
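The CSV.File behavior described in the last bullet can be sketched as follows. This is a minimal example, not taken from the release notes; the file name and the column name `a` are placeholders:

```julia
using CSV  # requires the CSV.jl package

# Parse the file once; column types, delimiter, etc. are detected automatically.
f = CSV.File("data.csv")

# Iterate over rows directly:
for row in f
    println(row.a)      # access a field by column name
end

# getproperty on the file itself gives an efficient read-only column:
col = f.a

# If a mutable column is needed, copy it...
mcol = copy(col)

# ...or materialize mutable columns up front:
df = CSV.read("data.csv"; copycols=true)
```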
These improvements come in addition to many smaller bugfixes and quality-of-life enhancements. Great effort has gone into ensuring CSV.jl provides a rich set of features, comparable to or better than other world-class CSV parsers (see the feature comparison table below!).
As always, please open issues as you run into bugs or performance issues and we’ll try to address things as quickly as possible. Cheers!
It would be very useful if you (or someone who participated in this optimization) could write up a short blog post or a summary that links the relevant PRs and discusses what led to the practical improvements. Learning from an actual use case that was optimized by experts is always instructive.
Immediately noticed the following parse behaviour change:
428.E+03 is now parsed as String by default; it used to be parsed as Float64 with the previous version of CSV.jl.
If I try to force it with types=Dict(Symbol(" S-Mises")=>Float64) for the column in question, I get warnings/errors like
warnings: error parsing Float64 on row = 193959, col = 13: " 428.E+03,", error=INVALID: OK | SENTINEL | DELIMITED | INVALID_DELIMITER
on all lines that are missing a digit between the . and the E in scientific notation. If I hand-edit the column value in question in the source file to e.g. 428.0E+03, the warning/error goes away.
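As a sanity check (not part of the original report): Base Julia's own float parser accepts the digitless-fraction form, so the value itself is well-formed scientific notation:

```julia
# "428.E+03" has no digit between the '.' and the 'E', but it is still
# a valid float literal for Base's parser:
v = tryparse(Float64, "428.E+03")
# v == 428000.0
```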
The key trade-off arises if you want to work with DataFrames.jl later: whether to materialize a DataFrame using DataFrame or DataFrame!, or equivalently whether to use the copycols=true or copycols=false keyword argument in CSV.read. (I am not trying to give a comprehensive answer to what @Tamas_Papp wants, since I did not develop the change, but this is my main consideration when using the current design from a DataFrames.jl perspective; maybe @quinnj can comment more.)
I will use the following DataFrame with 10^7 rows as an example (assuming it is written to disk as testfile.txt).
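The original post's DataFrame definition is not preserved in this thread; as a hypothetical stand-in, any 10^7-row table written to testfile.txt works for the comparison below (the column names and contents here are made up):

```julia
using DataFrames, CSV  # requires DataFrames.jl and CSV.jl

# Hypothetical stand-in for the 10^7-row example table:
df = DataFrame(x = rand(10^7), g = rand(1:100, 10^7))
CSV.write("testfile.txt", df)
```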
if you read in the data and do only a few operations on it, use DataFrame! (i.e. copycols=false)
if you read in the data and do a lot of different operations on it, DataFrame (i.e. copycols=true) will probably be faster
Of course, another consideration is that DataFrame! returns a read-only DataFrame, so if you want to mutate the vectors you have read in, you have to use the DataFrame constructor.
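The two options above can be sketched side by side (a hedged example assuming CSV.jl with DataFrames.jl, reading the testfile.txt from the example):

```julia
using CSV, DataFrames

# Read-only wrapper around the parsed columns — cheap, no copying:
df_ro = DataFrame!(CSV.File("testfile.txt"))   # no-copy constructor
df_ro2 = CSV.read("testfile.txt"; copycols=false)  # equivalent

# Fully mutable DataFrame — pays for one copy of every column:
df_mut = DataFrame(CSV.File("testfile.txt"))
df_mut2 = CSV.read("testfile.txt"; copycols=true)  # equivalent
```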
Thanks for reporting! Would you mind opening an issue on the CSV.jl or Parsers.jl repo? This must have regressed with all the new work that’s gone in, should be a simple fix.
CSV.jl defines CSV.getcell(f::CSV.File, T, col, row) which would be inferrable for individual values. It also doesn’t require iteration. You can get the types for a file by doing CSV.gettypes(f).
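Taking the signatures above at face value, usage would look like this (the file name and the column/row indices are placeholders):

```julia
using CSV

f = CSV.File("data.csv")

# Inferrable access to a single value, no iteration required
# (arguments as described above: file, expected type, column, row):
x = CSV.getcell(f, Float64, 3, 10)

# Element types of every column:
ts = CSV.gettypes(f)
```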
@quinnj Great update so far, really enjoying using it! Just one small thing I ran into: when my file ends with a line containing only a comment, a row of missings gets added at the end.
MWE:
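The MWE itself did not survive in this thread; here is a hypothetical reconstruction based on the description above:

```julia
using CSV

# A file whose final line is only a comment:
data = """
a,b
1,2
# trailing comment
"""

f = CSV.File(IOBuffer(data); comment="#")
# With the reported bug, `f` ends with an extra row of `missing`s.
```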
Just an update here for those interested: CSV.jl now has multithreaded parsing support on current #master branch, based on the new multithreading support in Julia version 1.3. I wrote up a quick blogpost with some benchmarking comparisons vs. R’s fread and pandas. I’m happy to report that CSV.jl has built upon previous performance efforts and the multithreaded case is competitive with fread. It’s a great time for csv reading in Julia! Check out the blogpost here: Everyone’s Favorite Blogpost: CSV Benchmarks – Traitement de Données.
(also, it looks like I can’t edit the original feature comparison table in the post above, so you’ll have to visualize for yourself that CSV.jl now has a check in the “Multi-threaded parsing” box).
Which parts do you like? I am a big data.table user.
df[,.N, by1] is definitely handy, but I sometimes prefer dplyr for easier-to-read code. DataFramesMeta.jl is the closest thing. Avoid Query.jl if you want fast group-by performance.