I’m pleased to announce a new release of the CSV.jl package.
This release provides notable improvements in several areas, including performance, additional features, and enhanced flexibility.
“Perfect” column typing: gone are the days of rows_for_type_detect and parsing getting messed up after 10K rows. CSV.jl now gets column types right every time, and without needing to restart parsing.
Auto delimiter detection: don’t worry about keeping track of which file has which delimiter; CSV.jl will figure it out for you!
Better automatic handling of invalid files: Invalid values? Wrong number of values on a row? CSV.jl will handle such files gracefully, printing helpful messages about anything unexpected it runs into.
Improved performance: great care has been taken to improve performance on several levels: faster underlying type parsers (provided by Parsers.jl), better data locality and cache friendliness, and greater use of custom Julia structures for efficiency.
Enhanced APIs for the CSV.File type: in addition to allowing iteration over rows directly, it now supports getproperty to access efficient read-only columns of the underlying data. If mutable columns are needed, you can call copy(col) or use CSV.read(file; copycols=true).
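As a rough illustration of the delimiter detection and CSV.File access patterns listed above, here is a minimal sketch. The file contents and variable names are made up for the example, and exact behaviour may differ between CSV.jl versions, so treat it as a starting point rather than a reference.

```julia
using CSV

# Write a small semicolon-delimited file to a temporary path (hypothetical data).
path = tempname()
write(path, "a;b\n1;x\n2;y\n3;z\n")

# No delim keyword is passed here; the delimiter is left for CSV.jl to detect.
f = CSV.File(path)

# Iterate over rows directly; fields are accessed by column name.
total = sum(row.a for row in f)

# Access a whole column through getproperty; treat it as read-only.
col = f.a

# Take a copy when a mutable vector is needed.
mutable_col = copy(col)
mutable_col[1] = 10
```

Copying only the columns you intend to modify keeps the rest of the file in its efficient read-only form.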
These improvements are in addition to many smaller bugfixes and quality-of-life enhancements. Great effort has been taken to ensure CSV.jl provides a rich set of features, comparable to or better than other world-class CSV parsers (see the feature comparison table below!).
As always, please open issues as you run into bugs or performance issues and we’ll try to address things as quickly as possible. Cheers!
It would be very useful if you (or someone who participated in this optimization) could write up a short blog post or a summary that links the relevant PRs and discusses what led to practical improvements. Learning from an actual use case that was optimized by experts is always instructive.
The key trade-off arises if you want to work with DataFrames.jl afterwards: whether you materialize the DataFrame using DataFrame or DataFrame!, or equivalently whether you pass copycols=true or copycols=false to CSV.read. (I am not trying to give the comprehensive answer that @Tamas_Papp is asking for, as I did not develop the change; this is just my main consideration when using the current design from the DataFrames.jl perspective. Maybe @quinnj can comment more.)
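To make the two options concrete, here is a small sketch on a tiny made-up file (not the 10^7-row example). It assumes the copycols keyword is accepted by the DataFrame constructor in your installed DataFrames.jl version, which I believe holds for releases from around this time onward; the DataFrame! spelling and the CSV.read(file; copycols=...) form mentioned above are version-specific, so check them against your own setup.

```julia
using CSV, DataFrames

# Hypothetical three-row file used only for illustration.
path = tempname()
write(path, "a,b\n1,x\n2,y\n3,z\n")

f = CSV.File(path)

# copycols=true: the DataFrame gets its own copies of the columns,
# so they can be mutated freely at the cost of extra memory and time.
df_copy = DataFrame(f; copycols=true)
df_copy.a[1] = 100

# copycols=false: the DataFrame reuses the columns produced by the parser,
# avoiding the copy; treat these columns as read-only.
df_nocopy = DataFrame(f; copycols=false)
```

For large files the no-copy route saves an allocation per column, while the copying route is the safer choice when the data will be modified in place.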
I will use the following DataFrame with 10^7 rows as an example (assuming it is written to disk as testfile.txt).
Just an update here for those interested: CSV.jl now has multithreaded parsing support on current #master branch, based on the new multithreading support in Julia version 1.3. I wrote up a quick blogpost with some benchmarking comparisons vs. R’s fread and pandas. I’m happy to report that CSV.jl has built upon previous performance efforts and the multithreaded case is competitive with fread. It’s a great time for csv reading in Julia! Check out the blogpost here: https://quinnj.home.blog/2019/08/24/everyones-favorite-blogpost-csv-benchmarks/.
(Also, it looks like I can’t edit the original feature comparison table in the post above, so you’ll have to visualize for yourself that CSV.jl now has a check in the “Multi-threaded parsing” box.)