[ANN] CSV.jl 0.9 Release

Hello all,

We’re pleased to present the 0.9 release of CSV.jl.

Highlights include:

  • Big internals refactoring to simplify code, make maintenance/future feature work easier, and make current features a little more robust in terms of multithreading/performance
  • Built-in parsing support for two new custom string types: InlineStrings and PosLenStrings. InlineStrings are fixed-width string types defined as primitive types, which allows values to be stored inline in a Vector and enables various processing efficiencies from that representation (though they may end up taking more space when the lengths of a column’s strings vary widely). PosLenStrings are a new “lazy string” representation: a reference to the original csv input is kept, and each PosLenString simply points to its range of bytes in the source. Depending on use-case/workload, these types offer more flexibility to avoid the excessive allocations that regular Strings can incur on very large files
  • Cleanup, review, and simplification of keyword arguments supported by CSV.File/CSV.Rows
  • Automatic support for reading gzipped inputs, with the ability for decompression to be done to a temporary file (the default) or in memory (by passing buffer_in_memory=true), as well as support for writing gzip compressed outputs in CSV.write by passing compress=true
  • Big overhaul of the CSV.jl docs, including a massively updated “intro” section that walks through all the various APIs provided by the package (turns out people have a hard time knowing about really useful things when they aren’t well documented!); big thanks to all those who contributed feedback (and code) to help improve overall documentation
  • New functionality to pass a Vector of inputs (be they file names, IO objects, byte vectors, etc.) to CSV.File and they’ll all be vertically concatenated and returned as a single CSV.File object (currently has strict requirements on matching schemas from all inputs, but this will be relaxed in the future)
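The new string types above can be selected when reading; a minimal sketch (the `stringtype` keyword and the file name `data.csv` are assumptions for illustration, and CSV.jl/WeakRefStrings.jl must be installed):

```julia
using CSV
using WeakRefStrings  # provides PosLenString

# Default reading: short, low-variance string columns may come back as
# InlineStrings (e.g. String7, String15), stored inline in the Vector.
f = CSV.File("data.csv")

# Assumed keyword: opt into lazy PosLenString columns, where each value
# just points to its byte range in the original input buffer,
# avoiding a per-cell String allocation.
g = CSV.File("data.csv"; stringtype = PosLenString)
```

Note that PosLenString columns keep the whole input buffer alive, so materialize them (e.g. via `String.(col)`) if you need the data to outlive the file.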
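The gzip support described above needs no extra setup; a short sketch using the keywords from the release notes (`data.csv.gz`/`out.csv.gz` are illustrative names):

```julia
using CSV

# Reading a gzipped input: decompression goes to a temporary file by
# default; pass buffer_in_memory=true to decompress in memory instead.
f = CSV.File("data.csv.gz"; buffer_in_memory = true)

# Writing gzip-compressed output:
CSV.write("out.csv.gz", f; compress = true)
```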
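And the multiple-inputs feature can be sketched like so (file names are illustrative; all inputs must currently share matching column names and types):

```julia
using CSV

# Each input is parsed on its own thread, then the per-input columns are
# lazily chained together and returned as a single CSV.File.
files = ["part1.csv", "part2.csv", "part3.csv"]
f = CSV.File(files)
```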

This shouldn’t be a breaking release at all, but several deprecations were introduced as part of the keyword argument cleanup. Some additional restrictions/type constraints were also placed on various arguments, so if you notice things that worked before no longer working, please open an issue and we can figure out how to fix or support them.

Why isn’t this the 1.0 release??

Good question: we’re really close. We wanted to provide one more release with the deprecations in place to allow a transition period. Given the big internals refactoring, new string support, and the overall amount of work that went into this release, we also just want to let bugs shake out a bit, iron out the wrinkles, and then put out a more polished 1.0 release in the very near future. It’s currently planned as a breaking release only in that the deprecated keyword arguments will be removed, barring any unexpected issues.

Thanks again to all those willing to provide feedback, file issues, or just express appreciation at the repo or in the #data Slack channel. We look forward to hearing from you!

-Jacob Quinn & JuliaData maintainers


[Note there was briefly an issue with the newest docs deploying to csv.juliadata.org/stable, so if you went there and the docs didn’t look updated, you were right! All should be fixed now]


@quinnj
As a side note, a new SQLite.jl release should be made to support WeakRefStrings.jl version 1.0, so that it doesn’t block updating CSV.jl.

Thanks for another great release!

Since you’re in the business of polishing the keywords :smile: : downcast seems like a weird name to me (I think it normally means something else, e.g. converting from Integer to Int32 rather than Int64 to Int32) and the name doesn’t really convey the point (it emphasizes “more specialized” type, rather than “smaller”). Maybe shrinktypes or just shrink?

Is this multithreaded or singlethreaded?

EDIT: never mind. I’ve found here:

When a Vector of inputs is provided, the column names and types of each separate file/input must match in order to be vertically concatenated. A separate thread will be used to parse each input, with each thread parsing its input single-threaded. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread’s columns.