ANN: uCSV.jl

uCSV.jl is a µ-sized package for working with delimited text. The default behavior is similar to readcsv from Base, but it is extensible enough to cover just about everything you’d expect from more established parsers in other languages. It supports Julia 0.6 (current), and its only dependency is Nulls.jl. The package can be found here and the documentation here.

I wrote uCSV.jl because I routinely hit the limitations of existing parsers and didn’t have the time or expertise to understand their code-bases well enough to extend them. I wanted something that was smaller, stuck to functions from base Julia for robustness, and, when it didn’t work, gave back detailed error messages explaining how to fix the problem. I think it will appeal to anyone who has trouble with existing packages and/or those using the readtable and writetable functions in the latest release of DataFrames (which are scheduled for deprecation and removal in the near future).

It does not support other missing data formats (DataArrays & NA, NullableArrays and Nullables, or DataValueArrays and DataValues) out-of-the-box, but if anyone would like to use it with those formats and has trouble doing so, please file an issue and I’d be happy to help. Additionally, if anyone has any general parsing problems, questions, or suggestions for improvement, please open an issue. The package is currently at 100% code coverage and is tested against a diverse set of >75 delimited-text files (the most complete testing suite I’ve found for any parser, regardless of language), hand-curated for ugliness. I plan to extend the tests to cover a curated list of additional ugly datasets from RDatasets in the coming days/weeks to ensure I haven’t failed to account for anything. Everything that it can’t handle (that I’m aware of) is documented in the manual, along with suggested resolutions.

To place uCSV.jl in context, existing CSV parsing packages include CSV.jl and TextParse.jl, both of which are actively developed and very capable. In the medium-to-long term (think Julia 1.0 release timeline) I aim to explore how well uCSV.jl can complement these tools, rather than compete with them. More specifically, I see the primary strength of CSV.jl as its tight connection with the DataStreams.jl ecosystem for streaming and converting table-like data between formats, and I see the primary strength of TextParse.jl as its very memory-efficient generated parsers. I hope uCSV.jl can connect these two frameworks (DataStreams and TextParse’s generated parsers) to make both of them more powerful and accessible for the community, rather than compete with them through the CSV parsing APIs they both currently offer. If anyone has any pointers, advice, or interest in helping with this, feel free to open issues and PRs!

I’d also like to give a shoutout to the JuliaData team for their mentorship over the past several months. Without it, this package wouldn’t exist. This package is also indebted to the user communities of JuliaStats and JuliaData, as the issues everyone has opened regarding other parsers served as the initial testing suite used to build this package from the ground up.

Happy parsing,
Cameron


Pretty cool. Can it read CSV in chunks like the chunked package in R?

Hi @xiaodai. Currently, no. I’m not sure which chunked package from R you are referring to, but uCSV is a pretty simple implementation. It can’t do other tricks like memory-mapping the file on disk, either (yet). Any help implementing these would be very welcome, but if you need features like chunked ingest in the immediate future, I’d recommend checking out the other CSV parsers, which may be a better fit for your needs.

@cjprybol Thanks. See the R package chunked, which is based on the LaF package.

I understand that uCSV doesn’t currently have the capability to read chunks. But being able to read a file in chunks, so that each chunk can be processed and then output before reading the next chunk, is essential for managing large datasets on disk. This is because loading the whole dataset into memory isn’t an option yet for most people. So they either invest in SAS or get into Big Data solutions, when a chunking solution would suffice.

The world economy needs a chunking solution that works without big servers. As PCs and laptops get more and more powerful, it’s more and more viable to perform medium data (< Big data) analysis on laptops if only there is a widely used chunking solution!

There’s a lot of functionality being built for things like this in Julia. Using CSV and Query it should be straightforward to implement an operation that reads CSV files line by line, applies some processing, and saves the result without loading everything into memory. And the JuliaDB.jl package seems focussed on exactly the type of functionality you request.
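
To make the read-process-write idea concrete, here is a rough, language-agnostic sketch. It is written in Python with only the standard library, purely for illustration; it does not use the CSV.jl, Query, or JuliaDB APIs, and the names process_in_chunks, double_value, and chunk_size are hypothetical. It assumes the file has a header row and that each row can be processed independently of the others.

```python
import csv
import io
from itertools import islice


def process_in_chunks(in_file, out_file, transform, chunk_size=1000):
    """Stream rows from in_file to out_file, holding at most chunk_size rows in memory.

    `transform` is a hypothetical user-supplied function that takes a list of
    rows (each a list of strings) and returns the list of rows to write.
    """
    reader = csv.reader(in_file)
    writer = csv.writer(out_file, lineterminator="\n")

    header = next(reader, None)
    if header is None:  # empty input: nothing to do
        return 0
    writer.writerow(header)

    total = 0
    while True:
        # Pull the next chunk_size rows; earlier chunks are no longer referenced.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        writer.writerows(transform(chunk))
        total += len(chunk)
    return total


def double_value(chunk):
    # Example per-chunk processing: double the integer in the second column.
    return [[name, str(int(value) * 2)] for name, value in chunk]


# In-memory stand-ins for files on disk, so the sketch runs as-is.
source = io.StringIO("name,value\na,1\nb,2\nc,3\n")
sink = io.StringIO()
rows_read = process_in_chunks(source, sink, double_value, chunk_size=2)
print(rows_read)
print(sink.getvalue())
```

With real data, source and sink would be file handles opened on disk, so peak memory depends on chunk_size rather than on the size of the file. Operations that need to see every row at once (a global sort, for example) do not fit this pattern without extra work.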
