I was kinda waiting for CSV.jl to get as good as it is now before starting DiskFrame.jl, a sister project to R's {disk.frame} (diskframe.com; I am the author). It's a disk-based data manipulation system. It's only on my laptop atm.
I waited for CSV.jl to get good because the entry point to these large-data systems is almost always loading some CSVs, and if that isn't done right, frustration rises and adoption suffers.
At various points I had high hopes for JuliaDB.jl, but it never really worked for me. I tried to submit some issues, but didn't get much response. Most of the issues were to do with reading the Fannie Mae data, which I recently found out now has a direct download link thanks to Rapids.ai; see https://docs.rapids.ai/datasets/mortgage-data
I count about 5 outstanding issues I have submitted, so I guess there isn't much active development and maintenance.
As a rule, really strong CSV import capability is necessary for software designed to deal with largish data volumes, so before the release of CSV.jl 0.5.14 it would have been hard to get much traction. TextParse.jl is competent, but it isn't as battle-hardened and so can "fail" on data with lots of edge cases, like the Fannie Mae data. Also, it took me a while to find out that JuliaDB doesn't support rechunking CSVs once read: the CSVs need to have been split into the desired number of chunks before loading. To me, this is the aspect of usability that should receive the most attention once development picks up on JuliaDB again, because CSVs are the most common entry point to a package like JuliaDB.jl, and if that doesn't work well for a wide selection of CSVs then it's hard to get traction on adoption.
I hope that writing down my perspective here is useful.
Can I ask everyone: what do you see yourself using JuliaDB for?