Why do you use JuliaDB?

I was kinda waiting for CSV.jl to get as good as it is now before starting DiskFrame.jl, which is a sister project to R’s diskframe.com (I am the author). It’s a disk-based data manipulation system. It’s on my laptop atm.

I waited for CSV.jl to get good because the entry point to these large data systems is almost always loading some CSVs, and if that isn’t done right then adoption and frustration levels can become issues.

At various points I had high hopes for JuliaDB.jl, but it never really worked for me. I submitted some issues but didn’t get much response. Most of the issues had to do with reading the Fannie Mae data, which I recently found out now has a direct download link thanks to Rapids.ai; see https://docs.rapids.ai/datasets/mortgage-data

I count about 5 outstanding issues that I have submitted, so I guess there isn’t much active development and maintenance.

Software designed to deal with largish data volumes needs really strong CSV import capabilities. So before the release of CSV.jl 0.5.14 it would have been hard to get lots of traction. TextParse.jl is competent, but it isn’t as battle-hardened and so can “fail” on data with lots of edge cases, like the Fannie Mae data. Also, it took me a while to find out that JuliaDB doesn’t support rechunking CSVs once read, and that the CSV needs to have been split into the desired number of chunks before reading. To me, this is the aspect of usability that should receive the most attention once development picks up on JuliaDB again, because CSVs are the most common entry point to a package like JuliaDB.jl, and if that doesn’t work well for a wide selection of CSVs then it’s hard to get traction on adoption.
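For what it’s worth, the pre-chunking step can be sketched with nothing but Julia’s standard library. This is only a rough illustration: `split_csv` is a hypothetical helper name (not part of JuliaDB or CSV.jl), and it assumes one record per line, so it would break on quoted fields that contain newlines.

```julia
# Hypothetical helper: split one CSV into (at most) `nchunks` files,
# repeating the header line at the top of each chunk.
# Assumes one record per line (no newlines inside quoted fields).
function split_csv(path::AbstractString, outdir::AbstractString, nchunks::Integer)
    nchunks >= 1 || throw(ArgumentError("nchunks must be at least 1"))
    nrows = countlines(path) - 1           # first pass: data rows, excluding the header
    nrows > 0 || return String[]
    per_chunk = cld(nrows, nchunks)        # rows per chunk, rounded up
    mkpath(outdir)
    outfiles = String[]
    open(path) do src                      # second pass: stream rows into chunk files
        header = readline(src)
        for (i, part) in enumerate(Iterators.partition(eachline(src), per_chunk))
            out = joinpath(outdir, "chunk_$(i).csv")
            open(out, "w") do io
                println(io, header)
                foreach(row -> println(io, row), part)
            end
            push!(outfiles, out)
        end
    end
    return outfiles                        # fewer than `nchunks` files if rows are scarce
end
```

The returned list of file names can then be handed to whichever loader you use, so the chunk count is fixed before anything is read.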

I hope that by writing down my perspectives.

Can I ask everyone, what do you see yourself using JuliaDB for?

8 Likes

For rechunking CSVs automatically, you could attempt to finish my PR (itself building on great work by others), which would achieve what you want: https://github.com/JuliaComputing/JuliaDB.jl/pull/288. Adding a CSV.jl frontend to JuliaDB is also a great idea, IMO.

The main issue I have with JuliaDB is the huge compilation time for reading wide CSV tables with missing values. The usage of StructArrays in IndexedTables seems to be the cause of this behavior, and I feel that it would be well worth it to change to another backing data structure which doesn’t encode the column types in the final table type, at least as the default for interactive usage.

3 Likes

I might be alone on this, but for me JuliaDB was mostly about the excellent API. It was very easy for me to manipulate my data with all their functions…

I was always curious: are you able to share the data? How did you ingest it?

Not sure if this is useful, but JuliaDB has a basic “untyped table” type, ColDict. It is used to implement operations that mutate columns efficiently. A possible approach would be to load things as a ColDict and only use a fully typed version when a small subset of columns is needed for an operation (which happens with the select keyword in IndexedTables or with macro magic in JuliaDBMeta).

There was even some discussion about using a DataFrame instead of ColDict (so that it would be a thing that can be manipulated more easily by the user), which, now that DataFrames is more lightweight, seems feasible. This would need IndexedTables to drop the “nameless table” (where columns are just numbered) and DataFrames to implement a notion of primary keys.

:100:

1 Like

I can’t really (it’s not my data, and it’s research data). But it’s basically a bunch of tables in csv files.

I loaded, joined, grouped, and mapped (not sure if that’s what you meant).

OK, so they’re in CSV format, and I assume each individual file is small, maybe no more than 200 MB, and also there isn’t much missing data and the number of columns is small? This seems to be the sweet spot for JuliaDB. To this day, I still can’t load the Fannie Mae data into JuliaDB. Also, TextParse.jl can’t deal with the CSVs that came from Fannie Mae as well as CSV.jl can.

All true (though there are missing values).

I would really like something that accepted basically any function from filename (or IO) and returned a Table.
E.g. I had something that was effectively 31GB of JSON records, 1 per line, broken up into about 64 files.
And I have code that would load any of these using JSON.jl and would return a RowTable
(and now JSONTables.jl is a thing)
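Roughly the shape I have in mind, as a sketch only: `load_files` and `load_scores` are hypothetical names, not anything JuliaDB provides, and a row table here is just a Vector of NamedTuples. The toy loader parses a made-up `name,score` text format to keep the example dependency-free; a JSON.jl-based loader would slot in the same way.

```julia
# Hypothetical helper: `loader` is ANY function mapping a filename to a row
# table (here, a Vector of NamedTuples). Assumes at least one file is given.
function load_files(loader, filenames)
    tables = map(loader, filenames)    # one row table per file
    return reduce(vcat, tables)        # concatenate into a single row table
end

# Toy loader for a made-up format: one "name,score" record per line, no header.
function load_scores(filename)
    map(eachline(filename)) do line
        name, score = split(line, ',')
        (name = String(name), score = parse(Int, score))
    end
end
```

Calling `load_files(load_scores, files)` then gives one combined row table, and the per-file `map` is the natural place to parallelise.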

So, I heavily use JuliaDB.jl/IndexedTables.jl, primarily because I need to run a lot of groupby operations in parallel. I find that vanilla groupby and joins on DataFrames routinely fail on multiple processors, but if I change my data structure to an IndexedTable and use the IndexedTable versions of groupby and join, things work fine.

I really like the API of IndexedTables too, and the tutorial is easy to understand and build on. Oh, and I also use a lot of macros from JuliaDBMeta.