Pulling array data from IndexedTables or JuliaDB


#1

Typically, to do anything useful with tabular data it has to be converted to a plain-old array, often of floats. And, far worse, you usually have to put the output of whatever you did to that array back into the data as an additional column. Indeed, I never cease to be amazed at how many obstacles there are to this in sources as diverse as CSVs, SQL and Parquet. One thing I really love about DataTables and DataFrames is that, since they are simple containers of named vectors, it is relatively easy to get them into this form. They also have a universal index that is independent of the data they contain, namely the vector indices pointing to the data.

I’ve been looking at JuliaDB and IndexedTables this morning, and I’m wondering where the path is to numerical data. As far as I can tell so far, the best option is to do some manipulations to the DTable, use gather to get it to an IndexedTable, then use where to create iterators (which, by the way, don’t have indexing defined on them). These iterate over named tuples, so there’s a little more involved in getting the data out and into an array.
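To make that last step concrete, here is a rough sketch of what I mean by getting named tuples into an array. It uses only Base Julia (1.0 or later, for the built-in named tuple syntax) and a plain vector of named tuples as a stand-in for whatever the iterator yields; the function name rows_to_matrix and the column names a and b are made up for illustration and are not part of JuliaDB or IndexedTables.

```julia
# Stand-in for the rows an iterator would yield (assumption: each row is a
# NamedTuple whose fields are all numeric).
rows = [(a = 1, b = 2.5), (a = 2, b = 3.5), (a = 3, b = 4.5)]

# Hypothetical helper: copy an iterable of named tuples into a dense
# Float64 matrix, one matrix row per tuple.
function rows_to_matrix(rows)
    collected = collect(rows)             # materialize, since the iterator has no indexing
    ncols = length(first(collected))      # number of fields in each named tuple
    out = Matrix{Float64}(undef, length(collected), ncols)
    for (i, row) in enumerate(collected)
        for (j, value) in enumerate(row)  # iterating a NamedTuple yields its values
            out[i, j] = value             # converted to Float64 on assignment
        end
    end
    return out
end

X = rows_to_matrix(rows)
```

This works, but it copies everything and throws away the column names, which is part of why I'm asking whether there is a more direct route.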

My question is: what is the current thinking on how to best go about getting an array (whether or not it’s implemented yet)?

From what I know so far, it seems that the nicest possible solution would be to manipulate your data into the desired format from the DTable, and then make it possible to index this object like a normal array (as efficiently as that may be possible; perhaps there’d be an actual transformation of the data format at this point). From there it’d be straightforward to make something like gather(Array, dtable).


#2

You can get arrays out of a DTable with getdatacol(t, colname) or getindexcol(t, colname). These return a Dagger.jl array, on which you can call gather to get a sequential array.

The converse, where you create a DTable from Dagger.jl / Distributed arrays, is not yet implemented.

Another way to take out a single data column while keeping the indices is to use map: map(tup -> tup.colname, dtable). This is pretty cheap too.
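As a rough illustration of the map pattern, here is the same idea on a plain vector of named tuples in Base Julia (1.0 or later). This is only a stand-in: it does not involve a DTable, so it shows nothing about how the indices are kept, and the field names id, price and scaled are invented for the example.

```julia
# Stand-in rows (assumption: a plain Vector of NamedTuples, not a DTable).
rows = [(id = 1, price = 9.5), (id = 2, price = 12.0)]

# Pull a single column out as an ordinary Vector{Float64}.
prices = map(tup -> tup.price, rows)

# Do some array work, then attach the result to each row as a new field.
scaled = prices ./ maximum(prices)
rows_with_scaled = map((tup, s) -> merge(tup, (scaled = s,)), rows, scaled)
```

The last line is one way to get a computed array back next to the original fields when the rows are ordinary named tuples.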