[ANN] JuliaDBMeta: metaprogramming tools to manipulate JuliaDB tables

announcement

#1

I’m happy to announce JuliaDBMeta, a set of macros that simplify data manipulation with JuliaDB tables. It can be thought of as a “port” of DataFramesMeta to JuliaDB that exploits two features of JuliaDB tables:

  • fast row iteration
  • complete type information about columns

It lets you manipulate data by referring to columns directly by name. Because the macros know which columns the user needs, only the relevant columns are selected in @groupby and @map operations. Query’s {} syntax for automatically naming columns is also supported:

julia> using JuliaDB, JuliaDBMeta

julia> t = table([1,2,1,2], [4,5,6,7], [0.1, 0.2, 0.3,0.4], names = [:x, :y, :z])
Table with 4 rows, 3 columns:
x  y  z
─────────
1  4  0.1
2  5  0.2
1  6  0.3
2  7  0.4

julia> @groupby t :x {mean(:y) + mean(:z)}
Table with 2 rows, 2 columns:
x  mean(y) + mean(z)
────────────────────
1  5.2
2  6.3

julia> @map t (:x + :y)/:z
4-element Array{Float64,1}:
 50.0   
 35.0   
 23.3333
 22.5   
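
To give a rough idea of how a macro can know which columns an expression needs, here is a minimal sketch in plain Julia. It is not JuliaDBMeta’s actual implementation: used_columns is a hypothetical helper that walks an expression and collects the quoted symbols it finds.

```julia
# Hypothetical helper (not part of JuliaDBMeta): collect the quoted symbols,
# e.g. :x or :y, that appear anywhere inside an expression.
function used_columns(ex, acc = Symbol[])
    if ex isa QuoteNode && ex.value isa Symbol
        # A quoted symbol such as :x inside a quoted expression is a QuoteNode
        ex.value in acc || push!(acc, ex.value)
    elseif ex isa Expr
        # Recurse into sub-expressions (function names are plain Symbols and are skipped)
        foreach(arg -> used_columns(arg, acc), ex.args)
    end
    return acc
end

used_columns(:((:x + :y) / :z))  # returns [:x, :y, :z]
```

A macro receives its arguments as expressions like the one above, so it can run this kind of analysis before any data is touched.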

@apply and @applycombine chain several of these operations together (optionally after grouping), and ordinary JuliaDB functions can be mixed in as well, with _ standing for the result of the previous step:

julia> @apply t begin
       @where :x == 2
       @transform {:x + :y}
       sort(_, :z)
       end
Table with 2 rows, 4 columns:
x  y  z    x + y
────────────────
2  5  0.2  7
2  7  0.4  9
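
For comparison, the same three steps can be written without macros. This is only a sketch on a plain vector of named tuples (assuming Julia 0.7 or later for the built-in named tuple syntax), not what @apply expands to; the column name x_plus_y is made up for the example.

```julia
# The rows of table t from above, as a plain vector of named tuples
rows = [(x = 1, y = 4, z = 0.1), (x = 2, y = 5, z = 0.2),
        (x = 1, y = 6, z = 0.3), (x = 2, y = 7, z = 0.4)]

# Step 1, like `@where :x == 2`: keep only the matching rows
kept = filter(r -> r.x == 2, rows)

# Step 2, like `@transform {:x + :y}`: add a computed column
# (x_plus_y is a hypothetical name chosen for this sketch)
extended = map(r -> merge(r, (x_plus_y = r.x + r.y,)), kept)

# Step 3, like `sort(_, :z)`: sort the intermediate result by z
result = sort(extended, by = r -> r.z)
```

Each step consumes the result of the previous one, which is the role the _ placeholder plays inside the @apply block.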

julia> iris = loadtable(Pkg.dir("JuliaDBMeta", "test", "tables", "iris.csv"));

julia> @applycombine iris :Species begin
           select(_, 1:3, by = i -> i.SepalWidth, rev = true)
           @map {:SepalWidth, Ratio = :SepalLength / :SepalWidth}
           sort(_, by = i -> i.SepalWidth, rev = true)
       end
Table with 9 rows, 3 columns:
Species       SepalWidth  Ratio
─────────────────────────────────
"setosa"      4.4         1.29545
"setosa"      4.2         1.30952
"setosa"      4.1         1.26829
"versicolor"  3.4         1.76471
"versicolor"  3.3         1.90909
"versicolor"  3.2         1.84375
"virginica"   3.8         2.07895
"virginica"   3.8         2.02632
"virginica"   3.6         2.0

Plotting is also supported at the end of this pipeline using StatPlots’ @df macro:

julia> using StatPlots

julia> @apply iris begin
       @where :SepalLength > 4
       @transform {ratio = :PetalLength / :PetalWidth}
       @df scatter(:PetalLength, :ratio, group = :Species)
       end

Distributed tables are only partially supported (not all macros work with them yet); I’m working on it, and it is the next important step for the package.

For more information, see the README.

Feedback is welcome!


#2

Mini update: I’ve added proper documentation, including a version of the hflights dataset tutorial, and support for distributed tables.

I don’t use distributed tables much myself, as my datasets are generally not huge. If some of the “big data users” could try it and report how well it fares, whether there are performance traps, and how usable it is, that’d be very helpful.

If you want to give it a try, do Pkg.checkout("JuliaDBMeta") as most of the new features are not in the released version.


#3

JuliaDBMeta doesn’t seem to be registered yet - at least not for Windows with Julia 0.6.2.


#4

It was registered very recently. Try Pkg.update(); Pkg.add("JuliaDBMeta")


#5

Cool. I’m wondering to what extent it’s possible to create a common “Meta” querying framework, i.e. considering DataFramesMeta.jl. I guess Query.jl already does this, but I don’t know if it has support for JuliaDB yet. I know there have been miscellaneous discussions about the roles of JuliaDB vs DataFrames and whether they will eventually share some code. (I know that eventually DataFrames will get another major overhaul to help tell the compiler about column types, which JuliaDB already does.)


#6

That’s an interesting question. I think what you have in mind sounds more like optimizing Query for JuliaDB. To work with more or less everything, Query only assumes the table can iterate rows, but of course some table implementations can do much more and Query could take advantage of that. Not sure how hard this kind of project would be, but it’d certainly be useful: maybe a good GSOC idea?

Concerning DataFrames versus JuliaDB, I think they are converging to the same optimum from different sides. DataFrames started out not fully typed; then it became clear that some operations work better on fully typed data, and I think there are plans to create a fully typed wrapper (to avoid code duplication this could maybe be Columns from IndexedTables). JuliaDB started out fully typed, but to simplify modifying a table, the not fully typed column dictionary ColDict was added, which is pretty much like a DataFrame IIUC…
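
To make “fully typed” concrete, here is a small sketch in plain Julia (assuming Julia 0.7 or later). The two containers are stand-ins chosen for illustration, not the actual DataFrame, Columns or ColDict types.

```julia
# Fully typed stand-in: the column types are part of the container's type
typed = (x = [1, 2, 1, 2], z = [0.1, 0.2, 0.3, 0.4])
typeof(typed)  # a NamedTuple type that records both column types

# Not fully typed stand-in: the container only knows its values are `Any`
untyped = Dict{Symbol,Any}(:x => [1, 2, 1, 2], :z => [0.1, 0.2, 0.3, 0.4])

# Adding a column to the dictionary is trivial and its type stays the same...
untyped[:w] = ["a", "b", "c", "d"]

# ...whereas adding one to the typed container produces a value of a new type
typed2 = merge(typed, (w = ["a", "b", "c", "d"],))
typeof(typed2) != typeof(typed)  # true
```

The typed version gives the compiler everything it needs to specialize code on the columns, while the dictionary version is easier to modify in place; that is the trade-off described above.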

As for DataFramesMeta versus JuliaDBMeta, the implementations are actually quite different. DataFramesMeta is pretty much column-based, with the exception of @byrow!: it extracts columns before running the code to circumvent type stability issues and then works on those columns. JuliaDBMeta has a few row-wise macros which are fast (at least in theory, haven’t benchmarked yet…) and work out of the box with out-of-core data; implementing this starting from DataFrames would be, I believe, much more challenging. As a downside, because tables are fully typed, I need to be careful to keep compile times from getting too high.
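
The difference between the two strategies can be sketched in plain Julia (assuming Julia 0.7 or later), ignoring performance and out-of-core concerns. Neither half is the actual code of DataFramesMeta or JuliaDBMeta; the variable names are made up for the example.

```julia
# A toy table stored as named columns
cols = (x = [1, 2, 1, 2], y = [4, 5, 6, 7], z = [0.1, 0.2, 0.3, 0.4])

# Column-based strategy: extract whole columns first, then operate on vectors
x, y, z = cols.x, cols.y, cols.z
columnwise = (x .+ y) ./ z

# Row-wise strategy: iterate over fully typed rows and evaluate per row
rows = [(x = a, y = b, z = c) for (a, b, c) in zip(cols.x, cols.y, cols.z)]
rowwise = [(r.x + r.y) / r.z for r in rows]

columnwise == rowwise  # true: both strategies compute the same values
```

The row-wise version only ever needs one row at a time, which is why that style fits data that does not sit in memory all at once.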


#7

I haven’t looked at this yet, but I would like to comment that probably the greatest virtue of DataFrames right now is that columns can be any AbstractVector type and don’t require conversion. I don’t know if Columns accommodates this, but it is highly desirable that there is some dataframe implementation that works this way.


#8

Columns also accepts any AbstractArray, and the array types are encoded in its type parameters:

julia> using IndexedTables, StaticArrays

julia> Columns(1:3, SVector{3}([1,2,3]), view(rand(10), 1:3))
3-element Columns{Tuple{Int64,Int64,Float64}}:
 (1, 1, 0.628812)
 (2, 2, 0.512944)
 (3, 3, 0.89427) 

julia> typeof(ans)
IndexedTables.Columns{Tuple{Int64,Int64,Float64},Tuple{UnitRange{Int64},StaticArrays.SArray{Tuple{3},Int64,1,3},SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}}}