JuliaDB versus

I’ve been using JuliaDB and absolutely loving it. It helped me restructure my data, and I can now process it in about 40 LOC. It all makes sense.

In my application, I’m joining and grouping sources together to distill a table that contains all the data needed for the final step. This final step is costly (each iteration, i.e. each row, takes a few seconds). What I’m missing, though, is some sort of “piping”:
I’m executing this costly step on each row with a groupby (so it groups the data and then applies the step). The result is the final product (groupby returns a table only when it’s done, not per row). But since I have hundreds of rows and each row is slow, stopping the process midway loses all the data (why stop the process, one might ask; good question). What would be good is some DataFrames.groupby that iterates over the groups and has some side effect (like saving, or piping to a sink). As I’m writing this, I figure I could just create a grouped table (that’s of course very fast) and then iterate over the rows in a for-loop, saving the results as I go. Yea, that’s basically the same thing.
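
That loop could look something like this (a rough sketch only: `t`, the grouping column `:key`, `costly_step`, and the output paths are all hypothetical placeholders, not names from my actual code):

```julia
using JuliaDB, Serialization

# Group first (fast: it just builds views of the original table),
# then run the slow step per group, persisting each result immediately
# so an interrupted run loses nothing.
grouped = groupby(identity, t, :key)

for (i, row) in enumerate(rows(grouped))
    result = costly_step(row)                   # the slow per-group work
    serialize("results/result_$i.jls", result)  # checkpoint to disk
end
```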

OK, I’ll post this just in case someone has a better suggestion. Sorry for the somewhat vague post :stuck_out_tongue_closed_eyes:


Query.jl allows you to hook into the pipeline:

julia> using DataFrames, Query

julia> df = DataFrame(a=[1,1,2,3,2], b=rand(5))
5×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.70583  │
│ 2   │ 1     │ 0.109909 │
│ 3   │ 2     │ 0.209058 │
│ 4   │ 3     │ 0.70213  │
│ 5   │ 2     │ 0.85946  │

julia> df |>
          @groupby(_.a) |>
          @map(i -> begin
                  @info key(i)
                  @info i
                  return {a = key(i), b = sum(i.b)}
              end) |>
          DataFrame
[ Info: 1
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 1, b = 0.70583), (a = 1, b = 0.109909)]
[ Info: 2
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 2, b = 0.209058), (a = 2, b = 0.85946)]
[ Info: 3
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 3, b = 0.70213)]
3×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.815739 │
│ 2   │ 2     │ 1.06852  │
│ 3   │ 3     │ 0.70213  │

The @groupby clause produces Groupings. A Grouping is an AbstractArray that holds the rows belonging to that group, and you can call key(g) on a group to retrieve that group’s key.
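
Since the groups are just iterable, you can also loop over them yourself and persist each result as soon as it is computed, which gives the “side effect per group” behavior from the original question. A sketch (serializing to a file stands in for whatever sink you prefer; the file names are made up):

```julia
using Query, DataFrames, Serialization

df = DataFrame(a = [1, 1, 2, 3, 2], b = rand(5))

# Iterate the Groupings directly instead of collecting into a DataFrame,
# saving each group's result to disk the moment it is ready.
for g in df |> @groupby(_.a)
    result = (a = key(g), b = sum(r.b for r in g))
    serialize("group_$(key(g)).jls", result)
end
```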

@map should normally be side-effect free, but I guess there is no real harm in adding some diagnostic output like this.


You can probably also do groupby(identity, t, by; select = ...) to get a table in which one column holds the grouped sub-tables, and then work on that. This should be reasonably fast, since you are just taking views of the original.
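
For instance (a minimal sketch with made-up data; the output column is named after the function passed in, so here it should be :identity, though it’s worth checking against your JuliaDB version):

```julia
using JuliaDB

t = table((a = [1, 1, 2], b = [10, 20, 30]))

# One row per group; the :identity column holds a view of the
# selected data (:b) for that group.
g = groupby(identity, t, :a; select = :b)

for r in rows(g)
    @show r.a r.identity
end
```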


Yea, that’s what I figured as well. I’ll try it out now :slight_smile:

Maybe try out LightQuery (preferably the dev version)


Yea, I was thinking that too!

If you try it out let me know how it goes esp. in terms of performance.

:+1:

Updates?


Yes! So I ended up creating a new table with groupby, as I mentioned in the beginning, and haven’t YET tried your LightQuery. As an aside, it’s a bit overwhelming to test all the different query-like APIs out there, so things take time, a lot of time. One thing is for sure: the more packages I try, the better my data becomes… Funny.


Np. Just trying to pawn the hard work of benchmarking off onto someone else.


@bramtayl I’m curious, do you know how well LightQuery works with out-of-core JuliaDB datasets? Per the docs, a much more limited set of operations is available once the data is out-of-core, and I’m not entirely sure where LightQuery sits among all the Query frameworks out there.

@versipellis I haven’t tried. LightQuery is in a bit of a state of flux right now, as I’m simultaneously adding SQL support to Query and LightQuery. But I suspect the answer will be yes, provided that:

  1. the data is pre-sorted
  2. you can index out of order