JuliaDB versus

question
#1

I’ve been using JuliaDB and absolutely loving it. It helped me restructure my data, and I’m now able to process it in about 40 LOC. It all makes sense.

In my application, I’m joining and grouping sources together to distill a table that contains all the data needed for the final step. This final step is costly (each iteration, i.e. each row, takes a few seconds). What I’m missing, though, is some sort of “piping”:
I’m executing this costly step on each row with a groupby (so it groups the data and then applies the step), and the result is the final product (the groupby only returns a table when it’s all done, not per row). But since I have hundreds of rows and each row is slow, stopping the process midway causes all the data to be lost (why stop the process, one might ask; good question). What would be nice is something like DataFrames.groupby that iterates over the groups and allows a side effect (like saving each result or piping it to a sink). As I’m writing this, I figure I could just create a grouped table (that’s of course very fast) and then iterate over its rows in a for-loop, saving the results as I go. Yea, that’s basically the same.
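For concreteness, here’s a minimal sketch of the pattern I mean; the table, the :key/:val columns, and costly_step are made-up stand-ins:

using JuliaDB

t = table((key = [1, 1, 2, 3, 2], val = rand(5)))   # stand-in for the joined/distilled table

costly_step(vals) = (sleep(2); sum(vals))           # stand-in for the slow per-group work

# groupby groups the data and applies costly_step to each group, but it only
# returns the result table once every group is done, so interrupting it midway
# loses everything computed so far
result = groupby(costly_step, t, :key; select = :val)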

OK, I’ll post this just in case someone has a better suggestion. Sorry for the somewhat vague post :stuck_out_tongue_closed_eyes:

4 Likes
#2

Query.jl allows you to hook into the pipeline:

julia> using DataFrames, Query

julia> df = DataFrame(a=[1,1,2,3,2], b=rand(5))
5×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.70583  │
│ 2   │ 1     │ 0.109909 │
│ 3   │ 2     │ 0.209058 │
│ 4   │ 3     │ 0.70213  │
│ 5   │ 2     │ 0.85946  │

julia> df |> 
          @groupby(_.a) |>
          @map(i-> begin                                                  
                  @info key(i)                                                                        
                  @info i                                                                             
                  return {a = key(i), b=sum(i.b)}                                                     
              end) |>
          DataFrame                                                                   
[ Info: 1                                                                                     
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 1, b = 0.70583), (a = 1, b = 0.109909)]
[ Info: 2                                                                                     
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 2, b = 0.209058), (a = 2, b = 0.85946)]
[ Info: 3                                                                                     
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 3, b = 0.70213)]                       
3×2 DataFrame                                                                                 
│ Row │ a     │ b        │                                                                    
│     │ Int64 │ Float64  │                                                                    
├─────┼───────┼──────────┤                                                                    
│ 1   │ 1     │ 0.815739 │                                                                    
│ 2   │ 2     │ 1.06852  │                                                                    
│ 3   │ 3     │ 0.70213  │                                                                    

The @groupby clause iterates over Groupings. A Grouping is an AbstractArray that holds the rows belonging to that group; in addition, you can call key(g) on a group to retrieve its key.

In theory @map should be side-effect free, but I guess there is no real harm in adding some diagnostic output like this.
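Building on the example above, you could also persist each group’s result from inside the @map, so an interrupted run keeps what it has already computed. A rough sketch (the CSV sink and the partial.csv file name are just placeholders):

using DataFrames, Query, CSV

df = DataFrame(a=[1,1,2,3,2], b=rand(5))

result = df |>
    @groupby(_.a) |>
    @map(g -> begin
            row = (a = key(g), b = sum(g.b))    # the costly per-group step
            # append each finished group to disk so a crash loses at most the current group
            CSV.write("partial.csv", [row]; append = isfile("partial.csv"))
            return row
        end) |>
    DataFrame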

1 Like
#3

You can probably also do groupby(identity, t, by; select = ...) to get a table where one column has the grouped tables and then work on it. This should be reasonably fast as you are just taking views of the original.
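For example, something along these lines (a sketch with made-up column names; a named tuple of functions is used just to give the result column a predictable name), after which you can walk the grouped table row by row and save each result as you go:

using JuliaDB, Serialization

t = table((key = [1, 1, 2, 3, 2], val = rand(5)))

# one row per group; the vals column holds that group's values
grouped = groupby((vals = identity,), t, :key; select = :val)

for r in rows(grouped)
    out = sum(r.vals)                                            # stand-in for the costly step
    open(io -> serialize(io, out), "group_$(r.key).jls", "w")    # checkpoint each group as you go
end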

1 Like
#4

Yea, that’s what I figured as well. I’ll try it out now :slight_smile:

#5

Maybe try out LightQuery (preferably the dev version)

1 Like
#6

Yea, I was thinking that too!

#7

If you try it out, let me know how it goes, especially in terms of performance.

#8

:+1:

#9

Updates?

1 Like
#10

Yes! So I ended up creating a new table with groupby, as I mentioned in the beginning, and haven’t YET tried your LightQuery. As an aside, it’s a bit overwhelming to test all the different query-like APIs out there, so things take time, a lot of time. One thing is for sure, the more packages I try the better my data becomes… Funny.

1 Like
#11

Np. Just trying to pawn the hard work of benchmarking off on someone else.

2 Likes