JuliaDB versus

I’ve been using JuliaDB and absolutely loving it. It helped me restructure my data, and I can now process it in about 40 LOC. It all makes sense.

In my application, I’m joining and grouping sources together to distill a table that contains all the data needed for the final step. This final step is costly (each iteration, i.e. each row, takes a few seconds). What I’m missing, though, is some sort of “piping”:
I’m executing this costly step on each row with a groupby (so it groups the data and then applies the step). The result is the final product (groupby returns a table only when it’s done, not per row). But since I have hundreds of rows and each row is slow, stopping the process midway loses all the data (why stop the process, one might ask; good question). What would be good is some DataFrames.groupby that iterates over the groups and has some side effect (like saving, or piping to a sink). As I’m writing this, I figure I could just create a grouped table (that’s of course very fast) and then iterate over the rows in a for-loop, saving the results as I go. Yea, that’s basically the same thing.
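
That loop could look something like this (a rough sketch only: `t`, the grouping column `:key`, `costly_step`, and the output paths are all hypothetical placeholders, not names from my actual code):

```julia
using JuliaDB, Serialization

# Group first (fast: it just builds views of the original table),
# then run the slow step per group, persisting each result immediately
# so an interrupted run loses nothing.
grouped = groupby(identity, t, :key)

for (i, row) in enumerate(rows(grouped))
    result = costly_step(row)                   # the slow per-group work
    serialize("results/result_$i.jls", result)  # checkpoint to disk
end
```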

OK, I’ll post this just in case someone has a better suggestion. Sorry for the somewhat vague post :stuck_out_tongue_closed_eyes:


Query.jl allows you to hook into the pipeline:

julia> using DataFrames, Query

julia> df = DataFrame(a=[1,1,2,3,2], b=rand(5))
5×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.70583  │
│ 2   │ 1     │ 0.109909 │
│ 3   │ 2     │ 0.209058 │
│ 4   │ 3     │ 0.70213  │
│ 5   │ 2     │ 0.85946  │

julia> df |>
          @groupby(_.a) |>
          @map(i -> begin
                  @info key(i)
                  @info i
                  return {a = key(i), b = sum(i.b)}
              end) |>
          DataFrame
[ Info: 1
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 1, b = 0.70583), (a = 1, b = 0.109909)]
[ Info: 2
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 2, b = 0.209058), (a = 2, b = 0.85946)]
[ Info: 3
[ Info: NamedTuple{(:a, :b),Tuple{Int64,Float64}}[(a = 3, b = 0.70213)]
3×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.815739 │
│ 2   │ 2     │ 1.06852  │
│ 3   │ 3     │ 0.70213  │

The @groupby clause produces Groupings. A Grouping is an AbstractArray that holds the rows belonging to that group, and you can call key(g) on a group to retrieve that group’s key.
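
Since the groups are just iterable, you can also loop over them yourself and persist each result as soon as it is computed, which gives the “side effect per group” behavior from the original question. A sketch (serializing to a file stands in for whatever sink you prefer; the file names are made up):

```julia
using Query, DataFrames, Serialization

df = DataFrame(a = [1, 1, 2, 3, 2], b = rand(5))

# Iterate the Groupings directly instead of collecting into a DataFrame,
# saving each group's result to disk the moment it is ready.
for g in df |> @groupby(_.a)
    result = (a = key(g), b = sum(r.b for r in g))
    serialize("group_$(key(g)).jls", result)
end
```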

@map should normally be side-effect free, but I guess there is no real harm in adding some diagnostic output like this.


You can probably also do groupby(identity, t, by; select = ...) to get a table in which one column holds the grouped sub-tables, and then work on that. This should be reasonably fast, since you are just taking views of the original.
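
For instance (a minimal sketch with made-up data; the output column is named after the function passed in, so here it should be :identity, though it’s worth checking against your JuliaDB version):

```julia
using JuliaDB

t = table((a = [1, 1, 2], b = [10, 20, 30]))

# One row per group; the :identity column holds a view of the
# selected data (:b) for that group.
g = groupby(identity, t, :a; select = :b)

for r in rows(g)
    @show r.a r.identity
end
```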


Yea, that’s what I figured as well. I’ll try it out now :slight_smile:

Maybe try out LightQuery (preferably the dev version)


Yea, I was thinking that too!

If you try it out let me know how it goes esp. in terms of performance.

:+1:

Updates?


Yes! So I ended up creating a new table with groupby, as I mentioned in the beginning, and haven’t YET tried your LightQuery. As an aside, it’s a bit overwhelming to test all the different query-like APIs out there, so things take time, a lot of time. One thing is for sure: the more packages I try, the better my data becomes… Funny.


Np. Just trying to pawn the hard work of benchmarking off onto someone else.


@bramtayl I’m curious, do you know how well LightQuery works with out-of-core JuliaDB datasets? Per the docs, a much more limited set of operations is available once the data is out-of-core, and I’m not entirely sure where LightQuery sits among all the Query frameworks out there.

@versipellis I haven’t tried. LightQuery is in a bit of a state of flux right now, as I’m simultaneously adding SQL support to Query and LightQuery. But I suspect the answer will be yes, provided that:

  1. the data is pre-sorted
  2. you can index out of order