Help with a little query


#1

I don’t generally work with much tabular data, so I could use some input on the best way to do a little CSV processing task that popped up today.

The data I’m working with is here, and is evaluation metrics for a bunch of different algorithms. The columns I care about are metric (one of 4 metrics used to score the algorithms), method (one of 24 algorithms), and score (the metric score for that algorithm).

I’d like to average the score across all the rows with the same metric and method. Bonus points if I can add columns for each of the 4 metrics so i end up with a grid with the 4 metrics along the top and each of the methods as rows.

So far I’ve tried using Query.jl, (loading the CSV as a DataTable) but I’m having trouble grokking it and I’m not sure which issues I hit are failures of my understanding or bugs. I also tried using DataFramesMeta.jl, but it seems that CSV.jl gives a DataFrame with Nullable columns and the my queries weren’t working.

Sorry for the hold-my-hand question, normally I’d try to work further on finding a solution but given the state of flux of the data ecosystem I figured someone with more expertise would be able to point me in the right direction more quickly.


#2

Does this generate your desired output?

using DataFrames

A = readtable("./sisec_mus_2017_full.csv");
B = by(A, [:method_id,:metric_id], df -> DataFrame(score = mean(df[:score])))
C = unstack(B,:method_id,:metric_id,:score)

#3

Yes! That’s perfect. Thanks.

-s


#4

This does what you want (sort of) with Query

sdt = @from i in dt begin
  @group score = i.score by @NT(metric = i.metric, method = i.method) into g
  @select {m = mean(g)}
  @collect DataTable
end
96×1 DataTables.DataTable
│ Row │ m         │
├─────┼───────────┤
│ 1   │ 1.74218   │
│ 2   │ 1.36965   │
...

It loses the column key names though. I don’t have a good idea of how grouping in query is supposed to be done.


#5

Here is how I would do this with Query.jl and DataFrames.jl:

df = @from i in load("sisec_mus_2017_full.csv") begin                    
    @group i.score by @NT(metric = i.metric, method = i.method) into g  
    @select {g.key.metric, g.key.method, score = mean(g)}               
    @collect DataFrame                                                  
end
df = unstack(df, :metric, :score)                                    

There is one thing I hope to make easier in the future: it would be nice if one didn’t have to use the @NT macro in the @group statement, but could instead use {i.metric, i.method} like in a @select statement. I’m tracking this here.

@aaowens the fields by which things were grouped can always be accessed via the key field of the group, i.e. g.key.

Actually, I really appreciate it if people ask their questions early on. It helps me a lot understand where the documentation needs to be improved etc. So please keep those questions coming, even if you haven’t tried for a long time to find a solution on your own! I’m tracking the doc improvements that we need based on this discussion here at https://github.com/davidanthoff/Query.jl/issues/137.


#6

One thing I don’t understand is how to handle a grouping when I want several aggregates. For example, this code doesn’t work

sdt = @from i in load("sisec_mus_2017_full.csv") begin
    @group @NT(score = i.score, track_id = i.track_id) by @NT(metric = i.metric, method = i.method) into g
    @select {g.key.metric, g.key.method, score = mean(g)}
    @collect DataTable
end

Here, I want the mean of both fields of g.

I’m looking for an equivalent to the DataFrames approach, where you just define a function df -> DataFrame(...), and you have access to all the fields for each grouping. This approach is very general, which is useful, For example, I might have some complex function aggregating several columns of the DataFrame into a number. This is easy with DataFrames, but I don’t know how to do it with Query, probably because I don’t have a good understanding of what the output of into g is and how to use it.


#7

g will be an array with the elements being whatever you select in the first argument to the @group statement. So in your latest example, each g will be an array of named tuples with fields score and track_id. In this case you can also just think of it as a table with columns score and track_id.

So mean(g) doesn’t work, because g is an array of named tuples. Currently, one way to solve this is to write @select {g.key.metric, g.key.method, score = mean(j->j.score, g), track_id = mean(j->j.track_id) to get the two means. This works ok for mean because it takes an anonymous function as its first argument, but other aggregation functions work less well.

This PR would enable special syntax in queries that would translate a..b into map(i->i.b, a), so one could write this as @select {score = mean(g..score), track_id = mean(g..track_id)}. I’m probably going to merge this soon, but right now I’m waiting for a little bit of feedback in a julia base issue here whether that kind of syntax (or some alternative) might find its way into julia base, in which case I’d be somewhat hesitant to create special syntax in Query.jl right now that would be incompatible with a future base solution. But we’ll see, I think the a..b stuff would be so useful in queries that I might just go ahead and merge it independently of any base decision. I’ll just need a couple more days to mull this over ;).


#8

Ok, I think I get it. The following query works.

function myfun(A, B)
  mean(A) + mean(x -> x^2, B)
end
sdt = @from i in load("sisec_mus_2017_full.csv") begin
    @group @NT(score = i.score, track_id = i.track_id) by @NT(metric = i.metric, method = i.method) into g
    @select {g.key.metric, m = myfun(map(j -> j.score, g), map(j -> j.track_id, g) ) }
    @collect DataTable
end

This is pretty ugly though, but I see how the … syntax would clean it up a lot.


#9

Yes, I think something like the a..b syntax and {} everywhere for named tuples should make things quite nice, i.e. the above query would then look like this:

sdt = @from i in load("sisec_mus_2017_full.csv") begin
    @group {i.score, i.track_id} by {i.metric, i.method} into g
    @select {g.key.metric, m = myfun(g..score, g..track_id) }
    @collect DataTable
end

which is not too bad, I think.

Note that we are discussing an alternative syntax for the a..b suggestion in https://github.com/JuliaLang/julia/issues/22710 right now. If folks that are users of the whole data stack have opinions about the various design ideas over there (and the importance or not-importance of special syntax etc.), please leave your feedback over there, just a thumbs up or down is useful information for the folks designing things in base.