Query.jl - how to select identical rows differing by one column


#1

Hi, I have a dataframe of columns A,B,C, O.

How can I write a query which shows me those rows which have same pattern in the columns A,B,C but different value for column O ?

example:

A, B, C, O
2, 3, 4, 10
2 ,3, 4, 15
1, 5, 6, 20

the result should show only the first 2 rows.

Thanks


#2

A, B, C, O
2, 3, 4, 10
2 ,3, 4, 15
2, 5, 6, 20
1, 5, 6, 20
1, 5, 6, 21
3, 10, 20, 1

Whats the output for the above?


#3

I think

2, 3, 4, 10
2 ,3, 4, 15
1, 5, 6, 20
1, 5, 6, 21

#4

If that’s the case then it’s group by to find out the count and only keep those groups with count > 1


#5

Wouldn’t that be a SQL-like select distinct A, B, C, O from source? In other words, is just the unique rows of the stream of selected columns.


#6

He wants to drop rows where a,b,c as a group has less than 2 rows


#7

This should do it:

df |>
  @groupby({_.A,_.B,_.C}) |>
  @filter(length(_)>1) |> 
  @mapmany(_, (i,j)->j) |>
  DataFrame

First we group things, then we filter out the groups that have only one element, then we unpack the groups again to go back to a table.

I think I should really add a more convenient way to ungroup tables… @mapmany(_, (i,j)->j) does work, but is a bit unwieldy and hard to remember…


#8

Thanks for asking.

the output for those would be:
A, B, C, O
2, 3, 4, 10
2, 3, 4, 15
1, 5, 6, 20
1, 5, 6 20

the O = 20 is not part of the utput becase A,B,C are not the same.


#9

correct.


#10

what if O does not contain numeric values? Are you suggesting we categorize the column O first ?


#11

I need to see what i am doing wrong with the query you suggested as at the moment it does not give me the wanted results.

maybe because the following result is comming from the query:

A, B, C, O

1, 5 ,9, 20
1, 5 ,9, 20
1, 5 ,9, 20
1, 5, 9, 21

and on a big dataset the lines might still be many. In fact only the last 2 rows would be required. the first 2 are redundant.

Perhaps a bit of background helps.

I have a model for which A, B, C represent the inputs and O is the output of the model. The model specification says that if the same imput pattern is presented, there should be only one outcome on the output.
However due to design “bugs” of the model, the model might have hidden states for which depending from the sequence of the input pattern the output might really be different.

Simulation is used to generate the dataset and clearly I do not design for the hidden states but they are there. Such query should help to find out what are those combination of the input pattern which lead to different output.

The output should be automatically a table with rows forming a unique set (I assume). I believe is an interesting application case.

Many thanks for the help.


#12

Hm, tricky… I guess the right solution here would be that per group that satisfies the length>1 condition, we want to get the unique rows, where unique is based on just looking at the O column. There are multiple difficulties with that at the moment :slight_smile: First, ideally we would have a unique function with a by keyword. But, unfortunately, right now, we don’t… And then I need to think hard how one could combine that with the way I’m handling groups in Query.jl… It might just work or not, I’m not sure right now.

So, no solution for now, I’m afraid. But I’ve opened an issue, maybe I’ll come up with a good way for this at some point: https://github.com/queryverse/Query.jl/issues/233

Thanks for bringing this case up, really super valuable to hear about these cases, it helps a lot to have tough real world cases like this!