Query.jl - how to select identical rows differing by one column

elshera · January 7, 2019, 9:26pm

Hi, I have a dataframe with columns A, B, C, O.

How can I write a query that shows me the rows which have the same pattern in columns A, B, C but a different value in column O?

example:

A, B, C, O
2, 3, 4, 10
2, 3, 4, 15
1, 5, 6, 20
…

The result should show only the first 2 rows.

Thanks

xiaodai · January 7, 2019, 11:21pm

A, B, C, O
2, 3, 4, 10
2, 3, 4, 15
2, 5, 6, 20
1, 5, 6, 20
1, 5, 6, 21
3, 10, 20, 1

What’s the output for the above?

Seif_Shebl · January 7, 2019, 11:26pm

I think

2, 3, 4, 10
2, 3, 4, 15
1, 5, 6, 20
1, 5, 6, 21

xiaodai · January 7, 2019, 11:52pm

If that’s the case, then group by A, B, C, count the rows in each group, and keep only the groups with count > 1.

Nosferican · January 7, 2019, 11:59pm

Wouldn’t that be a SQL-like select distinct A, B, C, O from source? In other words, it is just the unique rows of the stream of selected columns.

xiaodai · January 8, 2019, 12:11am

He wants to drop the rows whose (A, B, C) group has fewer than 2 rows.

davidanthoff · January 8, 2019, 1:17am

This should do it:

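using DataFrames, Query

df |>
  @groupby({_.A, _.B, _.C}) |>   # group rows by the input pattern
  @filter(length(_) > 1) |>      # keep groups with more than one row
  @mapmany(_, (i, j) -> j) |>    # unpack the groups back into rows
  DataFrame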

First we group things, then we filter out the groups that have only one element, then we unpack the groups again to go back to a table.

I think I should really add a more convenient way to ungroup tables… @mapmany(_, (i,j)->j) does work, but is a bit unwieldy and hard to remember…
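For comparison, the same three steps (group, filter, ungroup) can be sketched with DataFrames.jl alone. This is a swapped-in alternative, not the Query.jl pipeline above: it assumes a recent DataFrames.jl where `filter` accepts a `GroupedDataFrame`, and the sample data is the example table posted earlier in the thread.

```julia
using DataFrames

# Sample data taken from the example table earlier in the thread.
df = DataFrame(A = [2, 2, 2, 1, 1, 3],
               B = [3, 3, 5, 5, 5, 10],
               C = [4, 4, 6, 6, 6, 20],
               O = [10, 15, 20, 20, 21, 1])

# Group by the input pattern (A, B, C).
groups = groupby(df, [:A, :B, :C])

# Keep only the groups that contain more than one row.
repeated = filter(g -> nrow(g) > 1, groups)

# Flatten the remaining groups back into a single table.
result = DataFrame(repeated)
```

Here `result` should hold the four rows with patterns 2, 3, 4 and 1, 5, 6, while the single-row patterns 2, 5, 6 and 3, 10, 20 are dropped.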

elshera · January 8, 2019, 12:40pm

Thanks for asking.

The output for those would be:
A, B, C, O
2, 3, 4, 10
2, 3, 4, 15
1, 5, 6, 20
1, 5, 6, 21

The row 2, 5, 6, 20 is not part of the output because its A, B, C values are not shared with any other row.

elshera · January 8, 2019, 12:40pm

correct.

elshera · January 8, 2019, 12:41pm

What if O does not contain numeric values? Are you suggesting we categorize the column O first?

elshera · January 8, 2019, 12:41pm

I need to see what I am doing wrong with the query you suggested, as at the moment it does not give me the wanted results.

Maybe it is because the query returns the following result:

A, B, C, O

1, 5, 9, 20
1, 5, 9, 20
1, 5, 9, 20
1, 5, 9, 21

On a big dataset there might still be many such lines. In fact, only the last 2 rows are required; the first 2 are redundant.

Perhaps a bit of background helps.

I have a model for which A, B, C represent the inputs and O is the output. The model specification says that if the same input pattern is presented, there should be only one outcome on the output.
However, due to design “bugs”, the model might have hidden states, so that depending on the sequence of input patterns the output might really be different.

Simulation is used to generate the dataset, and clearly I do not design for the hidden states, but they are there. Such a query should help to find the input patterns that lead to different outputs.

The output should automatically be a table whose rows form a unique set (I assume). I believe it is an interesting application case.

Many thanks for the help.

davidanthoff · January 10, 2019, 6:36am

Hm, tricky… I guess the right solution here would be that, for each group that satisfies the length>1 condition, we want to get the unique rows, where unique is based on just looking at the O column. There are multiple difficulties with that at the moment. First, ideally we would have a unique function with a by keyword. But, unfortunately, right now, we don’t… And then I need to think hard about how one could combine that with the way I’m handling groups in Query.jl… It might just work or not, I’m not sure right now.

So, no solution for now, I’m afraid. But I’ve opened an issue, maybe I’ll come up with a good way for this at some point: Find a solution for selecting unique group members · Issue #233 · queryverse/Query.jl · GitHub
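Outside Query.jl, the idea of keeping only the unique members of each conflicting group can be sketched with DataFrames.jl. This is only a sketch under assumptions: it drops exact duplicate rows before grouping (rather than using a `unique` with a by keyword), it relies on `filter` accepting a `GroupedDataFrame`, and the rows for patterns 2, 3, 4 and 3, 7, 8 are hypothetical additions to the data shown in the thread.

```julia
using DataFrames

# Rows for pattern 1, 5, 9 come from the thread; the rows for patterns
# 2, 3, 4 and 3, 7, 8 are hypothetical and have a single outcome each.
df = DataFrame(A = [1, 1, 1, 1, 2, 3, 3],
               B = [5, 5, 5, 5, 3, 7, 7],
               C = [9, 9, 9, 9, 4, 8, 8],
               O = [20, 20, 20, 21, 10, 30, 30])

# Step 1: drop exact duplicate rows, so each (A, B, C, O) combination appears once.
distinct_rows = unique(df)

# Step 2: a pattern that still has more than one row has more than one distinct O.
conflicts = filter(g -> nrow(g) > 1, groupby(distinct_rows, [:A, :B, :C]))

# Flatten the conflicting groups back into a single table.
result = DataFrame(conflicts)
```

With this data, `result` should contain only the rows 1, 5, 9, 20 and 1, 5, 9, 21. Since the rows are compared by equality rather than by arithmetic, the same approach should also work when O holds non-numeric values such as strings.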

Thanks for bringing this case up, really super valuable to hear about these cases, it helps a lot to have tough real world cases like this!
