Changing many rows to single row julia1.5.3

orthoeng · December 10, 2020, 5:02pm

I have a dataset that is 1miil x 70 but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also.

Example
| ID | age | location | street | number |
| 1 | 2 | A | E | 23 |
| 1 | 2 | B | G | 23 |
| 2 | 2 | C | G | 34 |
| 2 | 2 | D | G | 34 |

would like it to look like this.

| ID | age | location | street | number |
| 1 | 2 | (A ,B ) | (E, G) | (23, 23) |
| 2 | 2 | (C, D) | (G, G) | (34, 34) |

I did a small code to try and understand the problem. But I’m not sure how to do it in a large scale problem. When I don’t know all the columns that will have duplicate values.

df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = [“a”, “b”, “c”, “d”, “a”, “b”, “c”, “d”])
gdf = groupby(df, :x)
change_col = gdf[1][:,2]
change_col = reshape(gdf[1][:,2], 1, : )
rowuniq = unique(gdf[1][:,1])
test = [rowuniq, change_col]

rocco_sprmnt21 · December 12, 2020, 1:22pm

I’m not sure I understand what the expected result is, but the example I illustrate could be a good approximation of what you are looking for or, hopefully, a starting point in the right direction.

It’s certainly not clear what you mean by referring to a situation where you don’t know which columns have repeating data.
I believe that you have to choose, on the basis of the objective you have, the columns on which to group the repeated values.

orthoeng · December 12, 2020, 5:27pm

Hello rocco, Sorry this is confusing.

What I have is many folder mark by years in each folder I have many excel files that I’m combining. Not all the years have the same type of column structure. I have a dataset that is around about (1miil x 70) + but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also. When I look at the head I see a lot of duplicate entries but each rows is unique so there is some unique columns somewhere in 70 +. So I need some way that I can get duplicates out but keep the unique values. Because I have so many columns it’s hard to figure out where all the unique values will be likewise for the duplicates. I modified the example I see I made some takes last time.

Example
| ID | age | location | street | number |
| 1 | 2 | A | E | 23 |
| 1 | 2 | B | G | 23 |
| 2 | 2 | C | G | 34 |
| 2 | 2 | D | G | 34 |

would like it to look like this.

| ID | age | location | street | number |
| 1 | 2 | (A ,B ) | (E, G) | (23) |
| 2 | 2 | (C, D) | (G) | (34) |

kmundnic · December 12, 2020, 5:55pm

This is kind of ugly and possibly not fast, but it’ll be a starting point:

using DataFrames
df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = ["a", "b", "a", "b", "c", "d", "c", "d"])
df2 = DataFrame(x = Int[], y = Tuple[])

groups = groupby(df, :x)
for group in groups
    x = only(unique(group.x))
    y = Tuple(unique(group.y))
    df2 = vcat(df2, DataFrame(x = x, y = y))
end

println(df2)
2×2 DataFrame
 Row │ x      y
     │ Int64  Tuple
─────┼───────────────────
   1 │     1  ("a", "b")
   2 │     2  ("c", "d")

pdeffebach · December 12, 2020, 6:34pm

You porobably want the following

julia> combine(groupby(df, :x), names(df, Not(:x)) .=> Ref ∘ Tuple ∘ unique .=> names(df, Not(:x)))

bkamins · December 12, 2020, 7:01pm

renamecols=false now keeps the source column names also.

And Ref is not needed as Tuple is by default left “as is”.

orthoeng · December 13, 2020, 3:57am

Thank you @rocco_sprmnt21, @kmundnic, @pdeffebach, @bkamins for all the help.

@pdeffebach you Blow my mind what is this part of the code (.=> Ref ∘ Tuple ∘ unique) also what is this ∘?

Oscar_Smith · December 13, 2020, 5:52am

It’s function composition (types \circ<tab>. Function composition - Wikipedia

orthoeng · December 13, 2020, 11:51pm

@Oscar_Smith Thank you.

Topic		Replies	Views
Groupby / reshaping dataframe with unique values Data data , dataframes	17	1471	December 19, 2020
Transform DataFrame into a single row by adding row contents as new columns New to Julia dataframes	3	321	May 1, 2021
If I have a DataFrame and want to get `unique` columns but treat each column as a General Usage	1	866	February 23, 2021
Manipulation of dataframe rows upon repeated values in a given column New to Julia	5	724	April 15, 2021
Delete duplicate rows in a DataFrame New to Julia dataframes	10	6063	June 22, 2023

Changing many rows to single row julia1.5.3

Related topics