Changing many rows to single row julia1.5.3

I have a dataset that is 1miil x 70 but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also.

Example
| ID | age | location | street | number |
| 1 | 2 | A | E | 23 |
| 1 | 2 | B | G | 23 |
| 2 | 2 | C | G | 34 |
| 2 | 2 | D | G | 34 |

would like it to look like this.

| ID | age | location | street | number |
| 1 | 2 | (A ,B ) | (E, G) | (23, 23) |
| 2 | 2 | (C, D) | (G, G) | (34, 34) |

I did a small code to try and understand the problem. But I’m not sure how to do it in a large scale problem. When I don’t know all the columns that will have duplicate values.

df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = [β€œa”, β€œb”, β€œc”, β€œd”, β€œa”, β€œb”, β€œc”, β€œd”])
gdf = groupby(df, :x)
change_col = gdf[1][:,2]
change_col = reshape(gdf[1][:,2], 1, : )
rowuniq = unique(gdf[1][:,1])
test = [rowuniq, change_col]

I’m not sure I understand what the expected result is, but the example I illustrate could be a good approximation of what you are looking for or, hopefully, a starting point in the right direction.

It’s certainly not clear what you mean by referring to a situation where you don’t know which columns have repeating data.
I believe that you have to choose, on the basis of the objective you have, the columns on which to group the repeated values.

1 Like

Hello rocco, Sorry this is confusing.

What I have is many folder mark by years in each folder I have many excel files that I’m combining. Not all the years have the same type of column structure. I have a dataset that is around about (1miil x 70) + but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also. When I look at the head I see a lot of duplicate entries but each rows is unique so there is some unique columns somewhere in 70 +. So I need some way that I can get duplicates out but keep the unique values. Because I have so many columns it’s hard to figure out where all the unique values will be likewise for the duplicates. I modified the example I see I made some takes last time.

Example
| ID | age | location | street | number |
| 1 | 2 | A | E | 23 |
| 1 | 2 | B | G | 23 |
| 2 | 2 | C | G | 34 |
| 2 | 2 | D | G | 34 |

would like it to look like this.

| ID | age | location | street | number |
| 1 | 2 | (A ,B ) | (E, G) | (23) |
| 2 | 2 | (C, D) | (G) | (34) |

This is kind of ugly and possibly not fast, but it’ll be a starting point:

using DataFrames
df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = ["a", "b", "a", "b", "c", "d", "c", "d"])
df2 = DataFrame(x = Int[], y = Tuple[])

groups = groupby(df, :x)
for group in groups
    x = only(unique(group.x))
    y = Tuple(unique(group.y))
    df2 = vcat(df2, DataFrame(x = x, y = y))
end

println(df2)
2Γ—2 DataFrame
 Row β”‚ x      y
     β”‚ Int64  Tuple
─────┼───────────────────
   1 β”‚     1  ("a", "b")
   2 β”‚     2  ("c", "d")

You porobably want the following

julia> combine(groupby(df, :x), names(df, Not(:x)) .=> Ref ∘ Tuple ∘ unique .=> names(df, Not(:x)))

renamecols=false now keeps the source column names also.

And Ref is not needed as Tuple is by default left β€œas is”.

1 Like

Thank you @rocco_sprmnt21, @kmundnic, @pdeffebach, @bkamins for all the help.

@pdeffebach you Blow my mind what is this part of the code (.=> Ref ∘ Tuple ∘ unique) also what is this ∘?

It’s function composition (types \circ<tab>. https://en.wikipedia.org/wiki/Function_composition

@Oscar_Smith Thank you.

1 Like