Changing many rows to single row julia1.5.3

I have a dataset that is 1miil x 70 but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also.

Example
`| ID | age | location | street | number |`
`| 1 | 2 | A | E | 23 |`
`| 1 | 2 | B | G | 23 |`
` | 2 | 2 | C | G | 34 |`
` | 2 | 2 | D | G | 34 |`

would like it to look like this.

` | ID | age | location | street | number |`
` | 1 | 2 | (A ,B ) | (E, G) | (23, 23) |`
` | 2 | 2 | (C, D) | (G, G) | (34, 34) |`

I did a small code to try and understand the problem. But Iβm not sure how to do it in a large scale problem. When I donβt know all the columns that will have duplicate values.

df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = [βaβ, βbβ, βcβ, βdβ, βaβ, βbβ, βcβ, βdβ])
gdf = groupby(df, :x)
change_col = gdf[1][:,2]
change_col = reshape(gdf[1][:,2], 1, : )
rowuniq = unique(gdf[1][:,1])
test = [rowuniq, change_col]

Iβm not sure I understand what the expected result is, but the example I illustrate could be a good approximation of what you are looking for or, hopefully, a starting point in the right direction.

Itβs certainly not clear what you mean by referring to a situation where you donβt know which columns have repeating data.
I believe that you have to choose, on the basis of the objective you have, the columns on which to group the repeated values.

1 Like

Hello rocco, Sorry this is confusing.

What I have is many folder mark by years in each folder I have many excel files that Iβm combining. Not all the years have the same type of column structure. I have a dataset that is around about (1miil x 70) + but with in the rows many columns have duplicate entries. Likewise there are many rows that are unique columns also. When I look at the head I see a lot of duplicate entries but each rows is unique so there is some unique columns somewhere in 70 +. So I need some way that I can get duplicates out but keep the unique values. Because I have so many columns itβs hard to figure out where all the unique values will be likewise for the duplicates. I modified the example I see I made some takes last time.

Example
`| ID | age | location | street | number |`
`| 1 | 2 | A | E | 23 |`
`| 1 | 2 | B | G | 23 |`
`| 2 | 2 | C | G | 34 |`
`| 2 | 2 | D | G | 34 |`

would like it to look like this.

`| ID | age | location | street | number |`
`| 1 | 2 | (A ,B ) | (E, G) | (23) |`
`| 2 | 2 | (C, D) | (G) | (34) |`

This is kind of ugly and possibly not fast, but itβll be a starting point:

``````using DataFrames
df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2], y = ["a", "b", "a", "b", "c", "d", "c", "d"])
df2 = DataFrame(x = Int[], y = Tuple[])

groups = groupby(df, :x)
for group in groups
x = only(unique(group.x))
y = Tuple(unique(group.y))
df2 = vcat(df2, DataFrame(x = x, y = y))
end

println(df2)
2Γ2 DataFrame
Row β x      y
β Int64  Tuple
ββββββΌβββββββββββββββββββ
1 β     1  ("a", "b")
2 β     2  ("c", "d")
``````

You porobably want the following

``````julia> combine(groupby(df, :x), names(df, Not(:x)) .=> Ref β Tuple β unique .=> names(df, Not(:x)))
``````

`renamecols=false` now keeps the source column names also.

And `Ref` is not needed as `Tuple` is by default left βas isβ.

1 Like

Thank you @rocco_sprmnt21, @kmundnic, @pdeffebach, @bkamins for all the help.

@pdeffebach you Blow my mind what is this part of the code (.=> Ref β Tuple β unique) also what is this β?

Itβs function composition (types `\circ<tab>`. https://en.wikipedia.org/wiki/Function_composition

@Oscar_Smith Thank you.

1 Like