I have a dataset that is 1 million × 70, and within the rows many columns have duplicate entries per group. Likewise, many other columns hold values that are unique to each row.
Example
| ID | age | location | street | number |
| --- | --- | --- | --- | --- |
| 1 | 2 | A | E | 23 |
| 1 | 2 | B | G | 23 |
| 2 | 2 | C | G | 34 |
| 2 | 2 | D | G | 34 |
I would like it to look like this:
| ID | age | location | street | number |
| --- | --- | --- | --- | --- |
| 1 | 2 | (A, B) | (E, G) | (23, 23) |
| 2 | 2 | (C, D) | (G, G) | (34, 34) |
I wrote a small piece of code to try to understand the problem, but I'm not sure how to handle it at scale, when I don't know in advance which columns will have duplicate values.
using DataFrames  # needed for DataFrame and groupby

df = DataFrame(x = [1, 1, 1, 1, 2, 2, 2, 2],
               y = ["a", "b", "c", "d", "a", "b", "c", "d"])
gdf = groupby(df, :x)
change_col = gdf[1][:, 2]                 # :y values within the first group
change_col = reshape(gdf[1][:, 2], 1, :)  # the same values as a 1×N row
rowuniq = unique(gdf[1][:, 1])            # the group's key value
test = [rowuniq, change_col]
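One way to scale this up without naming the duplicate-bearing columns, assuming DataFrames.jl: `valuecols(gdf)` lists every non-grouping column of a `GroupedDataFrame`, so the same transformation can be broadcast over all of them. This is a sketch, not the only idiom; the column names and sample values mirror the example table above.

```julia
using DataFrames

df = DataFrame(ID = [1, 1, 2, 2],
               age = [2, 2, 2, 2],
               location = ["A", "B", "C", "D"],
               street = ["E", "G", "G", "G"],
               number = [23, 23, 34, 34])

gdf = groupby(df, :ID)

# `Ref ∘ collect` gathers each group's values into a vector and wraps it
# so `combine` stores it as a single cell instead of spreading it over
# rows; the trailing `.=> valuecols(gdf)` keeps the original column names.
out = combine(gdf, valuecols(gdf) .=> (Ref ∘ collect) .=> valuecols(gdf))
```

With this, `out.location` for ID 1 is `["A", "B"]`. If you want tuples as in the desired output, substitute `Ref ∘ Tuple`; if a column such as `age` is constant within each group and you want a scalar back, apply `unique` (or `first`) to it in a second pass.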