Unsure how to solve error message when applying unstack to DataFrame

Lee_Phillips · May 14, 2024, 3:19am

Hello All

I have a dataframe, the first 5 rows are shown below

5×4 DataFrame
 Row │ PointAID_FK  PointBID_FK  Distance  Time
     │ Any          Any          Float64   Float64
─────┼─────────────────────────────────────────────
   1 │ 32210470     26402465271    206.12   346.51
   2 │ 26402159021  30461660       190.67   282.95
   3 │ 32761450     25501286700     77.28   117.18
   4 │ 25501310391  25501283840     34.41    20.7
   5 │ 26401205831  26402111690     23.67    17.77
 .
 .
 .

And Id like to unstack it into a massive array

When I run the unstack command as follows

unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance)

I get the following error

ERROR: LoadError: ArgumentError: Non-unique column names produced. Non equal values in `colkey` were mapped to the same column name.

After trying to find the potential source of this error message I have no results.

I am completely at a loss for what the potential problem might be

I’m open to any amount of help you can give

bkamins · May 14, 2024, 6:17am

The error message tells you that two different values in the source data have the same string representation, so they produce duplicates in column names.

Here is a minimal example:

julia> df = DataFrame(x=[0, "0"])
2×1 DataFrame
 Row │ x
     │ Any
─────┼─────
   1 │ 0
   2 │ 0

julia> unstack(df, :x, :x, :x)
ERROR: ArgumentError: Non-unique column names produced. Non equal values in `colkey` were mapped to the same column name.

And the source of the problem is that 0 (number) and string "0" have the same string representation ("0").

Lee_Phillips · May 14, 2024, 7:08am

Thanks for the reply

I’ll be honest, I’m still a bit confused by your reply

Does this mean that somewhere in my data set there is a PointAID_FK and a PointBID_FK combination that have the same string representation

For example

×4 DataFrame
 Row │ PointAID_FK  PointBID_FK  Distance  Time
     │ Any          Any          Float64   Float64
─────┼─────────────────────────────────────────────
   1 │ 32210470     26402465271    206.12   346.51
   2 │ '32210470'   '26402465271'  206.12   346.51
   3 │ 26402159021  30461660       190.67   282.95
   4 │ 32761450     25501286700     77.28   117.18
   5 │ 25501310391  25501283840     34.41    20.7
   6 │ 26401205831  26402111690     23.67    17.77
 .
 .
 .

Would this table produce the same error. if I were to try use PointAID_FK and PointBID_FK as column names

Like so

unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance)

bkamins · May 14, 2024, 7:30am

Yes. I.e. the error says only about the dimension used for specification of target column names, but generally yes, you have in your data something like:

26
"26"

And looking at the element type of your column, which is Any, it might be the case (as usually Any signals that you have mixed values of different types in your column).

Do the following to see if you have this problem:

df2 = select(df, :some_problematic_column) # just keep the column of interest
unique(df2) # de-duplicate it based on the original values
df2.test = string.(df2.some_problematic_column) # now add string representation
filter(:nrow => >(1), combine(groupby(df2, :test), nrow))) # find values that have duplicate string representation

and you will see the list of values for which you have this situation.

rocco_sprmnt21 · May 14, 2024, 8:53am

Apart from the clarification of the problem that arose, it’s not clear to me what you would do with an unstacked Dataframe like this.
What I have seen done most typically in these cases is the opposite operation: stack.

Lee_Phillips · May 14, 2024, 9:27am

Its a big list of distance and times between points on a 2d plane

The plan is to use this as a pairwise distance matrix to run a traveling salesman optimization problem

I’m about 75% certain the extra processing time spent unstacking will result in a greater performance later when lots of distance look ups are needed between random pairs of points

Lee_Phillips · May 14, 2024, 10:33pm

Final Update from me

@bkamins was correct, it was a duplicated combination of ID’s

This was because my dataset was built from a concatenation of several other datasets, and there happened to be duplicated data between them

I solved this using two steps

Make sure all my ID’s are encoded as string data type, this can be done using the below command. The below command is running the string parsing operation on each value selected from the PointAID_FK column, then overwriting the same column with the result.

df1[!,:PointAID_FK] = string.(df1[!, :PointAID_FK])

The next step was to set the set the combine input argument to be ‘last’. This would mean that my unstack result would prefer whatever value would be the latest value seen from the input table

unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance, combine = last)

Topic		Replies	Views
Side effects of using keyword allowduplicates in the unstack function Data dataframes	1	559	October 19, 2021
Fill option in dataframe unstack not recognized General Usage dataframes	2	412	June 19, 2022
Best practice to unstack a dataframe with a lot of columnes? General Usage question , dataframes	5	1251	March 11, 2020
Unstack using row numbers Data question , dataframes	5	743	August 16, 2021
Dataframe unstack() column order General Usage dataframes	5	578	December 12, 2023

Unsure how to solve error message when applying unstack to DataFrame

Related topics