Hello All
I have a dataframe, the first 5 rows are shown below
5×4 DataFrame
Row │ PointAID_FK PointBID_FK Distance Time
│ Any Any Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 32210470 26402465271 206.12 346.51
2 │ 26402159021 30461660 190.67 282.95
3 │ 32761450 25501286700 77.28 117.18
4 │ 25501310391 25501283840 34.41 20.7
5 │ 26401205831 26402111690 23.67 17.77
.
.
.
And Id like to unstack it into a massive array
When I run the unstack command as follows
unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance)
I get the following error
ERROR: LoadError: ArgumentError: Non-unique column names produced. Non equal values in `colkey` were mapped to the same column name.
After trying to find the potential source of this error message I have no results.
I am completely at a loss for what the potential problem might be
I’m open to any amount of help you can give
The error message tells you that two different values in the source data have the same string representation, so they produce duplicates in column names.
Here is a minimal example:
julia> df = DataFrame(x=[0, "0"])
2×1 DataFrame
Row │ x
│ Any
─────┼─────
1 │ 0
2 │ 0
julia> unstack(df, :x, :x, :x)
ERROR: ArgumentError: Non-unique column names produced. Non equal values in `colkey` were mapped to the same column name.
And the source of the problem is that 0
(number) and string "0"
have the same string representation ("0"
).
1 Like
Thanks for the reply
I’ll be honest, I’m still a bit confused by your reply
Does this mean that somewhere in my data set there is a PointAID_FK and a PointBID_FK combination that have the same string representation
For example
×4 DataFrame
Row │ PointAID_FK PointBID_FK Distance Time
│ Any Any Float64 Float64
─────┼─────────────────────────────────────────────
1 │ 32210470 26402465271 206.12 346.51
2 │ '32210470' '26402465271' 206.12 346.51
3 │ 26402159021 30461660 190.67 282.95
4 │ 32761450 25501286700 77.28 117.18
5 │ 25501310391 25501283840 34.41 20.7
6 │ 26401205831 26402111690 23.67 17.77
.
.
.
Would this table produce the same error. if I were to try use PointAID_FK and PointBID_FK as column names
Like so
unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance)
Yes. I.e. the error says only about the dimension used for specification of target column names, but generally yes, you have in your data something like:
26
"26"
And looking at the element type of your column, which is Any
, it might be the case (as usually Any
signals that you have mixed values of different types in your column).
Do the following to see if you have this problem:
df2 = select(df, :some_problematic_column) # just keep the column of interest
unique(df2) # de-duplicate it based on the original values
df2.test = string.(df2.some_problematic_column) # now add string representation
filter(:nrow => >(1), combine(groupby(df2, :test), nrow))) # find values that have duplicate string representation
and you will see the list of values for which you have this situation.
1 Like
Apart from the clarification of the problem that arose, it’s not clear to me what you would do with an unstacked Dataframe like this.
What I have seen done most typically in these cases is the opposite operation: stack.
1 Like
Its a big list of distance and times between points on a 2d plane
The plan is to use this as a pairwise distance matrix to run a traveling salesman optimization problem
I’m about 75% certain the extra processing time spent unstacking will result in a greater performance later when lots of distance look ups are needed between random pairs of points
Final Update from me
@bkamins was correct, it was a duplicated combination of ID’s
This was because my dataset was built from a concatenation of several other datasets, and there happened to be duplicated data between them
I solved this using two steps
Make sure all my ID’s are encoded as string data type, this can be done using the below command. The below command is running the string parsing operation on each value selected from the PointAID_FK column, then overwriting the same column with the result.
df1[!,:PointAID_FK] = string.(df1[!, :PointAID_FK])
The next step was to set the set the combine input argument to be ‘last’. This would mean that my unstack result would prefer whatever value would be the latest value seen from the input table
unstack(dataframe, :PointAID_FK, :PointBID_FK, :Distance, combine = last)