DataFrame: Cannot have duplicated names for indices

Hello everyone,

I’ve just spent 5 hours on what should be a simple issue: joining two dataframes.
I have two dataframes: one with a sector column and a value column, the other with new sectors mapped to the original sectors, along with their associated weights.

My issue is that I can’t perform a leftjoin on the sector column in my real example, because of the error message “Cannot have duplicated names for indices”.

I’m not able to produce an MWE that gives the same error message, but here is an example of what I want to do:

using DataFrames

df_initial = DataFrame(code = ["a", "b", "c"], country = ["AU", "AU", "AU"], value = [10, 68, 50])
# Build the join key as "code-country"
insertcols!(df_initial, :sector => string.(df_initial[:, :code], "-", df_initial[:, :country]))
df_weights = DataFrame(code = ["a", "a", "a", "b", "b", "c"], country = ["AU", "AU", "AU", "AU", "AU", "AU"], new_sector = ["new-1", "new-2", "new-3", "new-4", "new-2", "new-1"], weights = [0.2, 0.2, 0.6, 0.4, 0.6, 1])
insertcols!(df_weights, :sector => string.(df_weights[:, :code], "-", df_weights[:, :country]))

# The :sector keys in df_weights are not unique, and the join still works here
df_test = leftjoin(df_weights[:, [:sector, :new_sector, :weights]], df_initial[:, [:sector, :value]], on = :sector)

Do you have any idea what could cause a “duplicated names for indices” message?

You could add makeunique=true to the join call to see which columns clash.

Unfortunately it can’t help, I still have the same issue using makeunique = true.

Indeed, this could help if the issue was about column names, but it isn’t. It seems that duplicated keys for merging aren’t allowed.

I can see it by using

leftjoin(df, df_2, on = :sector, validate = (true, true))

which gives the following error (in my real example):

Merge key(s) in df1 are not unique. df1 contains 1452 duplicate keys: (sector = "AUT-A01",), ..., (sector = "ROW-J59_J60",).

But I really don’t understand why I can’t have nonunique keys in this real example while it is allowed in my MWE above…

Can you please show the full stack trace? This error does not come from DataFrames.jl, but from the package that provides the vectors for your columns. Most likely some of your columns come from https://github.com/davidavdav/NamedArrays.jl, and this causes the error (such columns do not allow duplicate rows).
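To check whether this applies to your data, you can look at the storage type of each column. This is only a minimal sketch: the names df and named_cols are made up for illustration, and it assumes the DataFrame constructor keeps the NamedArray type of a column it is given.

```julia
using DataFrames
using NamedArrays

# Illustrative data: a one-element NamedArray used to build a DataFrame
initial = NamedArray([5748.61], ["AUT-A01"])
df = DataFrame(:sector => names(initial, 1), :value => initial[:, 1])

# Print the concrete type of every column
for (name, col) in pairs(eachcol(df))
    println(name, " => ", typeof(col))
end

# Collect the names of the columns that are still backed by a NamedArray
named_cols = [name for (name, col) in pairs(eachcol(df)) if col isa NamedArray]
```

Any column listed in named_cols is a candidate for the error, because a join may need to repeat its rows.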

You’re right! Some of my columns come from a NamedArray initially.

Here is an MWE that reproduces the error message I get:


using DataFrames
using NamedArrays

initial = NamedArray([5748.61], ["AUT-A01"])
df_initial = DataFrame(:sector => names(initial,1), :value => initial[:,1])
df_weights = DataFrame(
    sector = ["AUT-A01", "AUT-A01", "AUT-A01", "AUT-A01", "AUT-A01"],
    new = ["a", "b", "c", "d", "e"],
    weights = [0.2, 0.2, 0.4, 0.1, 0.1]
)

test = leftjoin(df_weights, df_initial, on = :sector)

You need to convert these columns to Vector, because the join has to repeat rows and so produces duplicates.
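If the DataFrame already exists, one way to do the conversion is to overwrite each column with a plain copy. This is a sketch rather than the only option: it assumes collect returns an ordinary Vector for a NamedArray column (it relies on the generic AbstractArray fallback), and the loop simply converts every column rather than only the affected ones.

```julia
using DataFrames
using NamedArrays

initial = NamedArray([5748.61], ["AUT-A01"])
df_initial = DataFrame(:sector => names(initial, 1), :value => initial[:, 1])

# Overwrite every column with a plain Vector copy (assumed to drop the NamedArray wrapper)
for name in names(df_initial)
    df_initial[!, name] = collect(df_initial[!, name])
end

df_weights = DataFrame(
    sector = fill("AUT-A01", 5),
    new = ["a", "b", "c", "d", "e"],
    weights = [0.2, 0.2, 0.4, 0.1, 0.1]
)

# With plain Vector columns, the repeated :sector keys no longer clash with unique element names
test = leftjoin(df_weights, df_initial, on = :sector)
```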


Thank you so much!

For anyone wondering, here is the MWE with the fix applied, thanks to @bkamins:


using DataFrames
using NamedArrays

initial = NamedArray([5748.61], ["AUT-A01"])
df_initial = DataFrame(:sector => vec(names(initial,1)), :value => vec(initial[:,1]))
df_weights = DataFrame(
    sector = ["AUT-A01", "AUT-A01", "AUT-A01", "AUT-A01", "AUT-A01"],
    new = ["a", "b", "c", "d", "e"],
    weights = [0.2, 0.2, 0.4, 0.1, 0.1]
)

test = leftjoin(df_weights, df_initial, on = :sector)
