DataFrame; create new column of integers representing strings in other column

Hi all,

Just a real bone question regarding DataFrames and how to create a new column containing integers to represent each unique string in another column. Sorry if my example is a bit rubbish, I’m new to Julia and not the best coder to begin with… Anyway, I have a DataFrame with a String31 column containing 53 unique site names. I am attempting to create another column containing integers (1:53), with each integer representing each unique site, if that makes sense. As an example (I don’t know how to add code):

test = DataFrame(site = ["site_one", "site_two", "site_three"], charcoal = [0.789, 0.232, 0.451])

gives me:

3×2 DataFrame
 Row │ site        charcoal
     │ String      Float64
─────┼──────────────────────
   1 │ site_one       0.789
   2 │ site_two       0.232
   3 │ site_three     0.451

So far, I have been trying:

test[:, :site_int] = map(test[:, :site]) do b
    if b == "site_one"
        1
    elseif b == "site_two"
        2
    elseif b == "site_three"
        3
    else
        missing
    end
end

And it seems to work, but I can’t work out how to loop it/vectorise it to number each of the 53 sites; I am guessing there is a more sophisticated way of achieving this than just typing out each site name and assigning a value.

I hope this makes sense and someone could give me a hand in working it out. Any help greatly appreciated.

Cheers

Gregg

Sorry, I forgot to give an example of what I want:

3×3 DataFrame
 Row │ site        charcoal  site_int
     │ String      Float64   Int64
─────┼────────────────────────────────
   1 │ site_one       0.789         1
   2 │ site_two       0.232         2
   3 │ site_three     0.451         3

The standard way to do it is to use CategoricalArrays.jl:

julia> using CategoricalArrays

julia> test.site_int = levelcode.(categorical(test.site))
3-element Vector{Int64}:
 1
 3
 2

Note that:

  1. by default levels are sorted in ascending order (but you can use any order you like - see the levels! function).
  2. It is likely that you actually want the CategoricalVector created with categorical(test.site) rather than an integer code. What do you want to use these integer codes for?
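As a rough sketch of point 1, this is one way the level order could be changed so the codes follow order of first appearance instead of alphabetical order. The `site` vector (with a repeated site added) and the names `cv` and `codes` are made up for illustration:

```julia
using CategoricalArrays

# Hypothetical data: one site appears twice.
site = ["site_one", "site_two", "site_three", "site_one"]

# By default the levels are sorted, so "site_three" comes before "site_two".
cv = categorical(site)

# Reorder the levels to follow first appearance in the data.
levels!(cv, unique(site))

# Integer code of each element under the new level order.
codes = levelcode.(cv)   # [1, 2, 3, 1]
```

The vector `cv` still carries the site names, so it can be kept alongside the integer codes if the labels are needed later.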

If you just want integer indices (and do not need/like CategoricalArrays.jl functionality) you can do:

julia> transform!(groupby(test, :site), groupindices => :site_int)
3×3 DataFrame
 Row │ site        charcoal  site_int
     │ String      Float64   Int64
─────┼────────────────────────────────
   1 │ site_one       0.789         1
   2 │ site_two       0.232         2
   3 │ site_three     0.451         3

or just

julia> uv = unique(test.site)
3-element Vector{String}:
 "site_one"
 "site_two"
 "site_three"

julia> dict = Dict(uv .=> 1:length(uv))
Dict{String, Int64} with 3 entries:
  "site_two"   => 2
  "site_one"   => 1
  "site_three" => 3

julia> [dict[x] for x in test.site]
3-element Vector{Int64}:
 1
 2
 3
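The dictionary approach above also handles repeated site names, which is the realistic case with 53 sites spread over many rows. A minimal sketch using only Base Julia, with a made-up `site` vector standing in for `test.site`:

```julia
# Hypothetical data: sites repeat across rows, as they would in a real data set.
site = ["site_one", "site_two", "site_three", "site_one", "site_three"]

# Unique names in order of first appearance.
uv = unique(site)

# Map each unique name to its position in `uv`.
dict = Dict(name => i for (i, name) in enumerate(uv))

# Look up the integer for every row.
site_int = [dict[name] for name in site]   # [1, 2, 3, 1, 3]
```

With a DataFrame the last line would presumably become `test.site_int = [dict[x] for x in test.site]`.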

Hi, that’s boss, cheers man. Yeah, the site names are not ordinal; I just want to represent them with a number. I am trying to recode some of my old hierarchical models from R in Julia/Turing. This model, in R, is a GLMM with site as a random effect, but when I try it in Turing with the site names as strings it fails on the index bounds check; I assumed it required an integer value, and it does work when I use that format.
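For what it's worth, here is a plain-Julia sketch (no Turing involved) of why integer codes are needed for a per-site effect: a vector of site-level values can be indexed by integers but not by strings. The names `site_effect` and `mu` and all the numbers are hypothetical:

```julia
# Hypothetical integer codes for four observations from three sites.
site_int = [1, 2, 3, 1]

# Hypothetical per-site effects, one entry per site.
site_effect = [0.1, -0.2, 0.3]

# Pick the effect belonging to each observation's site.
mu = site_effect[site_int]

# Indexing with the site name instead is an error in Julia.
string_index_fails = try
    site_effect["site_one"]
    false
catch
    true
end
```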

Just to let you know,

julia> transform!(groupby(test, :site), groupindices => :site_int)

didn’t work for me. It just returned the error code:

ERROR: ArgumentError: Unrecognized column selector: DataFrames.groupindices => :site_int

The bottom bit worked good though, cheers.

@spk - please update your DataFrames.jl installation. You must be on an old version of the package. With the DataFrames.jl 1.4 release it would work.
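A minimal sketch of how the version could be checked and updated from the REPL, assuming DataFrames.jl is already installed in the active environment:

```julia
using Pkg

# Show which DataFrames.jl version the active environment uses.
Pkg.status("DataFrames")

# Update DataFrames.jl to the newest version the environment allows.
Pkg.update("DataFrames")
```

If the version stays below 1.4 after this, another package in the environment may be holding it back.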