Can DataFramesMeta replace dummy values for all columns of a specific type

I’ve been trying to figure out a way to have DataFramesMet perform this operation:

transform!(df2, names(df2, String) .=> (t -> replace(t, ""=>missing)), renamecols=false)

General: Replace dummy value for all columns of a specific type with another value.

This statement selects all string columns and replaces empty string values with missing. It took me a bit to figure this out as I’m still getting familiar with DataFrames and DataFramesMeta and I believe this command is from DataFrames. I would like to use DataFramesMeta to do the same to take advantage of that package’s macro syntax.

Can it be done?

SentinelArrays.jl may be useful here

julia> df = DataFrame(x = [1,2,3,4], y = ["a", "", "c", "d"])
4×2 DataFrame
 Row │ x      y      
     │ Int64  String 
─────┼───────────────
   1 │     1  a
   2 │     2
   3 │     3  c
   4 │     4  d


julia> df.y = SentinelArray(df.y, "", missing)
4-element SentinelVector{String, String, Missing, Vector{String}}:
 "a"
 missing
 "c"
 "d"

julia> df
4×2 DataFrame
 Row │ x      y       
     │ Int64  String? 
─────┼────────────────
   1 │     1  a
   2 │     2  missing 
   3 │     3  c
   4 │     4  d

Thanks for the reply. Can this be applied to all columns in the df of a specific type at once?

You’re almost there: what you’re missing is just this .

julia> df
10×3 DataFrame
 Row │ col1  col2  c3     
     │ Char  Char  String
─────┼────────────────────
   1 │ c     H
   2 │ f     D
   3 │ b     E
   4 │ j     L
   5 │ o     M
   6 │ i     D
   7 │ l     J
   8 │ d     M
   9 │ d     D
  10 │ b     F

julia>                names(df, String)
1-element Vector{String}:
 "c3"

julia> transform!(df, names(df, String) .=> (t -> replace.(t, ""=>missing)), renamecols=false)
10×3 DataFrame
 Row │ col1  col2  c3      
     │ Char  Char  String
─────┼─────────────────────
   1 │ c     H     missing
   2 │ f     D     missing
   3 │ b     E     missing
   4 │ j     L     missing
   5 │ o     M     missing
   6 │ i     D     missing
   7 │ l     J     missing
   8 │ d     M     missing
   9 │ d     D     missing
  10 │ b     F     missing

The functions in the DataFrames package syntax are applied to the column vector(s) you pass.
If you want to apply the function per row you must wrap ByRow inside or use a function (possibly composed) that acts elementwise

Although I don’t know why there is a string instead of the missing value.
but someone else explains this to us.

I actually started with that because I thought I’d need that “.” in “replace.”, but it returns odd values, i.e. “ABC” in the field comes back as “missingAmissingBmissingCmissing” using that exact line of code. I found that removing the “.” in “replace.” returns the values correctly and replaces the “” with missing as desired.

So, I have it working as is, which is good. However, I was hoping to use DataFramesMeta @transform! if possible but I haven’t figured out how to do it yet.

try this way

julia> df = DataFrame(col1 = rand('a':'p',10), col2 = rand('A':'P',10),c3="")
10×3 DataFrame
 Row │ col1  col2  c3     
     │ Char  Char  String
─────┼────────────────────
   1 │ e     H
   2 │ h     A
   3 │ b     I
   4 │ o     O
   5 │ e     J
   6 │ e     H
   7 │ l     A
   8 │ i     M
   9 │ e     P
  10 │ n     K

julia> df.c3[[1,3,5]].="cc"
3-element view(::Vector{String}, [1, 3, 5]) with eltype String:
 "cc"
 "cc"
 "cc"

julia> df
10×3 DataFrame
 Row │ col1  col2  c3     
     │ Char  Char  String
─────┼────────────────────
   1 │ e     H     cc
   2 │ h     A
   3 │ b     I     cc
   4 │ o     O
   5 │ e     J     cc
   6 │ e     H
   7 │ l     A
   8 │ i     M
   9 │ e     P
  10 │ n     K

julia> transform!(df, names(df, String) .=> (t -> replace(Union{Missing,String}[t...], ""=>missing)), renamecols=false)
10×3 DataFrame
 Row │ col1  col2  c3      
     │ Char  Char  String?
─────┼─────────────────────
   1 │ e     H     cc
   2 │ h     A     missing
   3 │ b     I     cc
   4 │ o     O     missing
   5 │ e     J     cc
   6 │ e     H     missing
   7 │ l     A     missing
   8 │ i     M     missing
   9 │ e     P     missing
  10 │ n     K     missing
julia> transform(groupby(df,names(df, String)), names(df, String) .=> t->replace(t,""=>missing))
10×6 DataFrame
 Row │ col1   col2   c3      c4      c3_function  c4_function 
     │ Int64  Int64  String  String  String?      String?
─────┼────────────────────────────────────────────────────────
   1 │     7      8  c3      c4      c3           c4
   2 │     5      5  c3              c3           missing
   3 │     2      8          c4      missing      c4
   4 │     6      1  c3      c4      c3           c4
   5 │     9     10  c3      c4      c3           c4
   6 │    10      2  c3              c3           missing
   7 │     2     10  c3      c4      c3           c4
   8 │     8     10  c3      c4      c3           c4
   9 │     6      9          c4      missing      c4
  10 │     1      9                  missing      missing

you could try this subterfuge too, although it costs you extra work to rename the columns.

But I’m just bringing this up to clarify that it doesn’t work if you use renamecols=false

julia> df = DataFrame(col1 = rand(1:10,10), col2 = rand(1:10,10),c3=rand(["","c3"],10),c4=rand(["","c4"],10))
10×4 DataFrame
 Row │ col1   col2   c3      c4     
     │ Int64  Int64  String  String
─────┼──────────────────────────────
   1 │     7      8  c3      c4
   2 │     5      5  c3
   3 │     2      8          c4
   4 │     6      1  c3      c4
   5 │     9     10  c3      c4
   6 │    10      2  c3
   7 │     2     10  c3      c4
   8 │     8     10  c3      c4
   9 │     6      9          c4
  10 │     1      9

gstr=groupby(df,names(df, String))

julia> transform(groupby(df,names(df, String)), names(df, String) .=> t->replace(t,""=>missing), renamecols=false)
ERROR: ArgumentError: column :c3 in returned data frame is not equal to grouping key :c3

solved adding ,keepkeys=false keywarg:

julia> transform(groupby(df,names(df, String)), names(df, String) .=> t->replace(t,""=>missing), renamecols=false,keepkeys=false)
10×4 DataFrame
 Row │ col1   col2   c3       c4      
     │ Int64  Int64  String?  String?
─────┼────────────────────────────────
   1 │     7      8  c3       c4
   2 │     5      5  c3       missing
   3 │     2      8  missing  c4
   4 │     6      1  c3       c4
   5 │     9     10  c3       c4
   6 │    10      2  c3       missing
   7 │     2     10  c3       c4
   8 │     8     10  c3       c4
   9 │     6      9  missing  c4
  10 │     1      9  missing  missing

It can’t really be done with DataFramesMeta.jl, to be honest. You could do a for loop

julia> df = DataFrame(a = ["x1", ""], b = ["", "z2"], c = [1, 2]);

julia> for n in names(df, AbstractString)
           @transform! df $n = replace($n, "" => missing)
       end

julia> df
2×3 DataFrame
 Row │ a        b        c     
     │ String?  String?  Int64 
─────┼─────────────────────────
   1 │ x1       missing      1
   2 │ missing  z2           2

You just want replace(...) not replace.(...). No ..

Thanks, all, for the replies. I’m going to take the answer from @pdeffebach as the solution since that gave confirmation it couldn’t be done in DataFramesMeta.

All the other responses gave me some good info to work off of, though.

This does not work

transform!(df, names(df, String) .=> (t ->replace(t, ""=>missing)), renamecols=false)

Why does this work?

transform!(df, names(df, AbstractString) .=> (t ->replace(t, ""=>missing)), renamecols=false)

Perhaps you tried them one after the other and the mutate the type of the columns? Because they seem pretty similar.

In my test, both forms work.

:+1:

Well… I actually trusted what I understood about the OP.
I understood what the need was but I didn’t understand that he wanted to solve it with a specific tool.
So I started looking for alternative solutions, without experiencing the error I had imagined :grin: