Weird behaviour when replacing the empty string in a row with DataFramesMeta.jl

I have this snippet:

using DataFramesMeta  # re-exports DataFrames and @chain

string_number_df2 = DataFrame(
    A = ["1", "", "3"],
    B = ["4", "5", ""],
)

result = @chain string_number_df2 begin
    @rtransform :B = replace(:B, "" => missing)
end

Expectation:

A   |   B
"1" |  "4"
""  |  "5"
"3" | missing

but the actual behavior is different. It seems the replace function parses "4" as "" * "4" * "" and then replaces each empty string with the string "missing". I can’t understand why the function works that way.

This is not related to DataFrames.jl. It is how replace works:

julia> replace("I", "" => missing)
"missingImissing"
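A small sketch of why this happens, using only Base functions: an empty pattern matches at every character boundary of a string, and a non-string replacement such as missing is inserted in its printed form. The input "abc" is just an illustrative value.

```julia
# An empty pattern matches before, between, and after the characters.
dashed = replace("abc", "" => "-")      # "-a-b-c-"

# missing prints as the text "missing", and that text is what gets inserted.
printed = string(missing)               # "missing"

# Combining both effects reproduces the surprising result from the question.
surprise = replace("4", "" => missing)  # "missing4missing"
```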

Instead do:

julia> result = @chain string_number_df2 begin
           @transform :B = replace(:B, "" => missing)
       end
3×2 DataFrame
 Row │ A       B
     │ String  String?
─────┼─────────────────
   1 │ 1       4
   2 │         5
   3 │ 3       missing
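If you would rather keep the row-wise macro, one possible alternative (a sketch, not the only way) is to test each value explicitly instead of calling replace on it. The name df is a hypothetical stand-in for the data frame in the question.

```julia
using DataFramesMeta  # re-exports DataFrames and @chain

df = DataFrame(A = ["1", "", "3"], B = ["4", "5", ""])

result = @chain df begin
    # Row-wise, :B is a single String, so check it directly
    # rather than searching inside it with replace.
    @rtransform :B = isempty(:B) ? missing : :B
end
```

The resulting column B should then hold "4", "5", and missing, the same as with the column-wise @transform version.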

I’m new to Julia, and I’m used to programming in Python where everything is an object. I used to think that I could convert a string object to a missing object, so I was confused when I encountered the string "missing4missing" in Julia.

If you use replace on a collection of strings instead (the entire dataframe column), then the result is what you expected:

using DataFrames
df = DataFrame(A = ["1", "", "3"], B = ["4", "5", ""])
df.B = replace(df.B, "" => missing)
df

It is the same in Julia. You just need to notice that replace has two methods (quoting its docstrings):

  replace(A, old_new::Pair...; [count::Integer])

Return a copy of collection A where, for each pair old=>new in old_new,
all occurrences of old are replaced by new. Equality is
determined using isequal.
If count is specified, then replace at most count occurrences in total.

and

  replace(s::AbstractString, pat=>r, [pat2=>r2, ...]; [count::Integer])

Search for the given pattern pat in s, and replace each occurrence with r.
If count is provided, replace at most count occurrences.

You wanted to call the first (which works on collections like a vector), but you used @rtransform, which applies replace to each individual string instead of to the column as a whole (the r prefix in @rtransform means the operation is row-wise).
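The difference between the two methods can be seen without any data frame at all. The sketch below uses a plain vector v (a hypothetical stand-in for column B) and calls replace once on the whole collection and once per element, which is roughly what a row-wise macro ends up doing.

```julia
v = ["4", "5", ""]

# Collection method: elements that are isequal to "" are swapped for missing.
whole = replace(v, "" => missing)
# whole is ["4", "5", missing]

# String method, applied to each element separately:
# the empty pattern is searched for inside every string.
per_element = map(s -> replace(s, "" => missing), v)
# per_element[1] is "missing4missing"
```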
