Filter dataframe with regular expression

Hello,

I am trying to filter a dataframe based on values in certain column. This works fine when I use subset() with ByRow(in([“mytext”])). But I get problems with regular expressions:

df = subset(df, :mycol => x -> ByRow(occursin(r"mytext", x) ))

ERROR: MethodError: no method matching occursin(::Regex, ::Array{String31,1})
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at regex.jl:176
  occursin(::Regex, ::AbstractString; offset) at regex.jl:171
subset(df, :mycol => ByRow(x -> occursin(r"mytext", x) ))

should work, or using DataFramesMeta.jl:

@rsubset(df, occursin(r"mytext", :mycol))

Tried something similar through with a different dataframe and i get an error.

ERROR: LoadError: MethodError: no method matching occursin(::Regex, ::Missing)
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:269
  occursin(::Regex, ::AbstractString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:264
  occursin(::Any) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/strings/search.jl:636

Wondering what i am missing

You are (quite literally) missing some data (sorry couldn’t resist).

julia> occursin(r"abc", "abcde")
true

julia> occursin(r"abc", missing)
ERROR: MethodError: no method matching occursin(::Regex, ::Missing)

Thanks! I have a large dataset of >2M records and I was of the idea that the target filter column was complete but yes it had a number of missing values! i have dropped the missing values and the subset works. PS: newish to Julia and DataFrames and migrating from R :slight_smile:

As a hint, most environments (REPL, VSCode, IJulia, Pluto…) will print the column type at the top of the table, underneath the column names. This will generally alert you to the presence of missing values either with a question mark or an explicit Union{Float64, missing} as column type:

julia> DataFrame(a = rand(5), b = rand([missing; 1:5], 5))
5×2 DataFrame
 Row │ a          b       
     │ Float64    Int64?  
─────┼────────────────────
   1 │ 0.403111         3
   2 │ 0.726926   missing 
   3 │ 0.891562         5
   4 │ 0.0909533        5
   5 │ 0.820027         4

Notice the question mark in Int64? for the second column. This doesn’t necessarily mean that there are missing values, as the type won’t change after you’ve filtered missings out, but it can at least prompt you to check!

Use:

subset(df, :mycol => ByRow(x -> passmissing(occursin)(r"mytext", x)), skipmissing=true)

to make the operation return missing when the input has missing. Then by skipmissing=true you will skip them.

You can also use the @passmissing flag in a DataFramesMeta.jl transformation

@rsubset @passmissing occursin(r"mytext", :x)

I’m not sure if I misunderstood the question, but I was trying to do a similar thing and I solved it by doing:

df[occursin.(r"mytext", df.col_name), :]

Not sure if the solution with subset performs better or if they are equivalent, but I tried the subset and it didn’t work for me…