Filter dataframe with regular expression

Hello,

I am trying to filter a dataframe based on values in certain column. This works fine when I use subset() with ByRow(in([“mytext”])). But I get problems with regular expressions:

df = subset(df, :mycol => x -> ByRow(occursin(r"mytext", x) ))

ERROR: MethodError: no method matching occursin(::Regex, ::Array{String31,1})
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at regex.jl:176
  occursin(::Regex, ::AbstractString; offset) at regex.jl:171
subset(df, :mycol => ByRow(x -> occursin(r"mytext", x) ))

should work, or using DataFramesMeta.jl:

@rsubset(df, occursin(r"mytext", :mycol))
4 Likes

Tried something similar through with a different dataframe and i get an error.

ERROR: LoadError: MethodError: no method matching occursin(::Regex, ::Missing)
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:269
  occursin(::Regex, ::AbstractString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:264
  occursin(::Any) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/strings/search.jl:636

Wondering what i am missing

You are (quite literally) missing some data (sorry couldn’t resist).

julia> occursin(r"abc", "abcde")
true

julia> occursin(r"abc", missing)
ERROR: MethodError: no method matching occursin(::Regex, ::Missing)

Thanks! I have a large dataset of >2M records and I was of the idea that the target filter column was complete but yes it had a number of missing values! i have dropped the missing values and the subset works. PS: newish to Julia and DataFrames and migrating from R :slight_smile:

1 Like

As a hint, most environments (REPL, VSCode, IJulia, Pluto…) will print the column type at the top of the table, underneath the column names. This will generally alert you to the presence of missing values either with a question mark or an explicit Union{Float64, missing} as column type:

julia> DataFrame(a = rand(5), b = rand([missing; 1:5], 5))
5×2 DataFrame
 Row │ a          b       
     │ Float64    Int64?  
─────┼────────────────────
   1 │ 0.403111         3
   2 │ 0.726926   missing 
   3 │ 0.891562         5
   4 │ 0.0909533        5
   5 │ 0.820027         4

Notice the question mark in Int64? for the second column. This doesn’t necessarily mean that there are missing values, as the type won’t change after you’ve filtered missings out, but it can at least prompt you to check!

1 Like

Use:

subset(df, :mycol => ByRow(x -> passmissing(occursin)(r"mytext", x)), skipmissing=true)

to make the operation return missing when the input has missing. Then by skipmissing=true you will skip them.

2 Likes

You can also use the @passmissing flag in a DataFramesMeta.jl transformation

@rsubset @passmissing occursin(r"mytext", :x)
1 Like