Filter dataframe with regular expression

Mastomaki · February 22, 2022, 8:13am

Hello,

I am trying to filter a dataframe based on values in certain column. This works fine when I use subset() with ByRow(in([“mytext”])). But I get problems with regular expressions:

df = subset(df, :mycol => x -> ByRow(occursin(r"mytext", x) ))

ERROR: MethodError: no method matching occursin(::Regex, ::Array{String31,1})
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at regex.jl:176
  occursin(::Regex, ::AbstractString; offset) at regex.jl:171

bkamins · February 22, 2022, 8:34am

subset(df, :mycol => ByRow(x -> occursin(r"mytext", x) ))

should work, or using DataFramesMeta.jl:

@rsubset(df, occursin(r"mytext", :mycol))

George_Githinji · March 1, 2022, 9:38am

Tried something similar through with a different dataframe and i get an error.

ERROR: LoadError: MethodError: no method matching occursin(::Regex, ::Missing)
Closest candidates are:
  occursin(::Regex, ::SubString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:269
  occursin(::Regex, ::AbstractString; offset) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/regex.jl:264
  occursin(::Any) at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/base/strings/search.jl:636

Wondering what i am missing

nilshg · March 1, 2022, 9:53am

You are (quite literally) missing some data (sorry couldn’t resist).

julia> occursin(r"abc", "abcde")
true

julia> occursin(r"abc", missing)
ERROR: MethodError: no method matching occursin(::Regex, ::Missing)

George_Githinji · March 1, 2022, 10:27am

Thanks! I have a large dataset of >2M records and I was of the idea that the target filter column was complete but yes it had a number of missing values! i have dropped the missing values and the subset works. PS: newish to Julia and DataFrames and migrating from R

nilshg · March 1, 2022, 10:49am

As a hint, most environments (REPL, VSCode, IJulia, Pluto…) will print the column type at the top of the table, underneath the column names. This will generally alert you to the presence of missing values either with a question mark or an explicit Union{Float64, missing} as column type:

julia> DataFrame(a = rand(5), b = rand([missing; 1:5], 5))
5×2 DataFrame
 Row │ a          b       
     │ Float64    Int64?  
─────┼────────────────────
   1 │ 0.403111         3
   2 │ 0.726926   missing 
   3 │ 0.891562         5
   4 │ 0.0909533        5
   5 │ 0.820027         4

Notice the question mark in Int64? for the second column. This doesn’t necessarily mean that there are missing values, as the type won’t change after you’ve filtered missings out, but it can at least prompt you to check!

bkamins · March 1, 2022, 11:06am

Use:

subset(df, :mycol => ByRow(x -> passmissing(occursin)(r"mytext", x)), skipmissing=true)

to make the operation return missing when the input has missing. Then by skipmissing=true you will skip them.

pdeffebach · March 1, 2022, 3:26pm

You can also use the @passmissing flag in a DataFramesMeta.jl transformation

@rsubset @passmissing occursin(r"mytext", :x)

zapiano · February 20, 2025, 2:44am

I’m not sure if I misunderstood the question, but I was trying to do a similar thing and I solved it by doing:

df[occursin.(r"mytext", df.col_name), :]

Not sure if the solution with subset performs better or if they are equivalent, but I tried the subset and it didn’t work for me…

Topic		Replies	Views
How would I write this Pandas column filter code in Julia? General Usage	5	1141	February 8, 2020
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24023	April 27, 2024
Function occursin() needs a method that accepts Union{String, Missing} General Usage	12	1780	August 28, 2018
Filtering a DataFrame column using Regex Data dataframes	2	1662	June 4, 2020
Subset does not contain certain string New to Julia question	2	637	November 26, 2021

Filter dataframe with regular expression

Related topics