Easiest quickest way to search DataFrame w/wildcards

Is there an easy straightforward already existing way to search a DataFrame with a String wildcard in a specific column, and then print the found entries/rows?

For example, you want to search if a certain name is in the data, but don’t remember it exactly. You wish to search with an approximate wildcard, such as *batman* (where the stars in this denote the wildcards).

If I can avoid regex, by the way, that’s good.

[c for c in names(dd) if occursin("batman", string(c))]? For any more complex patterns I don’t see how you get by without regex.

You also might have noticed that DataFrames has a convenience function that improves error messages for typos of column names - if the column you’re requesting is close to the name of an existing column it will suggest that similarly named column. So if it’s really just about not exactly remembering a name you can just try an approximate name and let the error message help you.

Something like this?

julia> df = DataFrame(:a => ["foo", "bar", "baz"])
3Γ—1 DataFrame
β”‚ Row β”‚ a      β”‚
β”‚     β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ foo    β”‚
β”‚ 2   β”‚ bar    β”‚
β”‚ 3   β”‚ baz    β”‚

julia> df[findall(x -> occursin("ba", x), df[!, :a]), :]
2Γ—1 DataFrame
β”‚ Row β”‚ a      β”‚
β”‚     β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ bar    β”‚
β”‚ 2   β”‚ baz    β”‚
1 Like

Ah sorry I read the question as asking about the name of a column rather than the data within a column. That said, why don’t you broadcast occursin like

julia> df[occursin.("ba", df.a), :]
2Γ—1 DataFrame
β”‚ Row β”‚ a      β”‚
β”‚     β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ bar    β”‚
β”‚ 2   β”‚ baz    β”‚
3 Likes

Thanks, both. Very useful.

The use of occursin came to mind, but now you’ve shown me how to use it in this particular context.

Standing now in this case on your shoulders, I’m wondering if I can make a general quick function with this information. If I have time, I hope to do so.

It can be generalized a little

mysearch(smt, somewhere) = false
mysearch(smt::AbstractString, somewhere::AbstractString) = occursin(smt, somewhere)

findsomethingsomewhere(smt, df) =
    df[reduce((x, y) -> x .| mysearch.(smt, df[!, y]), names(df);
        init = BitArray([false for _ in 1:nrow(df)])), :]
julia> df = DataFrame(:a => ["foo", "bar", "baz", "baa"], :b => 1:4, :c => ["a", "b", "c", "ar"])
4Γ—3 DataFrame
β”‚ Row β”‚ a      β”‚ b     β”‚ c      β”‚
β”‚     β”‚ String β”‚ Int64 β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ foo    β”‚ 1     β”‚ a      β”‚
β”‚ 2   β”‚ bar    β”‚ 2     β”‚ b      β”‚
β”‚ 3   β”‚ baz    β”‚ 3     β”‚ c      β”‚
β”‚ 4   β”‚ baa    β”‚ 4     β”‚ ar     β”‚

julia> julia> findsomethingsomewhere("ar", df)
2Γ—3 DataFrame
β”‚ Row β”‚ a      β”‚ b     β”‚ c      β”‚
β”‚     β”‚ String β”‚ Int64 β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ bar    β”‚ 2     β”‚ b      β”‚
β”‚ 2   β”‚ baa    β”‚ 4     β”‚ ar     β”‚
2 Likes

Here is a somewhat more readable function

function findsomethingsomewhere(smt, df)
    positives = fill(false, nrow(df))
    for col in eachcol(df)
        if eltype(col) <: AbstractString
            positives .= positives .| occursin.(smt, col)
        end 
    end
    df[positives, :]
end
3 Likes

Great. Thank you both. That will be useful.