Function occursin() needs a method that accepts Union{String, Missing}

I have a dataframe in which I need to filter on political party. But, because the data sometimes comes from noisy sources, garbage characters may be appended to a typical party name. So :Party .== “Democrat” (or any other party name) may fail as the string might be “Democrat-----”. So, I was using occursin to do the compare. Seems like nearly every function will need to handle missing type. A lot of work. The profusion of types serves a necessary purpose, but creates a lot of work.

Solved with this:

    dfvec = filter(row ->  0 < row[:Votecount] < 4000 && occursin("Democrat", row[:Party]),dg)

For some reason, the lambda function seems to change the type? Or, the problem was broadcasting with occursin.(stuff). In any case, this works and the size of the dataframes is such that this is not performance sensitive: from 1500 to 2200 rows.

I will also fix all of the strings: just trying to identify data cleansing chores in one function that tests each datafile.

1 Like

You need to make an explicit decision about missing here, occursin in particular cannot tell because you may decide that something cannot occur in a missing value (false) or propagate that as missing. You could do

(party = row[:Party]; !ismissing(party) && occursin("Democrat", party))

or similar.

BTW, one of the query/meta frameworks for DataFrames could lead to a more elegant code.

You can write:

occursin.("Democrat", coalesce.(row.Party, ""))

if you want to get a vector of Bool flags with false if row.Party entry is missing (is this what you wanted).

2 Likes

Great thanks.

I tried Query, but it won’t compile under 0.7.

This would be fast and very awesome and would accomplish what I wanted for the entire dataframe. In addition to finding the missings, I need an array of indices or bools that tells me which rows don’t have missings, which your suggestion provides via bools.

Took me a second to try some examples and understand what coalesce does.

I used the filter approach with a lambda function, which was fast enough for 2000 row dataframes. I’ll go back and try your suggestion.

Maybe both DataFramesMeta and JuliaDBMeta could provide the following macro: @where expr skipmissing=true that detects which fields are needed to evaluate the expression (based on the symbols you used) and only evaluates the expression on rows for which these fields are not missing. Maybe using some crazy new tool like Cassette it’s possible to infer what fields are used without macros but I have no idea how one would go about that. In JuliaDBMeta OTOH this should be pretty straightforward to do.

You mean like stata’s if?

gen t = x + 1 if !ismissing(x)
@with df begin 
df.t = @where !missing(:x) : :x + 1

I’d like to make that a bit more user-friendly. What I meant is that for example in JuliaDBMeta all row-wise macros like for example @map iris :SepalLength / :SepalWidth expand to something like map(t -> t.SepalLenght / t.SepalWidth, iris, select = (:SepalLength, :SepalWidth)). What I was proposing (in JuliaDB syntax) would be, if the user passes skipmissing, to expand to:

sel= dropna(select(iris, (:SepalLength, :SepalWidth)) # or an alternative that doesn't allocate, to remove missing values
map(t -> t.SepalLenght / t.SepalWidth, sel)

A generic solution is good, especially to avoid so many dropna = T from R. I think lift might handle this better, since it wouldn’t drop missing values. Maybe when lift is implemented there could be a @lift macro like @. that wraps every function call in a lift function.

I like current behavior too, though since it reads like a sentence. Though i can’t vouch for its efficiency.

t = [ismissing(x) ? missing : f(x) for x in vec]

Since DataFrame is a lightweight wrapper in most cases:

completecases(df[your_cols])

should be fast enough, where your_cols if a list of columns with which you want to work. If omitted then you get indicators for whole rows.

Edit: of course this is a solution if you need to analyze more than one column. With one column broadcast ismissing. Also if you simply want to get a subset of rows (not a vector of indicators) use dropmissing.

Do you mean something like this would become idiomatic?

f operates on a DataFrameRow

df[:t] = [completcases(df[my_suubset])[i] ? f(df[my_subset, i]) : missing for i in nrow(df)]

It’s not clear to me how one would do this in DataFramesMeta, if you have any insights I would appreciate it.

This was a more general comment how to identify rows with non-missing data and I agreed with you that if you have one column then ismissing is natural to use.

For a single column I like what you have proposed, and it should nicely work with DataFramesMeta like (this is probably not so useful as a stand alone command, but it would work nicely with @linq):

df = DataFrame(source=[1,2,3])
@transform(df, target = [ismissing(x) ? missing : sin(x) for x in :source])
1 Like

Here https://github.com/JuliaLang/julia/pull/26661#issuecomment-416519880 I have updated the PR regarding lift which what I think that would be nice to have. So if anyone is interested please comment. With this functionality we could simply write things like:

@transform(df, newcol=liftmissing(some_function).(:old_col1, :old_col2, :old_col3))

and if any of old_col1, old_col2, old_col3 contained missing we would get a missing in result.