Function occursin() needs a method that accepts Union{String, Missing}

lewis · August 26, 2018, 9:53pm

I have a dataframe in which I need to filter on political party. But, because the data sometimes comes from noisy sources, garbage characters may be appended to a typical party name. So :Party .== “Democrat” (or any other party name) may fail as the string might be “Democrat-----”. So, I was using occursin to do the compare. Seems like nearly every function will need to handle missing type. A lot of work. The profusion of types serves a necessary purpose, but creates a lot of work.

Solved with this:

    dfvec = filter(row ->  0 < row[:Votecount] < 4000 && occursin("Democrat", row[:Party]),dg)

For some reason, the lambda function seems to change the type? Or, the problem was broadcasting with occursin.(stuff). In any case, this works and the size of the dataframes is such that this is not performance sensitive: from 1500 to 2200 rows.

I will also fix all of the strings: just trying to identify data cleansing chores in one function that tests each datafile.

Tamas_Papp · August 27, 2018, 6:05am

You need to make an explicit decision about missing here, occursin in particular cannot tell because you may decide that something cannot occur in a missing value (false) or propagate that as missing. You could do

(party = row[:Party]; !ismissing(party) && occursin("Democrat", party))

or similar.

BTW, one of the query/meta frameworks for DataFrames could lead to a more elegant code.

bkamins · August 27, 2018, 6:34am

You can write:

occursin.("Democrat", coalesce.(row.Party, ""))

if you want to get a vector of Bool flags with false if row.Party entry is missing (is this what you wanted).

lewis · August 27, 2018, 12:51pm

Great thanks.

I tried Query, but it won’t compile under 0.7.

lewis · August 27, 2018, 1:07pm

This would be fast and very awesome and would accomplish what I wanted for the entire dataframe. In addition to finding the missings, I need an array of indices or bools that tells me which rows don’t have missings, which your suggestion provides via bools.

Took me a second to try some examples and understand what coalesce does.

I used the filter approach with a lambda function, which was fast enough for 2000 row dataframes. I’ll go back and try your suggestion.

piever · August 27, 2018, 1:18pm

Maybe both DataFramesMeta and JuliaDBMeta could provide the following macro: @where expr skipmissing=true that detects which fields are needed to evaluate the expression (based on the symbols you used) and only evaluates the expression on rows for which these fields are not missing. Maybe using some crazy new tool like Cassette it’s possible to infer what fields are used without macros but I have no idea how one would go about that. In JuliaDBMeta OTOH this should be pretty straightforward to do.

pdeffebach · August 27, 2018, 1:35pm

You mean like stata’s if?

gen t = x + 1 if !ismissing(x)

@with df begin 
df.t = @where !missing(:x) : :x + 1

piever · August 27, 2018, 1:42pm

I’d like to make that a bit more user-friendly. What I meant is that for example in JuliaDBMeta all row-wise macros like for example @map iris :SepalLength / :SepalWidth expand to something like map(t -> t.SepalLenght / t.SepalWidth, iris, select = (:SepalLength, :SepalWidth)). What I was proposing (in JuliaDB syntax) would be, if the user passes skipmissing, to expand to:

sel= dropna(select(iris, (:SepalLength, :SepalWidth)) # or an alternative that doesn't allocate, to remove missing values
map(t -> t.SepalLenght / t.SepalWidth, sel)

pdeffebach · August 27, 2018, 2:12pm

A generic solution is good, especially to avoid so many dropna = T from R. I think lift might handle this better, since it wouldn’t drop missing values. Maybe when lift is implemented there could be a @lift macro like @. that wraps every function call in a lift function.

I like current behavior too, though since it reads like a sentence. Though i can’t vouch for its efficiency.

t = [ismissing(x) ? missing : f(x) for x in vec]

bkamins · August 27, 2018, 3:32pm

Since DataFrame is a lightweight wrapper in most cases:

completecases(df[your_cols])

should be fast enough, where your_cols if a list of columns with which you want to work. If omitted then you get indicators for whole rows.

Edit: of course this is a solution if you need to analyze more than one column. With one column broadcast ismissing. Also if you simply want to get a subset of rows (not a vector of indicators) use dropmissing.

pdeffebach · August 27, 2018, 3:55pm

Do you mean something like this would become idiomatic?

f operates on a DataFrameRow

df[:t] = [completcases(df[my_suubset])[i] ? f(df[my_subset, i]) : missing for i in nrow(df)]

It’s not clear to me how one would do this in DataFramesMeta, if you have any insights I would appreciate it.

bkamins · August 27, 2018, 10:03pm

This was a more general comment how to identify rows with non-missing data and I agreed with you that if you have one column then ismissing is natural to use.

For a single column I like what you have proposed, and it should nicely work with DataFramesMeta like (this is probably not so useful as a stand alone command, but it would work nicely with @linq):

df = DataFrame(source=[1,2,3])
@transform(df, target = [ismissing(x) ? missing : sin(x) for x in :source])

bkamins · August 28, 2018, 10:56am

Here add lift function for working with missing values by bkamins · Pull Request #26661 · JuliaLang/julia · GitHub I have updated the PR regarding lift which what I think that would be nice to have. So if anyone is interested please comment. With this functionality we could simply write things like:

@transform(df, newcol=liftmissing(some_function).(:old_col1, :old_col2, :old_col3))

and if any of old_col1, old_col2, old_col3 contained missing we would get a missing in result.