I have a dataframe in which I need to filter on political party. But because the data sometimes comes from noisy sources, garbage characters may be appended to a typical party name, so :Party .== "Democrat" (or any other party name) may fail when the string is actually "Democrat-----". So I was using occursin to do the comparison. It seems like nearly every function will need to handle the missing type, which is a lot of work. The profusion of types serves a necessary purpose, but it creates a lot of work.
For some reason, the lambda function seems to change the type? Or perhaps the problem was broadcasting with occursin.(stuff). In any case, this works, and at 1500 to 2200 rows the dataframes are small enough that this is not performance sensitive.
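Reconstructing that filter-plus-lambda approach as a sketch (the DataFrame contents and the :Party column name are assumptions based on the description above):

```julia
using DataFrames

# Toy data standing in for the noisy source; :Party is the assumed column name.
df = DataFrame(Party = ["Democrat-----", "Republican", missing, "Democrat"])

# Skip missing explicitly, then match on substring rather than equality,
# so trailing garbage characters do not break the comparison.
dems = filter(r -> !ismissing(r.Party) && occursin("Democrat", r.Party), df)
```

The `&&` short-circuits, so `occursin` never sees a `missing` value.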
I will also fix all of the strings: just trying to identify data cleansing chores in one function that tests each datafile.
You need to make an explicit decision about missing here; occursin in particular cannot decide for you, because you may want to say that something cannot occur in a missing value (false), or to propagate that as missing. You could do:
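A sketch of what that could look like, showing both decisions (the :Party column and the data are assumed; passmissing is from the Missings.jl package, and coalesce is discussed a few posts down):

```julia
using DataFrames, Missings

df = DataFrame(Party = ["Democrat-----", missing, "Republican"])

# Decision 1: "Democrat" cannot occur in a missing value -> false.
# coalesce replaces missing with a default ("" here) before the test.
isdem = occursin.("Democrat", coalesce.(df.Party, ""))

# Decision 2: propagate the missing instead, via Missings.passmissing.
isdem_or_missing = passmissing(occursin).("Democrat", df.Party)
```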
This would be fast and very awesome and would accomplish what I wanted for the entire dataframe. In addition to finding the missings, I need an array of indices or bools that tells me which rows don’t have missings, which your suggestion provides via bools.
Took me a second to try some examples and understand what coalesce does.
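For reference, coalesce returns its first non-missing argument, and it broadcasts like any other function:

```julia
# coalesce picks the first argument that is not missing.
coalesce(missing, "fallback")      # "fallback"
coalesce("Democrat", "fallback")   # "Democrat"

# Broadcast over a column to replace missings with a default value.
coalesce.([1, missing, 3], 0)      # [1, 0, 3]
```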
I used the filter approach with a lambda function, which was fast enough for 2000 row dataframes. I’ll go back and try your suggestion.
Maybe both DataFramesMeta and JuliaDBMeta could provide the following macro: @where expr skipmissing=true that detects which fields are needed to evaluate the expression (based on the symbols you used) and only evaluates the expression on rows for which these fields are not missing. Maybe using some crazy new tool like Cassette it’s possible to infer what fields are used without macros but I have no idea how one would go about that. In JuliaDBMeta OTOH this should be pretty straightforward to do.
I’d like to make that a bit more user-friendly. What I meant is that in JuliaDBMeta all row-wise macros, for example @map iris :SepalLength / :SepalWidth, expand to something like map(t -> t.SepalLength / t.SepalWidth, iris, select = (:SepalLength, :SepalWidth)). What I was proposing (in JuliaDB syntax) would be, if the user passes skipmissing, to expand to:
sel = dropna(select(iris, (:SepalLength, :SepalWidth))) # or an alternative that doesn't allocate, to remove missing values
map(t -> t.SepalLength / t.SepalWidth, sel)
A generic solution is good, especially to avoid so many dropna = T from R. I think lift might handle this better, since it wouldn’t drop missing values. Maybe when lift is implemented there could be a @lift macro like @. that wraps every function call in a lift function.
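As a rough approximation of that idea today, passmissing from the Missings.jl package lifts a single function (though not a whole expression, the way the proposed @lift macro would):

```julia
using Missings

# passmissing(f) returns a wrapper that yields missing
# whenever any argument is missing, and calls f otherwise.
safe_sin = passmissing(sin)

safe_sin(0.0)      # 0.0
safe_sin(missing)  # missing
```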
I like the current behavior too, since it reads like a sentence, though I can’t vouch for its efficiency.
Since DataFrame is a lightweight wrapper in most cases:
completecases(df, your_cols)
should be fast enough, where your_cols is a list of the columns with which you want to work. If omitted, you get indicators for whole rows.
Edit: of course this is a solution if you need to analyze more than one column. With one column broadcast ismissing. Also if you simply want to get a subset of rows (not a vector of indicators) use dropmissing.
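A small sketch putting those three options side by side (column names :a and :b are made up):

```julia
using DataFrames

df = DataFrame(a = [1, missing, 3], b = [missing, 2.0, 3.0])

# Several columns: indicator vector of rows complete in those columns.
completecases(df, [:a, :b])   # Bool[false, false, true]

# One column: just broadcast ismissing (negate for "non-missing" indicators).
.!ismissing.(df.a)            # Bool[true, false, true]

# If you want the subset of rows itself rather than indicators:
dropmissing(df)               # keeps only the last row here
```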
This was a more general comment on how to identify rows with non-missing data, and I agreed with you that if you have one column then ismissing is natural to use.
For a single column I like what you have proposed, and it composes nicely with DataFramesMeta (probably not so useful as a stand-alone command, but it would work well with @linq):
df = DataFrame(source = [1, 2, missing])
@transform(df, target = [ismissing(x) ? missing : sin(x) for x in :source])