Is there a dropna() for DataFrame?

How to drop the rows of a DataFrame that contain NA in any column? dropna() apparently only works with DataArray?

1 Like

complete_cases <20 char>

1 Like

Perfect @mkborregaard, exactly what I needed. I think the naming here should be reworked though, things should be called the same for DataArray and DataFrame objects.

I am not sure I agree, but I am sure it would be taken into consideration if you open an issue on Dataframes.

1 Like

After running complete_cases!(df) my data frame goes from 40,000 rows to 929, a bummer.

Is there a dropna() for data frames that doesn’t drop the entire row line with other columns if there is just one column within that row line with an NA? In other words, a dropna() that drops only the element with the NA?

No, that doesn’t make sense: what are you going to put in the position where there was a missing value?

Like I wrote, the element contains an NA, not a missing value. In the place of the NA, there would be nothing. This pretty much boils down to tweaking the complete_cases!(df) code to search for NAs and simply erase them when found. I don’t know if this would compromise the integrity of the data frame (to have columns of different lengths), I don’t see why it should.

An NA is a missing value, by definition.

By definition a DataFrame is composed of columns with the same numbers of elements. They are not just a list of arbitrary vectors: the elements in the same position of each column correspond to the same observation/row. Removing an element in the middle of a vector breaks this.

You’ll need a different data structure if the notion of observation does not exist in your application.

This makes no sense. Removing the NA and substituting in nothing? That’s just another way of expressing missing values (with the nothing value), but that doesn’t give you more or less data to work with.

By “nothing” I meant an emptiness. Not an element with a nothing value, but a non-element.

I thought an NA was an element with an NA value, and a missing value a non-element.

Actually, I wouldn’t say dropna() is equivalent to complete_cases!(), since complete_cases!() removes not just the NA element. Is there an option where I just lose the NAs that does not translate to imputation?

I feel like you need to state your problem and use case a bit clearer.

1 Like

I am sorry to say, but it seems to me that there is a deeper misunderstanding going on. At the very least there is a miscommunication, because the way you want to deal with NA’s just does not make sense in the context of data tables. Even if it were done this way implementation-wise you would gain nothing.

If the function you want to pass your data to can’t deal with NA’s, then what you are attempting to do here is suppress the symptom for a problem that will still persist. The worst case result of that is that no errors are thrown and you will end up with nonsensical results. If you have a data set with missing data you need to deal with that fact one way or the other. There are different strategies, such as dropping non-complete cases, dropping whole columns that have any missing data, or imputing missing values with some scheme appropriate for that domain. Some machine learning models can even work with missing data directly (such as xgboost).

This looks like an XY Problem to me. Maybe we could be of more help if you describe your concrete use case, like @pkofod suggested

How do you drop only the non-complete case, and not take out the entire row line with it (but then again, will this compromise the data frame integrity, I don’t see why it should)? BayesNets.jl doesn’t accept NA, and the other option I know how to do is complete_cases!(), which takes the Winton Capital dataset from 40,000 instances to 929, not desirable. Imputation would be a pain, and really equal to just taking out the elements with NA.

I’m foreseeing however for other use cases e.g. Boltzmann.jl or any other use case. I’m not even thinking of use cases really, but on minimum loss of information of the dataset.

It’s still really not clear what you want to do. Dropping a non-complete case MEANS to remove the row, so yes that is absolutely a necessary result of the operation.

What do you mean by “not desirable”? It’s just the data you have!

Is this what you want?

import DataFrames.dropna
function dropna(df::DataFrame,args::Symbol...)
    if length(args) == 0
        return df[completecases(df),:]
    end

    ba = BitArray(trues(size(df,1)))
    for sym in args
        ba &= ~BitArray(df[sym].na)
    end
    return df[ba,:]
end

1 Like

I’ve often thought that a method doing this (dropping cases with NA only for a specified subset of the columns) would be a useful addition to methods using DataFrames (like R functions have an na.rm = T argument). For example, if you have a big dataset with hundreds of columns, some of which have NAs, and you want to do statistics on it, it would be nice to be able to do lm(foo ~ bar + baz, mydata, complete_cases = true), instead of having to 1) check isna manually for all terms in the call, or 2) run complete_cases and lose a lot of information because of NAs in a column that isn’t used for this operation.

1 Like

I guess complete_cases could take an optional vector of column names as a second argument, and only consider missing values for these columns. Pull request welcome.

Regarding modelling functions, they are currently being redesigned in the StatsModels package, so the interface can be improved. Feel free to file an issue.