Is there a dropna() for DataFrame?

juliohm · January 30, 2017, 3:17am

How to drop the rows of a DataFrame that contain NA in any column? dropna() apparently only works with DataArray?

mkborregaard · January 30, 2017, 6:12am

complete_cases <20 char>

juliohm · January 30, 2017, 4:49pm

Perfect @mkborregaard, exactly what I needed. I think the naming here should be reworked though, things should be called the same for DataArray and DataFrame objects.

mkborregaard · January 30, 2017, 6:18pm

I am not sure I agree, but I am sure it would be taken into consideration if you open an issue on Dataframes.

hpoit · March 20, 2017, 8:38pm

After running complete_cases!(df) my data frame goes from 40,000 rows to 929, a bummer.

Is there a dropna() for data frames that doesn’t drop the entire row line with other columns if there is just one column within that row line with an NA? In other words, a dropna() that drops only the element with the NA?

nalimilan · March 20, 2017, 9:14pm

No, that doesn’t make sense: what are you going to put in the position where there was a missing value?

hpoit · March 20, 2017, 11:11pm

Like I wrote, the element contains an NA, not a missing value. In the place of the NA, there would be nothing. This pretty much boils down to tweaking the complete_cases!(df) code to search for NAs and simply erase them when found. I don’t know if this would compromise the integrity of the data frame (to have columns of different lengths), I don’t see why it should.

fengyang.wang · March 21, 2017, 4:01am

An NA is a missing value, by definition.

nalimilan · March 21, 2017, 9:25am

By definition a DataFrame is composed of columns with the same numbers of elements. They are not just a list of arbitrary vectors: the elements in the same position of each column correspond to the same observation/row. Removing an element in the middle of a vector breaks this.

You’ll need a different data structure if the notion of observation does not exist in your application.

pkofod · March 21, 2017, 9:36am

This makes no sense. Removing the NA and substituting in nothing? That’s just another way of expressing missing values (with the nothing value), but that doesn’t give you more or less data to work with.

hpoit · March 21, 2017, 10:35am

By “nothing” I meant an emptiness. Not an element with a nothing value, but a non-element.

hpoit · March 21, 2017, 10:41am

I thought an NA was an element with an NA value, and a missing value a non-element.

hpoit · March 21, 2017, 10:51am

Actually, I wouldn’t say dropna() is equivalent to complete_cases!(), since complete_cases!() removes not just the NA element. Is there an option where I just lose the NAs that does not translate to imputation?

pkofod · March 21, 2017, 10:59am

I feel like you need to state your problem and use case a bit clearer.

Evizero · March 21, 2017, 12:02pm

I am sorry to say, but it seems to me that there is a deeper misunderstanding going on. At the very least there is a miscommunication, because the way you want to deal with NA’s just does not make sense in the context of data tables. Even if it were done this way implementation-wise you would gain nothing.

If the function you want to pass your data to can’t deal with NA’s, then what you are attempting to do here is suppress the symptom for a problem that will still persist. The worst case result of that is that no errors are thrown and you will end up with nonsensical results. If you have a data set with missing data you need to deal with that fact one way or the other. There are different strategies, such as dropping non-complete cases, dropping whole columns that have any missing data, or imputing missing values with some scheme appropriate for that domain. Some machine learning models can even work with missing data directly (such as xgboost).

This looks like an XY Problem to me. Maybe we could be of more help if you describe your concrete use case, like @pkofod suggested

hpoit · March 21, 2017, 1:25pm

How do you drop only the non-complete case, and not take out the entire row line with it (but then again, will this compromise the data frame integrity, I don’t see why it should)? BayesNets.jl doesn’t accept NA, and the other option I know how to do is complete_cases!(), which takes the Winton Capital dataset from 40,000 instances to 929, not desirable. Imputation would be a pain, and really equal to just taking out the elements with NA.

I’m foreseeing however for other use cases e.g. Boltzmann.jl or any other use case. I’m not even thinking of use cases really, but on minimum loss of information of the dataset.

pkofod · March 21, 2017, 8:32pm

It’s still really not clear what you want to do. Dropping a non-complete case MEANS to remove the row, so yes that is absolutely a necessary result of the operation.

What do you mean by “not desirable”? It’s just the data you have!

mwsohn · March 22, 2017, 1:41am

Is this what you want?

import DataFrames.dropna
function dropna(df::DataFrame,args::Symbol...)
    if length(args) == 0
        return df[completecases(df),:]
    end

    ba = BitArray(trues(size(df,1)))
    for sym in args
        ba &= ~BitArray(df[sym].na)
    end
    return df[ba,:]
end

mkborregaard · March 22, 2017, 6:50am

I’ve often thought that a method doing this (dropping cases with NA only for a specified subset of the columns) would be a useful addition to methods using DataFrames (like R functions have an na.rm = T argument). For example, if you have a big dataset with hundreds of columns, some of which have NAs, and you want to do statistics on it, it would be nice to be able to do lm(foo ~ bar + baz, mydata, complete_cases = true), instead of having to 1) check isna manually for all terms in the call, or 2) run complete_cases and lose a lot of information because of NAs in a column that isn’t used for this operation.

nalimilan · March 22, 2017, 8:17am

I guess complete_cases could take an optional vector of column names as a second argument, and only consider missing values for these columns. Pull request welcome.

Regarding modelling functions, they are currently being redesigned in the StatsModels package, so the interface can be improved. Feel free to file an issue.

Topic		Replies	Views
DataFrames: How to remove rows containing NaNs when there are also missings General Usage dataframes	9	2405	November 28, 2024
How do I drop only rows that are fully filled with missing values? General Usage question , package , dataframes	3	183	January 24, 2023
Dealing with NaN's General Usage dataframes	21	5566	April 27, 2021
Quick DataFrame question: I am trying to read CSVs that, for some reason, have pos General Usage	1	247	February 16, 2021
How to remove rows containing missing from DataFrame? New to Julia	6	13217	July 22, 2019

Is there a dropna() for DataFrame?

Related topics