DataFrame can't get filter to get rid of rows where a column contains a blank " "

Hi there

I have a dataframe ( df_list_of_clles ) that has a column ( toffer) that contains missing values and blanks. A blank in this case is " " .

I can get rid of the missing values using the below but isnothing doesn’t seem to work. What am I doing wrong please?

filter(:toffer => x -> !(ismissing(x) || isnothing(x)) , df_list_of_clles )

Maybe you want to add == " " instead?

Or better yet, replace " " with missing before you get to this step in the data cleaning.

Julia is not like Stata where empty strings are equivalent to missing.

excellent idea to replace " " with missing but what’s the best way to do that please? this is my first time with DataFrames.jl.

isnothing(x) tests if x is nothing. Since " " is not nothing (or even a Nothing), isnothing(" ") is false.

1 Like

DataFrames is not different from base Julia, a column in a DataFrame is just a plain old Vector:

julia> x = ["a", missing, " "]
3-element Vector{Union{Missing, String}}:
 "a"
 missing
 " "

julia> replace(x, " " => missing)
3-element Vector{Union{Missing, String}}:
 "a"
 missing
 missing

would still be the same if x was a column in a DataFrame - it would just be df.x

3 Likes

Perfect! I almost had it but dropped the ball before the try-line :slight_smile: I tried this but it didn’t seem to work.

replace!(df_list_of_clles.toffer, " "  => missing ) # specific column 

You solution ( again saving the day) worked.
thanks

thank you for filling in another gap in my julia knowledge. I’ve added the

isnothing(" ")

to my list of things to try before bugging you good people. Right now I’m suffering from overload and NOT enjoying fighting with vscode.

Glad it worked, but I think I should clarify so as not to mislead you: replace! is generally preferred as it will replace elements within a vector in place, while replace allocates a new vector:

julia> x = rand(["a", " ", missing], 100_000);

julia> @btime replace($x, " " => missing) setup=(x=$x);
  1.124 ms (2 allocations: 781.30 KiB)

julia> @btime replace!($x, " " => missing) setup=(x=$x);
  652.087 ΞΌs (0 allocations: 0 bytes)

Again as a DataFrame is just a collection of plain Vectors so:

julia> df = DataFrame(x = 1:3, toffer = ["a", missing, " "])
3Γ—2 DataFrame
 Row β”‚ x      toffer  
     β”‚ Int64  String? 
─────┼────────────────
   1 β”‚     1  a
   2 β”‚     2  missing 
   3 β”‚     3

julia> replace!(df.toffer, " " => missing); df
3Γ—2 DataFrame
 Row β”‚ x      toffer  
     β”‚ Int64  String? 
─────┼────────────────
   1 β”‚     1  a
   2 β”‚     2  missing 
   3 β”‚     3  missing 

If that didn’t work for you it’s likely that your toffer column wasn’t a Union{Missing, T} but just a Vector{T} that can’t hold missings.

More broadly you should ask yourself how you end up with " " in your Data - if you’re reading from csv or some other delimited text format (or similarly formatted data from an IOBuffer) you can use CSV.read(data, DataFrame; missingstring = " ") to read the data in correctly in the first place.

4 Likes

good morning @nilshg

an excellent surprise gift of more tools in my toolbox. Thank you so much for going the extra mile, AGAIN!

CSV.read(data, DataFrame; missingstring = " ") 

Is something I missed in the CSV documentation so you have prompted me to go back and revisit that. I had to rush things. Thanks again for all your help on this.

2 Likes