Remove / replace undef values in DataFrame

Hi,
Is there a way to replace all undef values in a DataFrame column with NA or another value?
And how do I get rid of all the rows that have at least 1 undef value?
Thank you

I think the first question should be how the undefs got there in the first place because that should never happen.

DataFrames also seems to prevent this:

julia> o = Vector{String}(undef, 10)
10-element Vector{String}:
 #undef
[...]

julia> p = DataFrame(o)
ERROR: UndefRefError: access to undefined reference
[...]

Could you provide a MWE (minimal working example)?

Here is a DataFrame that contains undef values:

df = DataFrame(
  A = Vector{String}(undef, 5),
  B = [5,1,2,undef,4]
)

Ah I see, thank you.

If you want to fill an array with values like

a = zeros(10)
for i in eachindex(a)
    a[i] = 5
end

you can avoid filling the array with zeros (zeros(10)) since you know you are going to overwrite the values anyway. So here using

a = Vector{Float64}(undef, 10)
for i in eachindex(a)
    a[i] = 5
end

would save you a tiny amount of time.

However, this is pretty much the only time you should use undef. If you have code where you can access undefs, something went wrong. Since you seem to have control over the arrays themselves I would advise you to overwrite the undefs immediately or not use them in the first place.

What are you using them for? Maybe something like missing would be more fitting (and way easier and safer to work with)?
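To illustrate the suggestion above, here is a minimal sketch of what working with missing looks like, assuming DataFrames.jl is installed. The table and its column names are made up for illustration. coalesce and dropmissing cover the two operations asked about in the original question (replacing values and dropping incomplete rows), but only once the gaps are represented as missing rather than as unassigned slots.

```julia
using DataFrames

# Hypothetical table that marks absent entries with `missing`
# instead of leaving array slots unassigned.
df = DataFrame(A = ["a", missing, "c"], B = [1, 2, missing])

# Replace every `missing` in column A with a placeholder value.
df_filled = copy(df)
df_filled.A = coalesce.(df.A, "n/a")

# Drop every row that contains at least one `missing`.
df_complete = dropmissing(df)
```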

I did not build the DataFrame, I got it like this in a .jls file and I loaded the DataFrame with the following code:

df = Serialization.deserialize("data.jls")

Well, it sounds like something went wrong with the deserialization (possibly because of this caveat: “In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.”). For anything but short-term storage, CSV.jl, JDF.jl, etc. would probably be a better choice in the future.

If you have any way to access the original data, that would probably be the easiest option, but if not, I appreciate that all this advice is not going to help you.

I hacked something together that can check if a single element is undef. Maybe someone else knows of a better way.

df = DataFrame(
  A = Vector{String}(undef, 5),
  B = [5,1,2,undef,4]
)
a = df.A
isassigned(a)  # false
a[1] = "hi"
isassigned(a)  # true
isassigned(Ref(@view a[1]).x)  # true
isassigned(Ref(@view a[2]).x)  # false
b = df.B
b[3] === UndefInitializer()  # false
b[4] === UndefInitializer()  # true

Using this I would go through all the data to replace all the undefs.
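One way to do that pass is with the two-argument form isassigned(v, i) from Base, which reports whether slot i of an array holds a value. The following is only a sketch under assumptions: the table is a made-up stand-in for the deserialized one, the columns are plain Vector{String}, and fill_undef is a hypothetical helper name, not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical stand-in for the deserialized table:
# String columns in which some slots were never assigned.
df = DataFrame(A = Vector{String}(undef, 4), B = Vector{String}(undef, 4))
df.A[1] = "x"; df.A[3] = "y"
df.B[1] = "p"; df.B[2] = "q"; df.B[3] = "r"

# Hypothetical helper: copy `v`, writing `missing` into every unassigned slot.
function fill_undef(v::AbstractVector{T}) where {T}
    out = Vector{Union{T, Missing}}(missing, length(v))
    for (k, i) in enumerate(eachindex(v))
        isassigned(v, i) && (out[k] = v[i])
    end
    return out
end

# Mask of rows in which every column has an assigned value.
complete = [all(col -> isassigned(col, i), eachcol(df)) for i in 1:nrow(df)]

# Keep only the complete rows (only assigned slots are read here).
df_complete = df[complete, :]

# Replace the unassigned slots in every column with `missing`.
for name in names(df)
    df[!, name] = fill_undef(df[!, name])
end
```

Once the columns hold missing instead of unassigned slots, a different placeholder can be substituted with something like coalesce.(df.A, "").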

(PS: make extra sure the rest of your data is intact. It seems unlikely to me that something blew a few holes in your dataset but the rest was untouched :wink:)

This does not seem like a realistic example to me. This is not what a Vector{Int} with undefined positions would look like at all. [5,1,2,undef,4] is a Vector{Any} with three Ints and an UndefInitializer object, which would never end up there unless something went very wrong. If you create a Vector{Int} with Vector{Int}(undef, 4), it will look like:

4-element Array{Int64,1}:
 140534710323696
 140534774110064
 140534710323728
               0

It will never have an undef inside it, because Int returns true for isbitstype and, therefore, each undefined position is just the Int value of the dirty bits from the memory allocated for the array. There is no way to represent undef with an Int.
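The difference between the two kinds of element type can be checked directly. This is a small sketch using only Base; the variable names are arbitrary.

```julia
ints = Vector{Int}(undef, 4)     # slots contain whatever bits were in memory
strs = Vector{String}(undef, 4)  # slots are unassigned references (#undef)

int_is_bits = isbitstype(Int)        # true
str_is_bits = isbitstype(String)     # false

# Every slot of the Int vector counts as assigned, none of the String vector does.
ints_all_assigned = all(i -> isassigned(ints, i), eachindex(ints))  # true
strs_any_assigned = any(i -> isassigned(strs, i), eachindex(strs))  # false
```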

I made up a DataFrame with undef. The DataFrame I use only has String columns and I can’t share it here.