Hi,
Is there a way to replace all undef values in a DataFrame column with NA or another value?
And how do I get rid of all the rows that have at least 1 undef value?
Thank you
I think the first question should be how the undefs got there in the first place, because that should never happen. DataFrames also seems to prevent this:
julia> o = Vector{String}(undef, 10)
10-element Vector{String}:
#undef
[...]
julia> p = DataFrame(o)
ERROR: UndefRefError: access to undefined reference
[...]
Could you provide a MWE (minimal working example)?
Here is a DataFrame that contains undef values:
df = DataFrame(
A = Vector{String}(undef, 5),
B = [5,1,2,undef,4]
)
Ah I see, thank you.
If you want to fill an array with values like
a = zeros(10)
for i in eachindex(a)
a[i] = 5
end
you can avoid filling the array with zeros (zeros(10)) since you know you are going to overwrite the values anyway. So here using
a = Vector{Float64}(undef, 10)
for i in eachindex(a)
a[i] = 5
end
would save you a tiny amount of time.
However, this is pretty much the only time you should use undef. If you have code where you can access undefs, something went wrong. Since you seem to have control over the arrays themselves, I would advise you to overwrite the undefs immediately or not use them in the first place.
What are you using them for? Maybe something like missing would be more fitting (and way easier and safer to work with)?
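To illustrate the two suggestions above (initialize directly, or use missing as the placeholder), here is a minimal sketch using only Base Julia. The variable names are made up for illustration.

```julia
# Initialize in one step instead of allocating with undef and looping.
a = fill(5.0, 10)

# Use `missing` as an explicit "no value yet" placeholder.
# Unlike an unassigned slot, a `missing` entry can be read and tested safely.
m = Vector{Union{Missing, String}}(missing, 5)
m[1] = "hi"

n_missing = count(ismissing, m)       # how many entries are still missing
present = collect(skipmissing(m))     # only the entries that have a value
```

With missing, functions such as ismissing and skipmissing work on every element, whereas reading an unassigned String slot throws an UndefRefError.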
I did not build the DataFrame, I got it like this in a .jls file and I loaded the DataFrame with the following code:
df = Serialization.deserialize("data.jls")
Well, it sounds like something went wrong with the deserialization (possibly due to “In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.”?). For anything but short term saving probably CSV.jl, JDF.jl, etc. would be a better choice in the future.
If you have any way to access the original data, that would probably be the easiest option, but if not, I appreciate that all this advice is not going to help you.
I hacked something together that can check if a single element is undef. Maybe someone else knows of a better way.
df = DataFrame(
A = Vector{String}(undef, 5),
B = [5,1,2,undef,4]
)
a = df.A
isassigned(a) # false
a[1] = "hi"
isassigned(a) # true
isassigned(Ref(@view a[1]).x) # true
isassigned(Ref(@view a[2]).x) # false
b = df.B
b[3] === UndefInitializer() # false
b[4] === UndefInitializer() # true
Using this I would go through all the data to replace all the undefs.
(PS: make extra sure the rest of your data is intact. It seems unlikely to me that something blew a few holes in your dataset but left the rest untouched.)
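A possibly simpler check is the two-argument form isassigned(v, i) from Base, which tests one index directly. Below is a rough sketch, assuming a plain Vector of a non-bits type such as String; replace_undef is a hypothetical helper name, not part of Julia or DataFrames.

```julia
# Hypothetical helper: return a copy of `v` in which every unassigned slot
# is replaced by `fill_value` (default: `missing`).
function replace_undef(v::Vector{T}, fill_value=missing) where {T}
    out = Vector{Union{T, typeof(fill_value)}}(undef, length(v))
    for i in eachindex(v)
        # isassigned(v, i) is false for slots that were never written
        out[i] = isassigned(v, i) ? v[i] : fill_value
    end
    return out
end

a = Vector{String}(undef, 4)
a[1] = "hi"
a[3] = "there"

cleaned = replace_undef(a)            # unassigned slots become `missing`
with_na = replace_undef(a, "NA")      # or any other replacement value
assigned_rows = [i for i in eachindex(a) if isassigned(a, i)]  # rows to keep
```

For a DataFrame, this could presumably be applied column by column (for example df.A = replace_undef(df.A)), after which dropmissing(df) from DataFrames would remove the affected rows. That part is untested against the deserialized data in question.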
This does not seem like a realistic example to me. This is not what a Vector{Int} with undefined positions would look like at all. [5,1,2,undef,4] is a Vector{Any} with four Int values and one UndefInitializer object, which would never exist there unless something very wrong was done. If you create a Vector{Int} with Vector{Int}(undef, 4) it will look like:
4-element Array{Int64,1}:
140534710323696
140534774110064
140534710323728
0
It will never have an undef inside it, because Int returns true for isbitstype and, therefore, each undefined position is just the Int value of the dirty bits from the memory allocated for the array. There is no way to represent undef with an Int.
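The difference between the two kinds of element type can be checked directly. This is a small sketch using only Base functions; the variable names are illustrative.

```julia
# Int is a bits type: every slot of an uninitialized Vector{Int} already
# holds *some* Int (arbitrary leftover memory), so all slots count as assigned.
v = Vector{Int}(undef, 4)
int_is_bits = isbitstype(Int)
all_assigned = all(i -> isassigned(v, i), eachindex(v))

# String is not a bits type: slots are references that start out unassigned.
s = Vector{String}(undef, 4)
string_is_bits = isbitstype(String)
any_assigned = any(i -> isassigned(s, i), eachindex(s))

# Writing `undef` inside an array literal stores the UndefInitializer object
# itself as an ordinary element, which forces the element type to Any.
x = [5, 1, 2, undef, 4]
x_type = typeof(x)
is_undef_object = x[4] === undef
```

Here all_assigned is true and any_assigned is false, which is why only non-bits columns such as String can show #undef entries.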
I made up a DataFrame with undef. The DataFrame I use only has String columns and I can’t share it here.