Help checking whether an element in a Tables.jl table is defined

Hi!

I am trying to print tables that comply with the Tables.jl specification. However, these tables can have undefined elements, such as:

julia> df = DataFrame(B=Int64.(1:9), A = Vector{Any}(undef, 9))

I am having trouble checking whether an element is undefined using the Tables.jl API when I have only row access to the data. Today, I am reading a row table using this code:

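    # Row and column indices of the requested cell.
    i, j = inds[1], inds[2]

    # Fetch row `i`. This assumes the iterator uses the row index as its
    # state, which the Tables.jl interface does not guarantee.
    next = iterate(table, i)

    # Check for `nothing` before destructuring, otherwise the unpacking fails first.
    next === nothing &&
        error("The row $i does not exist.")

    it, _ = next
    element = it[j]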

It can be tested with DataFrames with this code:

julia> df = DataFrame(B=Int64.(1:9), A = Vector{Any}(undef, 9))

julia> table = Tables.rows(df)

julia> i,j = 2,2

julia> it, _ = iterate(table, i)

julia> element = it[j]

which fails because the element is undefined. My question is: how can I check that this element is undefined to avoid this error? Unfortunately, using try and catch will have a big performance impact here and I want to avoid it.

@quinnj - I think we should add a requirement that the rows returned by Tables.rows define an isassigned method.
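To make the proposal concrete, here is a minimal Base-only sketch of what such a requirement might look like for a row type. `ColumnRow` is a hypothetical type invented for illustration; it is not part of Tables.jl, and a real row type would differ.

```julia
# Hypothetical row type backed by column vectors; not part of Tables.jl.
struct ColumnRow{C<:Tuple}
    columns::C   # one vector per column
    i::Int       # index of the row this object represents
end

# Positional access to the j-th cell; throws UndefRefError if the cell is #undef.
Base.getindex(r::ColumnRow, j::Integer) = r.columns[j][r.i]

# The proposed check: is the j-th cell of this row assigned?
Base.isassigned(r::ColumnRow, j::Integer) = isassigned(r.columns[j], r.i)

cols = (collect(1:3), Vector{Any}(undef, 3))
row = ColumnRow(cols, 2)

isassigned(row, 1)   # true: the Int column always holds a value
isassigned(row, 2)   # false: the cell is #undef

# Read the cell without try/catch, falling back to `missing`.
element = isassigned(row, 2) ? row[2] : missing
```

With a method like this, a printing routine could test each cell before reading it instead of catching the error.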

Also, in Tables.jl the Base.show(io::IO, x::T) where {T <: RorC} method contains the line show(io, NamedTuple(x)), which throws an error on #undef.


Hmmmmm…yeah, this is pretty annoying. I think in the data world, we don’t run into this much because these values are usually missing instead of #undef. I don’t love the idea of requiring isassigned because it kind of puts this fault in the entire API; oh, any row value might be #undef and I need to check that everywhere to make my code generic? It just feels like we’re turning things into Java where null can literally subvert the entire type system and you can’t trust anything.

I’d rather go the other direction and say row values have to be defined; i.e. #undef isn’t allowed. Are there cases in the data ecosystem where this actually comes up or is this just for testing? I’d rather go to problematic sources and ensure #undef doesn’t come up at all in practice.


The only use case I could imagine encountering is when using a table to store simulation data, and the table is created with two columns (simplest case):

  • parameters (filled immediately)
  • results (set as #undef and filled as simulations get run and results from them are collected)

But #undef can be easily replaced by nothing in this use case.
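A rough sketch of that replacement, using a NamedTuple of vectors as the table. The column names and the stand-in computation are made up for illustration.

```julia
# Hypothetical simulation table stored as a NamedTuple of column vectors.
params = [0.1, 0.2, 0.3]

# Results start as `nothing` instead of #undef, so every cell is always defined.
results = Vector{Union{Nothing,Float64}}(nothing, length(params))

table = (parameters = params, results = results)

# Fill in a result as a simulation finishes (stand-in computation).
table.results[2] = 2 * table.parameters[2]

# Pending rows can be found without ever touching an #undef cell.
pending = findall(isnothing, table.results)
```

Here `pending` holds the indices of the rows that still have no result, and any row iterator can read every cell safely.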

For my case, I have no preference :) I have to handle #undef in arrays anyway, so I still need to check every time whether a cell is assigned. I’m fine if Tables.jl just prohibits #undef entries or if there is an API requirement to define isassigned.

My question is whether this is even possible without using a sentinel for each field. isassigned works when the fields of a struct or the elements of an array point to a heap-allocated object (because, if I am not wrong, it basically checks whether the pointer is null). However, if some of your columns hold things like Ints and Floats while others hold heap-allocated objects, then the RowData would need to do one of two things: return a partially initialized object (and I remember a recent topic about the problems in trying to do so); or identify the columns that may have unassigned elements, create a sentinel for them in the RowData, and specialize Base.isassigned to look for those sentinels instead.
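The difference between the two kinds of storage can be seen with Base alone. This is only a small illustration; `Cell` is a hypothetical type made up for the example.

```julia
# Bits-type elements are stored inline: an uninitialized slot holds arbitrary
# but valid data, so it always counts as assigned.
ints = Vector{Int}(undef, 3)
isassigned(ints, 1)   # true

# Boxed elements are stored as references: an uninitialized slot is a null
# reference, which is what shows up as #undef.
anys = Vector{Any}(undef, 3)
isassigned(anys, 1)   # false

# The same distinction applies to struct fields.
mutable struct Cell   # hypothetical type for illustration
    n::Int
    payload::Any
    Cell() = new()    # leave both fields uninitialized
end

c = Cell()
isdefined(c, :n)         # true: bits-type field
isdefined(c, :payload)   # false: reference field is still null
```

So a check based on isassigned or isdefined can only ever report "unassigned" for the reference-typed columns, which is why the bits-type columns would need a separate sentinel.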

I agree. In practice I would use nothing or missing, but if that is not an option, the row iterator could check isassigned and return one of them in place of an unassigned cell.
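Something along these lines, as a minimal sketch over plain column vectors. `safe_rows` is a hypothetical helper, not a Tables.jl function, and it assumes all columns have the same length.

```julia
# Hypothetical helper, not part of Tables.jl: iterate the rows of a tuple of
# column vectors, yielding `missing` wherever a cell is #undef.
function safe_rows(columns::Tuple)
    n = length(first(columns))   # assumes all columns have equal length
    return (map(col -> isassigned(col, i) ? col[i] : missing, columns) for i in 1:n)
end

b = collect(1:3)
a = Vector{Any}(undef, 3)
a[2] = "done"

# Each row comes out as a tuple; unassigned cells become `missing`.
rows = collect(safe_rows((b, a)))
```

Note that comparing such rows with == propagates missing, so isequal is the safer way to compare them.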