Query @mutate changes other column type

Hi all, this is the smallest example that reproduces the problem I am fighting with:

using CSV, DataFrames, Query

contents = """
"5674012","aa66aa66"
"5674012","9b4e08e5"
"5674012","b036aa66,b036aa67,b036aa68"
""";

batches = CSV.File(IOBuffer(contents); header = ["X1", "Splits"], delim = ',') |> DataFrame;

emptyStringArray = Array{SubString{String},1}()
batches5 = batches |> 
            @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ',') : emptyStringArray) |> DataFrame
batches6 = batches |> 
            @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ',') : Array{SubString{String},1}()) |> DataFrame

Note the assignments to batches5 and batches6. The result looks like this:

julia> batches5
3×2 DataFrame
│ Row │ X1      │ Splits                               │
│     │ Any     │ Any                                  │
├─────┼─────────┼──────────────────────────────────────┤
│ 1   │ 5674012 │ ["aa66aa66"]                         │
│ 2   │ 5674012 │ ["9b4e08e5"]                         │
│ 3   │ 5674012 │ ["b036aa66", "b036aa67", "b036aa68"] │

julia> batches6
3×2 DataFrame
│ Row │ X1      │ Splits                               │
│     │ Int64   │ Array{SubString{String},1}           │
├─────┼─────────┼──────────────────────────────────────┤
│ 1   │ 5674012 │ ["aa66aa66"]                         │
│ 2   │ 5674012 │ ["9b4e08e5"]                         │
│ 3   │ 5674012 │ ["b036aa66", "b036aa67", "b036aa68"] │

What I don’t understand:

  1. The type of the Splits column differs: Any vs. Array{SubString{String},1}. Why? I just wanted to save memory, so I stored the value (which can be repeated) in emptyStringArray.

  2. Even if I accept that I did something bad to column :Splits, I don’t understand why X1’s type is changed to Any in batches5. I would have thought that the columns are independent. Is something wrong with @mutate?

Could somebody clarify what’s going on here?

I don’t know what Emptystringarray is. The reason for the Any is likely that there is no method promote_type(x::Vector{<:AbstractString}, y::Emptystringarray), so Julia defaults to promoting the vector to type Any.

emptyStringArray is defined in my sample code.

Sorry, I didn’t see that. I’m not sure what causes the behavior, then. This is odd.

Filed a bug, so let’s see… @mutate changes other column type · Issue #305 · queryverse/Query.jl · GitHub

The problem is that type inference sometimes breaks down if you reference a global variable in a closure, and Query.jl depends on type inference not breaking down right now :slight_smile:
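The inference problem can be seen without Query.jl at all. The sketch below uses only Base; the names non_const_empty, CONST_EMPTY, f_global and f_const are made up for illustration, and Base.return_types is used to ask the compiler what it infers for a call. It is not how Query.jl works internally, only a small stand-in for the closure that @mutate builds.

```julia
# Non-const global: the binding could later be reassigned to a value of any
# type, so the compiler cannot assume a type when a function reads it.
non_const_empty = SubString{String}[]

# Const global: the binding is fixed, so its type is known to the compiler.
const CONST_EMPTY = SubString{String}[]

# Same expression as in the query, once per kind of global.
f_global(s) = length(s) > 0 ? split(s, ',') : non_const_empty
f_const(s)  = length(s) > 0 ? split(s, ',') : CONST_EMPTY

# Ask the compiler what return type it infers for a String argument.
rt_global = first(Base.return_types(f_global, (String,)))
rt_const  = first(Base.return_types(f_const, (String,)))

println("non-const global: ", rt_global, " (concrete: ", isconcretetype(rt_global), ")")
println("const global:     ", rt_const, " (concrete: ", isconcretetype(rt_const), ")")
```

Both functions return the same values at run time; only what the compiler can prove about them differs, and that is the information Query.jl currently relies on.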

Two ways to fix this at the moment:

  1. You can declare emptyStringArray to be const, i.e. const emptyStringArray = ...
  2. You can put emptyStringArray and the query into a function
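Applied to the example from the first post, the two workarounds might look roughly like this. This is a sketch, not tested against every Query.jl version: the names EMPTY_SPLITS, split_batches, batches_const and batches_fn are invented here, and the DataFrame is built directly instead of through CSV.File to keep the example short.

```julia
using DataFrames, Query

batches = DataFrame(X1 = [5674012, 5674012, 5674012],
                    Splits = ["aa66aa66", "9b4e08e5", "b036aa66,b036aa67,b036aa68"])

# Workaround 1: make the shared empty array a const global.
const EMPTY_SPLITS = SubString{String}[]

batches_const = batches |>
    @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ',') : EMPTY_SPLITS) |>
    DataFrame

# Workaround 2: keep the empty array and the query inside one function,
# so the array is a local variable instead of a global.
function split_batches(df)
    empty_splits = SubString{String}[]
    return df |>
        @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ',') : empty_splits) |>
        DataFrame
end

batches_fn = split_batches(batches)

# Inspect the resulting column element types; per the explanation above,
# X1 is expected to keep its integer type in both results.
println(eltype.(eachcol(batches_const)))
println(eltype.(eachcol(batches_fn)))
```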

The proper solution to this is to drop the dependency on type inference in Query.jl. It has been on my todo list for about 2-3 years now :slight_smile: There is no fundamental reason that this could not be done, but it is a bit of a pain to implement. At some point I’ll push myself to do it, but no promises.

@davidanthoff I’m a real newbie at this, so I’m only guessing at what you meant :wink:

Anyway, here is another example, this time without a global variable:

julia> batches
3×2 DataFrame
│ Row │ X1      │ Splits                     │
│     │ Int64   │ String                     │
├─────┼─────────┼────────────────────────────┤
│ 1   │ 5674012 │ aa66aa66                   │
│ 2   │ 5674012 │ 9b4e08e5                   │
│ 3   │ 5674012 │ b036aa66,b036aa67,b036aa68 │

julia> batches5b = batches |>
                   @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ',') : 1.) |> DataFrame
3×2 DataFrame
│ Row │ X1      │ Splits                               │
│     │ Any     │ Any                                  │
├─────┼─────────┼──────────────────────────────────────┤
│ 1   │ 5674012 │ ["aa66aa66"]                         │
│ 2   │ 5674012 │ ["9b4e08e5"]                         │
│ 3   │ 5674012 │ ["b036aa66", "b036aa67", "b036aa68"] │

See the column types change (batches vs. batches5b). Is your comment still relevant, or is that something new?

So the reason the Splits column here turns into Any is that the expression length(_.Splits) > 0 ? split(_.Splits, ',') : 1. can return either an array of SubString, or a Float64 value. So we need to make the column type one that can handle both of these cases, which is Any. We call this type of situation a type instability, because that expression returns a value of a different type depending on the values of the inputs.
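A small Base-only sketch of that type instability, outside of Query.jl. The function names unstable and stable are made up for illustration, and typejoin is used here only to show the closest common supertype of the two possible results; it is not claimed to be the exact mechanism Query.jl uses to pick the column type.

```julia
# The branch taken depends on the value of `s`, and the two branches
# produce values of unrelated types.
unstable(s) = length(s) > 0 ? split(s, ',') : 1.0

println(typeof(unstable("a,b")))  # a vector of substrings
println(typeof(unstable("")))     # a Float64

# The only type that covers both a Vector{SubString{String}} and a Float64
# in Julia's type hierarchy is Any.
println(typejoin(Vector{SubString{String}}, Float64))

# What the compiler infers for the call is not a single concrete type.
inferred = first(Base.return_types(unstable, (String,)))
println(inferred, " (concrete: ", isconcretetype(inferred), ")")

# A type-stable variant returns the same kind of value from both branches.
stable(s) = length(s) > 0 ? split(s, ',') : SubString{String}[]
println(typeof(stable("")))
```

In the thread's data every row takes the split branch, so no Float64 ever shows up in the column, yet the column type still has to be chosen for both possibilities.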

That the X1 column then also turns into Any is at the end of the day a bug in Query.jl that is just cumbersome to fix for me :slight_smile: But, that is no good excuse, of course. That bug is triggered by the type instability in the other column.

Thank you. Well explained, I kind of expected something like this. Anyway, good to know :slight_smile: I really appreciate you clarifying it for me.