DataFrames: ByRow fails in transform with PooledArrays after CSV.read

I am having an issue with a transform operation on a DataFrame that is read from a .csv file using CSV. I don't see the issue when my DataFrame has few rows, because importing from the .csv file then uses SentinelArrays. However, when I have a larger DataFrame, PooledArrays are used.

The problem is the following: I have a column with missing values. I would like to replace each missing value with a unique symbol using gensym(). The transformation works as expected on the freshly created DataFrame, but when I save it as a .csv and read it into a new DataFrame, all missing values are replaced with the same gensym() symbol. Any thoughts here? If I try to use map instead of transform, the same issue occurs.

julia> using CSV, DataFrames

julia> x=DataFrame(A=vcat("C",repeat([missing],1000)),B=3)
1001×2 DataFrame
  Row │ A        B     
      │ String?  Int64 
──────┼────────────────
    1 │ C            3
    2 │ missing      3
    3 │ missing      3
    4 │ missing      3
    5 │ missing      3
  ⋮   │    ⋮       ⋮
  995 │ missing      3
  996 │ missing      3
  997 │ missing      3
  998 │ missing      3
  999 │ missing      3
 1000 │ missing      3
 1001 │ missing      3
       955 rows omitted

julia> CSV.write("x.csv",x)
"x.csv"

julia> transform(x,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B     
      │ Any     Int64 
──────┼───────────────
    1 │ C           3
    2 │ ##1754      3
    3 │ ##1755      3
    4 │ ##1756      3
    5 │ ##1757      3
  ⋮   │   ⋮       ⋮
  995 │ ##2747      3
  996 │ ##2748      3
  997 │ ##2749      3
  998 │ ##2750      3
  999 │ ##2751      3
 1000 │ ##2752      3
 1001 │ ##2753      3
      955 rows omitted

julia> y=CSV.read("x.csv",DataFrame);

julia> transform(y,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B     
      │ Any     Int64 
──────┼───────────────
    1 │ C           3
    2 │ ##2755      3
    3 │ ##2755      3
    4 │ ##2755      3
    5 │ ##2755      3
  ⋮   │   ⋮       ⋮
  995 │ ##2755      3
  996 │ ##2755      3
  997 │ ##2755      3
  998 │ ##2755      3
  999 │ ##2755      3
 1000 │ ##2755      3
 1001 │ ##2755      3
      955 rows omitted

So, I found I can use the keyword argument pool = false when reading the .csv to fix this issue.
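For anyone landing here, the workaround can be sketched roughly as follows. This is only a minimal sketch: the temporary file path and the variable names are made up for illustration, and the column type CSV.jl picks with pooling disabled may differ between CSV.jl versions.

```julia
using CSV, DataFrames

# Same shape as the example above: one string followed by many missings.
x = DataFrame(A = vcat("C", repeat([missing], 1000)), B = 3)

# Hypothetical temporary file, used so the sketch does not overwrite x.csv.
path = tempname() * ".csv"
CSV.write(path, x)

# pool = false asks CSV.jl not to store any column as a PooledArray.
y = CSV.read(path, DataFrame; pool = false)

# With an unpooled column the function is called once per row.
z = transform(y, :A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)

# One entry for "C" plus one distinct symbol per missing row.
n_unique = length(unique(z.A))
```

The trade-off is that pool = false turns pooling off for every column, so repetitive string columns will take more memory.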

@bkamins, should I submit an issue on DataFrames.jl for this? Or is this behavior expected when applying transform to a String? column that is pooled?

Can you try to make an MWE that omits DataFrames and file it at CSV.jl?

So it's a CSV.jl issue? I thought it would be a DataFrames.jl issue since the problem occurs when trying to transform the data because it has been compressed by CSV.

It's a SentinelArrays issue, and therefore a CSV issue I think. Note that ByRow(f)(x) is just going to do f.(x), so ByRow isn't doing anything unique with this.

It actually has to do with PooledArrays. Since these are compressed, I'm guessing all missing items are compressed to the same object, which is why they get replaced with the same Symbol with gensym().
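A small sketch without CSV or DataFrames may make the pooling explanation concrete. It peeks at the `pool` field, which is an internal detail of PooledArrays.jl and could change, and the names `pa`, `f`, and `per_row` are made up for illustration. Whether `map(f, pa)` itself shares one result per pool entry may depend on the installed PooledArrays.jl version, so the sketch does not rely on it and uses a comprehension instead.

```julia
using PooledArrays

# Four elements, but only two distinct values: "C" and missing.
pa = PooledArray(["C", missing, missing, missing])

# Internal field: each distinct value is stored once, and the rows
# only hold integer references into this pool.
n_distinct = length(pa.pool)

f = i -> ismissing(i) ? gensym() : i

# A comprehension iterates over the elements themselves, so f (and
# therefore gensym) runs once per row regardless of the pooling.
per_row = [f(v) for v in pa]

n_symbols = count(v -> v isa Symbol, per_row)
all_different = allunique(per_row)
```

If a function is applied once per pool entry rather than once per row, every row that refers to the `missing` entry receives the same result, which would match the repeated symbol in the output above.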

So maybe pool=false should be the default, but perhaps there are other (better) reasons why pool=true is the default.

This is a DataFrames.jl/PooledArrays.jl issue. It is tracked in https://github.com/JuliaData/PooledArrays.jl/issues/63. I have opened https://github.com/JuliaData/DataFrames.jl/issues/2834 to make sure it is resolved soon in DataFrames.jl.

The reason is https://github.com/JuliaData/PooledArrays.jl/blob/35ecfd186c5e0f1aba1fc278e93766f3258f9cc3/src/PooledArrays.jl#L307
