DataFrames: ByRow fails in transform with PooledArrays after CSV.read

hdavid16 · August 6, 2021, 5:26pm

I am having an issue with a transform operation on a DataFrame that is read from a .csv file using CSV.jl. The issue doesn’t show up when the DataFrame has few rows, because the import from the .csv file then uses SentinelArrays. With a larger DataFrame, however, PooledArrays are used.

The problem is the following: I have a column with missing values, and I would like to replace each missing value with a unique symbol using gensym(). The transformation works as expected on the freshly created DataFrame, but when I save it as a .csv and read it into a new DataFrame, all missing values are replaced with the same symbol. Any thoughts here? If I use map instead of transform, the same issue occurs.

julia> using CSV, DataFrames

julia> x=DataFrame(A=vcat("C",repeat([missing],1000)),B=3)
1001×2 DataFrame
  Row │ A        B     
      │ String?  Int64 
──────┼────────────────
    1 │ C            3
    2 │ missing      3
    3 │ missing      3
    4 │ missing      3
    5 │ missing      3
  ⋮   │    ⋮       ⋮
  995 │ missing      3
  996 │ missing      3
  997 │ missing      3
  998 │ missing      3
  999 │ missing      3
 1000 │ missing      3
 1001 │ missing      3
       955 rows omitted

julia> CSV.write("x.csv",x)
"x.csv"

julia> transform(x,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B     
      │ Any     Int64 
──────┼───────────────
    1 │ C           3
    2 │ ##1754      3
    3 │ ##1755      3
    4 │ ##1756      3
    5 │ ##1757      3
  ⋮   │   ⋮       ⋮
  995 │ ##2747      3
  996 │ ##2748      3
  997 │ ##2749      3
  998 │ ##2750      3
  999 │ ##2751      3
 1000 │ ##2752      3
 1001 │ ##2753      3
      955 rows omitted

julia> y=CSV.read("x.csv",DataFrame);

julia> transform(y,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B     
      │ Any     Int64 
──────┼───────────────
    1 │ C           3
    2 │ ##2755      3
    3 │ ##2755      3
    4 │ ##2755      3
    5 │ ##2755      3
  ⋮   │   ⋮       ⋮
  995 │ ##2755      3
  996 │ ##2755      3
  997 │ ##2755      3
  998 │ ##2755      3
  999 │ ##2755      3
 1000 │ ##2755      3
 1001 │ ##2755      3
      955 rows omitted

hdavid16 · August 6, 2021, 5:35pm

So, I found I can use the keyword argument pool = false when reading the .csv to fix this issue.
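For anyone landing here, a minimal sketch of that workaround, reusing the frame from the first post. The temporary path and the names y, z and n_distinct are only for illustration, and the behavior may differ across CSV.jl and PooledArrays.jl versions.

```julia
using CSV, DataFrames

# Same shape as the frame in the question: one string followed by 1000 missings.
x = DataFrame(A = vcat("C", repeat([missing], 1000)), B = 3)

# Write to a throwaway location (illustrative path, not from the original post).
path = joinpath(mktempdir(), "x.csv")
CSV.write(path, x)

# pool = false asks CSV.jl not to return pooled columns.
y = CSV.read(path, DataFrame; pool = false)

# The same transformation as in the question.
z = transform(y, :A => ByRow(v -> ismissing(v) ? gensym() : v) => :A)

# Number of distinct values in the transformed column.
n_distinct = length(unique(z.A))
```

If the column is not pooled, each missing row gets its own gensym() call, so n_distinct should equal the number of rows.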

@bkamins, should I submit an issue on DataFrames.jl for this? Or is this behavior expected when applying transform to a String? column that is pooled?

pdeffebach · August 6, 2021, 5:48pm

Can you try to make an MWE that omits DataFrames and file it at CSV.jl?

hdavid16 · August 6, 2021, 7:25pm

So it’s a CSV.jl issue? I thought it would be a DataFrames.jl issue since the problem occurs when trying to transform the data because it has been compressed by CSV.

pdeffebach · August 6, 2021, 7:27pm

It’s a SentinelArrays issue, and therefore a CSV issue, I think. Note that ByRow(f)(x) is just going to do f.(x), so ByRow isn’t doing anything unique with this.
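A quick way to check the ByRow(f)(x) versus f.(x) claim on an ordinary Vector. This is only a sketch with a side-effect-free function; square, via_byrow and via_broadcast are illustrative names, and it says nothing about how either form behaves on a pooled column.

```julia
using DataFrames

# A function with no side effects, applied to a plain Vector.
square = v -> v^2
data = [1, 2, 3]

# ByRow wraps the function so that it is applied element by element.
via_byrow = ByRow(square)(data)

# Ordinary broadcasting of the same function.
via_broadcast = square.(data)

# For a plain Vector and a pure function the two results agree.
same = via_byrow == via_broadcast
```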

hdavid16 · August 6, 2021, 7:57pm

It actually has to do with PooledArrays. Since these are compressed, I’m guessing all missing items are compressed to the same object, which is why they get replaced with the same Symbol with gensym().
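To illustrate that guess, here is a toy model in plain Julia. It is not the PooledArrays.jl implementation, and pool, refs and the result names are made up for this sketch: unique values are stored once, and each row holds an integer reference into that list.

```julia
# Toy pooled column: "C" and missing are each stored once in `pool`;
# `refs` says which pool entry each of the 6 rows points to.
pool = Union{Missing,String}["C", missing]
refs = vcat(1, fill(2, 5))

f = v -> ismissing(v) ? gensym() : v

# Strategy 1: call f once per pool entry, then expand through refs.
# gensym() runs a single time, so every missing row shares one symbol.
mapped_pool = map(f, pool)
pooled_result = mapped_pool[refs]

# Strategy 2: call f once per row.
# gensym() runs for every missing row, so each row gets its own symbol.
rowwise_result = [f(pool[r]) for r in refs]

n_pooled = length(unique(pooled_result))
n_rowwise = length(unique(rowwise_result))
```

Under the first strategy n_pooled is 2 ("C" plus one shared symbol), while under the second n_rowwise is 6. Mapping over the pool is only safe when f always returns the same output for the same input, which gensym() deliberately does not.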

So maybe pool=false should be the default, but perhaps there are other (better) reasons why pool=true is the default.

bkamins · August 6, 2021, 7:57pm

This is a DataFrames.jl/PooledArrays.jl issue. It is tracked in https://github.com/JuliaData/PooledArrays.jl/issues/63. I have opened https://github.com/JuliaData/DataFrames.jl/issues/2834 to make sure it is resolved soon in DataFrames.jl.

The reason is https://github.com/JuliaData/PooledArrays.jl/blob/35ecfd186c5e0f1aba1fc278e93766f3258f9cc3/src/PooledArrays.jl#L307
