Arrow changes a DataFrame column from type `Float32` to `Float32?` without missing values?

phantom · May 18, 2023, 7:23am

I’m sure I’m overlooking something simple here, but is there a reason why Arrow would convert a Float32 column in a DataFrame df to type Float32? when

all(x->typeof(x) == Float32, df.col)
true

and

any(ismissing.(df.col))
false

any(isnan.(df.col))
false

I have a very large GroupedDataFrame with Float32 columns, but for some reason, when I save the groups as Arrow tables with something like

for df in gdf
    Arrow.append("file path", df)
end

some columns are converted to type Float32?. There are no missing or -0.0 values in any of the columns, so I’m wondering whether anything else might trigger this conversion. Thanks!
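
The checks above look at the stored values, not at the column's declared element type, and the two can disagree. A minimal sketch of that difference, using a made-up three-row column (the `df` and variable names here are illustrative, not from the original data). It rests on the assumption that the `?` in the header and the type Arrow writes both follow `eltype` rather than the values:

```julia
using DataFrames

# Hypothetical column: every value is a Float32, but the declared eltype allows missing.
df = DataFrame(col = Union{Missing, Float32}[1.5f0, 2.5f0, 3.5f0])

values_are_float32 = all(x -> typeof(x) == Float32, df.col)  # inspects values only
has_missing = any(ismissing, df.col)                         # inspects values only
declared_eltype = eltype(df.col)                             # inspects the column type

println((values_are_float32, has_missing, declared_eltype))
```

Checking `eltype(df.col)` directly is therefore a quicker way to see whether a column is already `Union{Missing, Float32}` before it is written.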

nilshg · May 18, 2023, 8:51am

A reproducer including versioninfo() and ]st would be good; I can’t reproduce this naively:

julia> using Arrow, DataFrames

julia> df = DataFrame(group = rand('a':'d', 100), val = rand(Float32, 100))
100×2 DataFrame
 Row │ group  val
     │ Char   Float32
─────┼───────────────────
   1 │ c      0.439414
   2 │ b      0.625717

julia> gdf = groupby(df, :group);

julia> for df ∈ gdf
           Arrow.append("test.arrow", df)
       end

julia> DataFrame(Arrow.Table("test.arrow"))
100×2 DataFrame
 Row │ group  val
     │ Char   Float32
─────┼───────────────────
   1 │ c      0.439414
   2 │ c      0.402496

with Julia 1.9 and

(jl_q61HKQ) pkg> st
Status `\Temp\jl_q61HKQ\Project.toml`
  [69666777] Arrow v2.5.2
  [a93c6f00] DataFrames v1.5.0

phantom · May 18, 2023, 11:02am

Thanks for this helpful insight! Sorry, I thought I couldn’t reproduce the error in smaller examples, maybe because of the large size of the actual file. It seems, though, that creating new columns by looping through a GroupedDataFrame creates columns of type T? (that is, Union{Missing, T}) even when the inputs and outputs are all of uniform type.

DF = DataFrame(ID = ["a", "a", "a", "b", "b", "b"], start = rand(Float32, 6) .* 20 .- 10, stop = rand(Float32, 6) .* 20 .- 10);

gdf = groupby(DF, :ID);

for df in gdf
    df.d = df.start - df.stop
end

julia> gdf
GroupedDataFrame with 2 groups based on key: ID
First Group (3 rows): ID = "a"
 Row │ ID      start      stop     d         
     │ String  Float32    Float32  Float32?   # d's type changed even though no missing values
─────┼───────────────────────────────────────
   1 │ a        2.41753   5.20046   -2.78293
   2 │ a       -9.18899   9.00296  -18.1919
   3 │ a        0.750689  3.39335   -2.64267
⋮
Last Group (3 rows): ID = "b"
 Row │ ID      start     stop      d        
     │ String  Float32   Float32   Float32? 
─────┼──────────────────────────────────────
   1 │ b        1.77128   3.74756  -1.97628
   2 │ b        2.77554   4.27606  -1.50052
   3 │ b       -8.09931  -5.81249  -2.28682

Whereas without looping, the resulting column stays Float32:

DF.d = DF.start - DF.stop;

DF
6×4 DataFrame
 Row │ ID      start      stop      d         
     │ String  Float32    Float32   Float32   
─────┼────────────────────────────────────────
   1 │ a        2.41753    5.20046   -2.78293
   2 │ a       -9.18899    9.00296  -18.1919
   3 │ a        0.750689   3.39335   -2.64267
   4 │ b        1.77128    3.74756   -1.97628
   5 │ b        2.77554    4.27606   -1.50052
   6 │ b       -8.09931   -5.81249   -2.28682

or

transform!(DF, [:start, :stop] => ((x,y) -> x-y) => :d)
6×4 DataFrame
 Row │ ID      start      stop      d         
     │ String  Float32    Float32   Float32   
─────┼────────────────────────────────────────
   1 │ a        2.41753    5.20046   -2.78293
   2 │ a       -9.18899    9.00296  -18.1919
   3 │ a        0.750689   3.39335   -2.64267
   4 │ b        1.77128    3.74756   -1.97628
   5 │ b        2.77554    4.27606   -1.50052
   6 │ b       -8.09931   -5.81249   -2.28682

I guess this is because, as each SubDataFrame in the GroupedDataFrame is populated in the loop, the rows belonging to the remaining SubDataFrames are filled with ‘missing’ values as temporary placeholders? Is there a workaround for this that avoids making another copy of gdf, if looping through the GroupedDataFrame is desirable?

I am running Julia Version 1.8.5 with Arrow v2.5.2 and DataFrames v1.5.0
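
The guess above can be checked by stepping through the groups by hand instead of looping. This is a minimal sketch with fixed illustrative values in place of `rand`. It assumes DataFrames 1.3 or later, where a new column may be added through a SubDataFrame. The last step uses `disallowmissing!` as one possible after-the-fact fix, which is not something proposed in the thread:

```julia
using DataFrames

# Illustrative data; the values are arbitrary.
DF = DataFrame(ID = ["a", "a", "a", "b", "b", "b"],
               start = Float32[1, 2, 3, 4, 5, 6],
               stop  = Float32[6, 5, 4, 3, 2, 1])
gdf = groupby(DF, :ID)

# Adding :d through the first group creates the column in the parent;
# rows outside that group have no value yet and are filled with missing.
g1 = gdf[1]
g1.d = g1.start - g1.stop
eltype_after_first = eltype(DF.d)
missing_after_first = count(ismissing, DF.d)

# Filling the second group removes the missing values but not the Union eltype.
g2 = gdf[2]
g2.d = g2.start - g2.stop
eltype_after_all = eltype(DF.d)
missing_after_all = count(ismissing, DF.d)

# Narrow the column type once every row has a value.
disallowmissing!(DF, :d)
eltype_after_fix = eltype(DF.d)
```

Because the column type is fixed when the column is first created, it stays `Union{Missing, Float32}` after the loop finishes even though no `missing` remains.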

jules · May 18, 2023, 11:44am

Create a placeholder column first, then overwrite it in the loop.
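
A minimal sketch of that suggestion, with fixed illustrative values in place of `rand`. The zero-filled placeholder and the in-place `df[:, :d] = ...` assignment are one way to do it, chosen here as an assumption rather than taken from the post:

```julia
using DataFrames

# Illustrative data; the values are arbitrary.
DF = DataFrame(ID = ["a", "a", "a", "b", "b", "b"],
               start = Float32[1, 2, 3, 4, 5, 6],
               stop  = Float32[6, 5, 4, 3, 2, 1])

# Placeholder column with the final element type, created on the parent.
DF.d = zeros(Float32, nrow(DF))

gdf = groupby(DF, :ID)
for df in gdf
    # In-place write into the existing column; no missing values are introduced.
    df[:, :d] = df.start .- df.stop
end

eltype(DF.d)  # Float32
```

Since the column already exists with element type Float32 when the loop starts, each group only overwrites its own rows and the type is never widened.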

DrChainsaw · May 19, 2023, 9:27pm

Another option might be combine with ungroup=false:

julia> combine(groupby(DF, :ID), All(), [:start, :stop] => ((start, stop) -> start .- stop) => :d; ungroup=false)
GroupedDataFrame with 2 groups based on key: ID
First Group (3 rows): ID = "a"
 Row │ ID      start     stop       d
     │ String  Float32   Float32    Float32    
─────┼─────────────────────────────────────────
   1 │ a       -9.21761  -0.134445   -9.08316
   2 │ a       -9.05425   3.46452   -12.5188
   3 │ a       -1.17211  -0.538907   -0.633204
⋮
Last Group (3 rows): ID = "b"
 Row │ ID      start     stop       d
     │ String  Float32   Float32    Float32    
─────┼─────────────────────────────────────────
   1 │ b       -5.62954  -4.84692    -0.782621
   2 │ b       -4.25461   7.19271   -11.4473
   3 │ b        8.72539   0.995859    7.72953

phantom · May 22, 2023, 2:03pm

Thanks so much! I didn’t realize ungroup was an option.
