I have a very large GroupedDataFrame with Float32 columns, but for some reason, when I save the groups as Arrow tables with something like
for df in gdf
    Arrow.append("file path", df)
end
some columns are converted to type Float32? (that is, Union{Missing, Float32}). There are no missing or -0.0 values in any of the columns, so I’m wondering whether anything else might trigger this conversion. Thanks!
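One way to narrow this down is to look at the element type of each column rather than at its values, since a column can have element type Union{Missing, Float32} without holding a single missing value, and the element type is most likely what ends up in the written schema. The following is only a minimal sketch: the small `df` is a hypothetical stand-in for the real data, and the variable names `nullable` and `convertible` are made up for illustration.

```julia
using DataFrames

# Hypothetical stand-in for the real data: one strict Float32 column and one
# column whose element type allows missing although no value is missing.
df = DataFrame(a = Float32[1, 2, 3],
               b = Union{Missing, Float32}[4, 5, 6])

# Names of the columns whose element type admits `missing`.
nullable = [n for n in names(df) if Missing <: eltype(df[!, n])]

# Of those, the columns that contain no actual missing value.
convertible = [n for n in nullable if !any(ismissing, df[!, n])]

# Narrow those columns back to the strict element type, in place.
disallowmissing!(df, convertible)
```

For a GroupedDataFrame the same check would presumably be run on `parent(gdf)`, since `disallowmissing!` works on a DataFrame and not on the SubDataFrame of a single group.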
Thanks for this helpful insight! Sorry, I thought I couldn’t reproduce the error in smaller examples, perhaps because of the large size of the actual file. It seems, though, that creating a new column by looping through a GroupedDataFrame produces a column of type T? (Union{Missing, T}) even when the inputs and outputs are all of uniform type.
using DataFrames
DF = DataFrame(ID = ["a", "a", "a", "b", "b", "b"], start = rand(Float32, 6) .* 20 .- 10, stop = rand(Float32, 6) .* 20 .- 10)
gdf = groupby(DF, :ID)
for df in gdf
    df.d = df.start - df.stop  # adds column d through the SubDataFrame of each group
end
julia> gdf
GroupedDataFrame with 2 groups based on key: ID
First Group (3 rows): ID = "a"
Row │ ID start stop d
│ String Float32 Float32 Float32? # d's type changed even though no missing values
─────┼───────────────────────────────────────
1 │ a 2.41753 5.20046 -2.78293
2 │ a -9.18899 9.00296 -18.1919
3 │ a 0.750689 3.39335 -2.64267
⋮
Last Group (3 rows): ID = "b"
Row │ ID start stop d
│ String Float32 Float32 Float32?
─────┼──────────────────────────────────────
1 │ b 1.77128 3.74756 -1.97628
2 │ b 2.77554 4.27606 -1.50052
3 │ b -8.09931 -5.81249 -2.28682
Without looping, on the other hand, the resulting column stays Float32:
DF.d = DF.start - DF.stop ;
DF
6×4 DataFrame
Row │ ID start stop d
│ String Float32 Float32 Float32
─────┼────────────────────────────────────────
1 │ a 2.41753 5.20046 -2.78293
2 │ a -9.18899 9.00296 -18.1919
3 │ a 0.750689 3.39335 -2.64267
4 │ b 1.77128 3.74756 -1.97628
5 │ b 2.77554 4.27606 -1.50052
6 │ b -8.09931 -5.81249 -2.28682
or
transform!(DF, [:start, :stop] => ((x,y) -> x-y) => :d)
6×4 DataFrame
Row │ ID start stop d
│ String Float32 Float32 Float32
─────┼────────────────────────────────────────
1 │ a 2.41753 5.20046 -2.78293
2 │ a -9.18899 9.00296 -18.1919
3 │ a 0.750689 3.39335 -2.64267
4 │ b 1.77128 3.74756 -1.97628
5 │ b 2.77554 4.27606 -1.50052
6 │ b -8.09931 -5.81249 -2.28682
I guess this is because, as each SubDataFrame in the GroupedDataFrame is populated in the loop, the rows belonging to the remaining SubDataFrames are filled with ‘missing’ values as temporary placeholders? Is there a workaround for this that doesn’t require making another copy of gdf, if looping through the GroupedDataFrame is desirable?
I am running Julia Version 1.8.5 with Arrow v2.5.2 and DataFrames v1.5.0.
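If the loop has to stay, one possible workaround is to narrow the column on the parent DataFrame once the loop has finished. This is a minimal sketch only, assuming the behaviour shown above for DataFrames v1.5.0; the fixed input values are made up so that the example is reproducible, and the second column name `d2` is hypothetical.

```julia
using DataFrames

# Made-up fixed values in place of the random ones, for reproducibility.
DF = DataFrame(ID = ["a", "a", "a", "b", "b", "b"],
               start = Float32[1, 2, 3, 4, 5, 6],
               stop = Float32[6, 5, 4, 3, 2, 1])
gdf = groupby(DF, :ID)

# Same loop as in the question: the new column is created through each group.
for df in gdf
    df.d = df.start - df.stop
end

# Every row is filled by now, so the column can be narrowed in place
# on the parent. No copy of gdf is made.
disallowmissing!(parent(gdf), :d)

# Alternative without the loop: transform! applies the function per group
# and writes the result into the parent DataFrame.
transform!(gdf, [:start, :stop] => ((start, stop) -> start .- stop) => :d2;
           ungroup = false)
```

Both variants modify the parent DataFrame in place, so the groups in `gdf` should see the narrowed column without being regrouped.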
Another option might be combine with ungroup=false:
julia> combine(groupby(DF, :ID), All(), [:start, :stop] => ((start, stop) -> start .- stop) => :d; ungroup=false)
GroupedDataFrame with 2 groups based on key: ID
First Group (3 rows): ID = "a"
Row │ ID start stop d
│ String Float32 Float32 Float32
─────┼─────────────────────────────────────────
1 │ a -9.21761 -0.134445 -9.08316
2 │ a -9.05425 3.46452 -12.5188
3 │ a -1.17211 -0.538907 -0.633204
⋮
Last Group (3 rows): ID = "b"
Row │ ID start stop d
│ String Float32 Float32 Float32
─────┼─────────────────────────────────────────
1 │ b -5.62954 -4.84692 -0.782621
2 │ b -4.25461 7.19271 -11.4473
3 │ b 8.72539 0.995859 7.72953