Arrow's DictEncode to CategoricalArray?

dmbates · April 3, 2024, 5:32pm

As I understand it, you are doing something like

julia> using Arrow, CategoricalArrays, DataFrames

julia> df = DataFrame(a = 1:4, b = string.('a':'d'), c = categorical(["x", "x", "y", "y"]))
4×3 DataFrame
 Row │ a      b       c    
     │ Int64  String  Cat… 
─────┼─────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> afn = Arrow.write("./df.arrow", df)
"./df.arrow"

julia> df1 = DataFrame(Arrow.Table(afn))
4×3 DataFrame
 Row │ a      b       c      
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> typeof(df1.c)
Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}

It won’t be the case that you can “round trip” DataFrame → Arrow → DataFrame and get the same types. Is there a reason that you need a CategoricalArray instead of the Arrow.DictEncoded result? The Arrow.DictEncoded result can in some circumstances take up less storage than the CategoricalArray, because it uses the smallest signed integer type that can hold the indices for the refarray (Int8 in this case).
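If you do need a CategoricalArray after reading, one option is to convert the column yourself. This is a minimal sketch, assuming that `categorical` accepts the `Arrow.DictEncoded` column like any other `AbstractVector`. It writes to an in-memory `IOBuffer` rather than `./df.arrow` only to keep the example self-contained.

```julia
using Arrow, CategoricalArrays, DataFrames

df = DataFrame(a = 1:4, c = categorical(["x", "x", "y", "y"]))

# Round trip through an in-memory buffer instead of a file on disk.
io = IOBuffer()
Arrow.write(io, df)
df1 = DataFrame(Arrow.Table(take!(io)))

# df1.c comes back as an Arrow.DictEncoded vector; rebuild a CategoricalArray from it.
# The levels are inferred from the values, not read from the Arrow dictionary.
df1.c = categorical(df1.c)

@show typeof(df1.c)
@show levels(df1.c)
```

Note that the conversion copies the column, so you lose the memory-mapped, read-only Arrow representation for that column.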

Arrow.DictEncoded is more like a PooledArray than a CategoricalArray, but often the distinctions are not important. They can be important for ordered categorical arrays. I think it is still the case that the Arrow.Table function ignores whether DictEncoded arrays in the Arrow file have ordered categories.
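If the ordering matters, a possible workaround is to re-apply the levels and the ordered flag yourself after reading. This is only a sketch: the column name `grade` and the levels are made up for illustration, and it assumes you know the intended level order independently of the Arrow file.

```julia
using Arrow, CategoricalArrays, DataFrames

# Hypothetical ordered column whose level order is not alphabetical.
lvls = ["low", "mid", "high"]
df = DataFrame(grade = categorical(["low", "high", "mid", "low"]; levels = lvls, ordered = true))

io = IOBuffer()
Arrow.write(io, df)
df1 = DataFrame(Arrow.Table(take!(io)))

# Rebuild the column with explicit levels and ordering instead of relying on the file.
df1.grade = categorical(df1.grade; levels = lvls, ordered = true)

@show isordered(df1.grade)
@show df1.grade[1] < df1.grade[2]   # compares "low" and "high" under the declared order
```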
