As I understand it, you are doing something like the following:
julia> using Arrow, CategoricalArrays, DataFrames

julia> df = DataFrame(a = 1:4, b = string.('a':'d'), c = categorical(["x", "x", "y", "y"]))
4×3 DataFrame
 Row │ a      b       c
     │ Int64  String  Cat…
─────┼─────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> afn = Arrow.write("./df.arrow", df)
"./df.arrow"

julia> df1 = DataFrame(Arrow.Table(afn))
4×3 DataFrame
 Row │ a      b       c
     │ Int64  String  String
─────┼───────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> typeof(df1.c)
Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}
You cannot “round trip” DataFrame → Arrow → DataFrame and get the same column types back. Is there a reason that you need a CategoricalArray instead of the Arrow.DictEncoded result? The Arrow.DictEncoded result can in some circumstances take up less storage than the CategoricalArray, because it uses the smallest signed integer type available for the refarray (Int8 in this case).
Arrow.DictEncoded is more like a PooledArray than a CategoricalArray, but often the distinctions are not important. They can be important for ordered categorical arrays. I think it is still the case that the Arrow.Table function ignores whether DictEncoded arrays in the Arrow file have ordered categories.
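
If you do need a CategoricalArray back, one option is to convert the column after reading. This is only a minimal sketch: it assumes that categorical accepts the DictEncoded vector like any other AbstractVector, and that you already know the intended level order. The levels are supplied by hand here, since I would not rely on the ordering being restored from the file.

```julia
using Arrow, CategoricalArrays, DataFrames

# Start from a frame with an ordered categorical column.
df = DataFrame(a = 1:4, c = categorical(["x", "x", "y", "y"]; ordered = true))

# Arrow.write returns the path it wrote to.
path = Arrow.write(tempname(), df)

# After reading, column c is a dictionary-encoded Arrow vector, not a CategoricalArray.
df1 = DataFrame(Arrow.Table(path))

# Rebuild the CategoricalArray by hand. The level order and the ordered flag
# are assumptions supplied by the caller, not values read from the Arrow file.
df1.c = categorical(df1.c; levels = ["x", "y"], ordered = true)
```

After this, df1.c behaves like the original column, for example in comparisons such as df1.c .< "y" that only make sense for ordered levels.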