I am using Arrow.jl to store a dataframe to disk. The dataframe has some columns of `CategoricalArray{T}` type, where `T` is either `String` or `Int64`.
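A minimal sketch of the round trip I mean (the file name and toy data are just for illustration):

```julia
using Arrow, DataFrames, CategoricalArrays

df = DataFrame(a = categorical(["x", "y", "x"]),
               b = categorical([1, 2, 1]),
               c = [1.0, 2.0, 3.0])

Arrow.write("df.arrow", df)              # categorical columns get dict-encoded
df2 = DataFrame(Arrow.Table("df.arrow"))

typeof.(eachcol(df))   # CategoricalArray columns
typeof.(eachcol(df2))  # no CategoricalArray in sight
```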
When I read the dataframe back in, the categorical-ness of these columns is not preserved, even though the `Arrow.Table` object does have those columns correctly marked as `DictEncode`'d.
Mechanically, I think this is because the `DataFrame` constructor works on any object implementing the Tables.jl interface, whereas `Arrow.DictEncode` is an Arrow-specific implementation detail that the constructor is not aware of.
Would it be possible to generically convert such columns back to categorical arrays, or would this require assumptions about the data (since multiple original Julia types can get Arrow-serialized with the `DictEncode` property)?
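For concreteness, this is the sort of blanket conversion I have in mind. It bakes in the assumption that every dict-encoded column was originally categorical, which is exactly what I'm not sure is safe in general (the helper name `recategorize!` is mine, not from either package):

```julia
using Arrow, DataFrames, CategoricalArrays

# Assumption: every Arrow.DictEncoded column should become a CategoricalArray.
function recategorize!(df::DataFrame, tbl::Arrow.Table)
    for name in names(df)
        if getproperty(tbl, Symbol(name)) isa Arrow.DictEncoded
            df[!, name] = categorical(df[!, name])
        end
    end
    return df
end

tbl = Arrow.Table("df.arrow")
df2 = recategorize!(DataFrame(tbl), tbl)
```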
If not, I can still do the conversions back to `CategoricalArray` manually, but I notice that `Base.summarysize()` does not decrease and remains considerably larger than the `Base.summarysize()` of the original dataframe, before serialization.
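What I'm measuring, roughly (continuing the toy example above; the large sizes I see are from my real data, not this snippet):

```julia
tbl = Arrow.Table("df.arrow")
df2 = recategorize!(DataFrame(tbl), tbl)  # manual conversion back

Base.summarysize(df)    # original, pre-serialization dataframe
Base.summarysize(df2)   # remains considerably larger
```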
EDIT: Actually, I just realized that the larger memory of the imported dataframe is due to all columns being much bigger in size, not just the categoricals, so the previous paragraph doesn't apply. This can be fixed by simply converting them to `Vector`s from `Arrow.Primitive{T, Vector{T}}`, which is what they get read as (see the sketch below). Hmm, why?
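The fix I mean, sketched on the toy example (assuming the non-categorical columns come back as `Arrow.Primitive`):

```julia
# Materialize the Arrow-backed columns as plain Vectors.
for name in names(df2)
    col = df2[!, name]
    if col isa Arrow.Primitive
        df2[!, name] = Vector(col)
    end
end

Base.summarysize(df2)   # now comparable to the original
```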