Arrow's DictEncode to CategoricalArray?

I am using Arrow.jl to store a dataframe to disk. The dataframe has some columns of CategoricalArray{T} type, where T is either String or Int64.

When I read the dataframe back in, the categorical-ness of these columns is not preserved, even though the Arrow.Table object does have those columns correctly marked as Arrow.DictEncoded.

Mechanically, I think this is because the DataFrame constructor works on any object implementing the Tables.jl interface, whereas Arrow.DictEncoded is an Arrow-specific implementation detail that the constructor is not aware of.

Would it be possible to generically convert such columns back to categorical arrays or would this require assumptions about the data (i.e., because multiple original Julia types can get Arrow-serialized with the DictEncode property)?

If not, I can still do the conversions back to CategoricalArray manually, but I notice that Base.summarysize() does not decrease and remains considerably larger than the Base.summarysize() of the original dataframe, before serialization.

EDIT: Actually, I just realized that the larger memory of the imported dataframe is due to all columns being much bigger in size, not the categoricals. So the previous paragraph doesn’t apply. This can be fixed by just converting them to Vectors from Arrow.Primitive{T, Vector{T}}, which is what they get read as. Hmm – why?

OP here.

I figured out the size issue, I think. copy()ing the imported dataframe using df = Arrow.Table(path) |> DataFrame |> copy shrinks Base.summarysize() back to the size of the original dataframe. Now that I think about it, I’m not even sure what Base.summarysize(df)'s output means for df = Arrow.Table(path) |> DataFrame since df’s columns are just views of data still on disk, if I understand correctly. So if df won’t be modified, it’s probably best not to worry about Base.summarysize()'s output. If df will be modified, then copy()ing is useful.
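To make the difference concrete, here is a minimal sketch of the two loading styles. The file path and toy data are made up for illustration, and the exact column types may vary with the Arrow.jl and DataFrames.jl versions in use.

```julia
using Arrow, DataFrames

# Hypothetical temporary file, used only for this illustration.
path = tempname() * ".arrow"
Arrow.write(path, DataFrame(a = collect(1:4), b = string.('a':'d')))

lazy = DataFrame(Arrow.Table(path))  # columns stay Arrow-backed array types
eager = copy(lazy)                   # copy materializes each column in memory

println(typeof(lazy.a))   # an Arrow array wrapper, as described above
println(typeof(eager.a))  # expected to be a plain Vector
```

If the columns need to be mutated later, the copied version is the one to keep.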

My original question – about preserving CategoricalArray types – remains. If it turns out to be impossible with Arrow, is there another lossless (i.e., not CSV) and stable (i.e., not generic serialization) method for storing dataframes on disk? Something akin to pickle in pandas?

As I understand it, you are doing something like

julia> using Arrow, CategoricalArrays, DataFrames

julia> df = DataFrame(a = 1:4, b = string.('a':'d'), c = categorical(["x", "x", "y", "y"]))
4×3 DataFrame
 Row │ a      b       c    
     │ Int64  String  Cat… 
─────┼─────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> afn = Arrow.write("./df.arrow", df)
"./df.arrow"

julia> df1 = DataFrame(Arrow.Table(afn))
4×3 DataFrame
 Row │ a      b       c      
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> typeof(df1.c)
Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}

It won’t be the case that you can “round trip” DataFrame → Arrow → DataFrame and get the same types. Is there a reason that you need a CategoricalArray instead of the Arrow.DictEncoded result? The Arrow.DictEncoded result can in some circumstances take up less storage than the CategoricalArray, because it uses the smallest signed integer type available for the refarray (Int8 in this case).

Arrow.DictEncoded is more like a PooledArray than a CategoricalArray, but often the distinctions are not important. They can be important for ordered categorical arrays. I think it is still the case that the Arrow.Table function ignores whether DictEncoded arrays in the Arrow file have ordered categories.

I think it is still the case, because DataFrame(tbl::Arrow.Table) dispatches to the generic DataFrame constructor that accepts any Tables.jl-compatible source. This is elegant composition, because as long as your package can present data in tabular form, you don’t need to provide a DataFrame constructor for it. But it does mean that DataFrame won’t be aware of any format-specific metadata, such as Arrow’s DictEncoded. I think?

As to your first question, I’m using CategoricalArray because order matters for some applications, and because some of the statistical analysis I plan on doing with the data needs them to be categorical. But I suppose your implication is correct – the choice of formats for storing data should be mainly driven by performance and size considerations. Then once I start doing analysis, I can make whatever in-memory conversions are appropriate.
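As a rough sketch of such an in-memory conversion, the level order can be imposed when rebuilding the categorical vector. The plain vector `raw` below is a hypothetical stand-in for a column read back from Arrow, and the chosen level order is only an example.

```julia
using CategoricalArrays

# Stand-in for a column read back from Arrow (hypothetical data).
raw = ["x", "x", "y", "y"]

# Rebuild the categorical vector with an explicit level order.
c = categorical(raw; levels = ["y", "x"], ordered = true)

levels(c)     # level order as specified above
isordered(c)  # true, so comparisons such as c[1] > c[3] are meaningful
```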

If you use the formula/data syntax provided by StatsModels.jl and specify the contrasts to be used for your categorical columns, you can impose the order there. Often you just need to specify the base level in the contrasts, e.g.

contrasts = Dict(:c => EffectsCoding(base="y"))

That formula/data syntax is used by GLM.jl and MixedModels.jl.
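For illustration, a contrasts dictionary like this might be passed to a GLM.jl fit as follows. The data frame and response values are invented, and this is only a sketch of the keyword usage, not a recommended model.

```julia
using DataFrames, GLM, StatsModels

# Hypothetical toy data: response y and a two-level predictor c.
df = DataFrame(y = [1.0, 2.0, 3.0, 5.0, 4.0, 6.0],
               c = ["x", "x", "x", "y", "y", "y"])

# Treat "y" as the base level of c when building the model matrix.
contrasts = Dict(:c => EffectsCoding(base = "y"))
model = lm(@formula(y ~ c), df; contrasts = contrasts)

coefnames(model)  # intercept plus one effects-coded term for c
```

Because the base level is set in the contrasts, the column itself can stay a plain string or dictionary-encoded column.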

If you want to save and reload data frames without losing any type information, you can use JLD2.jl. Otherwise you’ll have to recreate CategoricalArrays manually like this:

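using DataAPI
# Rebuild a CategoricalArray from each dictionary-encoded column,
# setting the levels in the order stored in the Arrow dictionary.
mapcols!(df1) do col
     col isa Arrow.DictEncoded ? levels!(categorical(col), DataAPI.refpool(col)) : col
end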

Is JLD2 serialization compatible across (minor) version changes of Julia or the JLD2 package?

Yes, it should be. Though it could break if the layout of the DataFrame, CategoricalArray or another type used in the data changes.

It’s kind of unfortunate that reloading data from Arrow is tricky for CategoricalArray, as Arrow is a stable format with all the necessary functionality under the hood. Maybe we could have an argument that makes this a bit easier when reloading an Arrow file.
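A minimal sketch of the JLD2 route, assuming a throwaway file path and toy data; whether a file written this way loads under other package versions depends on the caveat above.

```julia
using CategoricalArrays, DataFrames, JLD2

df = DataFrame(a = 1:4, c = categorical(["x", "x", "y", "y"]; ordered = true))

path = tempname() * ".jld2"   # hypothetical temporary file
save_object(path, df)         # stores the DataFrame with its Julia types
df2 = load_object(path)

typeof(df2.c)  # expected to still be a CategoricalVector
```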

This feels like a great case for using ArrowTypes and a package extension to CategoricalArrays.

@palday Can you explain this in a bit more detail?

Indeed, thanks for the pointer. Though I’ve experimented a bit and I couldn’t get Arrow to generate a CategoricalArray when loading, only a Vector{<:CategoricalValue}. The problem is that ArrowTypes methods are defined for scalars (CategoricalValue here), but I couldn’t find a way to have it operate at the array level. Any ideas?

After some additional investigation, I’ve found code that gives the intended result. But it’s a bit hacky as I need to override the DictEncoding constructor so that it stores CategoricalValue objects. Otherwise they would have to be created on the fly for each element and they couldn’t share the same pool. Maybe a more generic API could be added to Arrow to make this cleaner @quinnj?

using Arrow, ArrowTypes, CategoricalArrays, DataFrames

# Serialize CategoricalValue elements as a dictionary-encoded column.
ArrowTypes.ArrowType(::Type{<:CategoricalValue}) = Arrow.DictEncoded

# Tag the column with an extension name, and store the reference type R as metadata.
ArrowTypes.arrowname(::Type{<:CategoricalValue}) = Symbol("JuliaLang.CategoricalArray")
ArrowTypes.arrowmetadata(::Type{CategoricalValue{T, R}}) where {T, R} = string(R)

# On read, map the extension name and metadata back to a CategoricalValue type.
const REFTYPES = Dict(string(T) => T for T in (Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8))
function ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.CategoricalArray")}, ::Type{S}, meta::String) where S
    R = REFTYPES[meta]
    return CategoricalValue{S, R}
end

# Hack: override the DictEncoding constructor so that the dictionary is stored as a
# CategoricalVector, letting all CategoricalValue objects share one pool (List-backed data).
function Arrow.DictEncoding{V,S,A}(id, data::Arrow.List{U, O, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, O, A, B, U}
    newdata = Arrow.List{T, O, B}(data.arrow, data.validity, data.offsets, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

# Same override for dictionaries backed by Arrow.Primitive data.
function Arrow.DictEncoding{V,S,A}(id, data::Arrow.Primitive{U, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, A, B, U}
    newdata = Arrow.Primitive{T, B}(data.arrow, data.validity, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

# Materialize the column as a CategoricalVector when copied.
function Base.copy(x::Arrow.DictEncoded{V}) where {T, R, V<:CategoricalValue{T, R}}
    pool = CategoricalArrays.CategoricalPool{T, R}(x.encoding.data)
    inds = x.indices
    refs = similar(inds, R)
    refs .= inds .+ one(R)  # Arrow indices are 0-based, CategoricalArrays refs are 1-based
    return CategoricalVector{T}(refs, pool)
end

This is pretty much exactly what I was expecting.

And do you think it’s clear enough to add to CategoricalArrays?