Arrow's DictEncode to CategoricalArray?

elenev · April 3, 2024, 5:37am

I am using Arrow.jl to store a dataframe to disk. The dataframe has some columns of CategoricalArray{T} type, where T is either String or Int64.

When I read the dataframe back in, the categorical-ness of these columns is not preserved, even though the Arrow.Table object does have those columns correctly marked as DictEncoded.

Mechanically, I think this is because the DataFrames constructor works on any object implementing the Tables.jl interface, whereas Arrow.DictEncoded is an Arrow-specific implementation detail that the constructor is not aware of.

Would it be possible to generically convert such columns back to categorical arrays or would this require assumptions about the data (i.e., because multiple original Julia types can get Arrow-serialized with the DictEncode property)?

If not, I can still do the conversions back to CategoricalArray manually, but I notice that Base.summarysize() does not decrease and remains considerably larger than the Base.summarysize() of the original dataframe, before serialization.

EDIT: Actually, I just realized that the larger memory of the imported dataframe is due to all columns being much bigger in size, not the categoricals. So the previous paragraph doesn’t apply. This can be fixed by just converting them to Vectors from Arrow.Primitive{T, Vector{T}}, which is what they get read as. Hmm – why?

elenev · April 3, 2024, 3:51pm

OP here.

I figured out the size issue, I think. copy()ing the imported dataframe using df = Arrow.Table(path) |> DataFrame |> copy shrinks Base.summarysize() back to the size of the original dataframe. Now that I think about it, I’m not even sure what Base.summarysize(df)'s output means for df = Arrow.Table(path) |> DataFrame since df’s columns are just views of data still on disk, if I understand correctly. So if df won’t be modified, it’s probably best not to worry about Base.summarysize()'s output. If df will be modified, then copy()ing is useful.
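For anyone landing here later, this is a minimal sketch of the difference between the lazily wrapped columns and the copied ones. The scratch file and the toy columns are made up for illustration, and the exact wrapper type parameters may vary with the Arrow.jl version.

```julia
using Arrow, DataFrames

# Hypothetical scratch file, only for this illustration
path = joinpath(mktempdir(), "df.arrow")
Arrow.write(path, DataFrame(a = 1:4, b = [1.0, 2.0, 3.0, 4.0]))

lazy  = DataFrame(Arrow.Table(path))   # columns are Arrow wrappers over the file's buffers
owned = copy(lazy)                     # columns are materialized as ordinary Julia arrays

@show typeof(lazy.b)    # an Arrow.Primitive wrapper
@show typeof(owned.b)   # a plain Vector
```

The copied frame holds the same values but owns its memory, so it can be modified and Base.summarysize() reports what you would expect.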

My original question – about preserving CategoricalArray types – remains. If it turns out to be impossible with Arrow, is there another lossless (i.e., not CSV) and stable (i.e., not generic serialization) method for storing dataframes on disk? Something akin to pickle in pandas?

dmbates · April 3, 2024, 5:32pm

As I understand it, you are doing something like

julia> using Arrow, CategoricalArrays, DataFrames

julia> df = DataFrame(a = 1:4, b = string.('a':'d'), c = categorical(["x", "x", "y", "y"]))
4×3 DataFrame
 Row │ a      b       c    
     │ Int64  String  Cat… 
─────┼─────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> afn = Arrow.write("./df.arrow", df)
"./df.arrow"

julia> df1 = DataFrame(Arrow.Table(afn))
4×3 DataFrame
 Row │ a      b       c      
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  a       x
   2 │     2  b       x
   3 │     3  c       y
   4 │     4  d       y

julia> typeof(df1.c)
Arrow.DictEncoded{String, Int8, Arrow.List{String, Int32, Vector{UInt8}}}

It won’t be the case that you can “round trip” DataFrame → Arrow → DataFrame and get the same types. Is there a reason that you need a CategoricalArray instead of the Arrow.DictEncoded result? The Arrow.DictEncoded result can in some circumstances take up less storage than the CategoricalArray, because it uses the smallest signed integer type available for the refarray (Int8 in this case).

Arrow.DictEncoded is more like a PooledArray than a CategoricalArray but often the distinctions are not important. They can be important for ordered categorical arrays. I think it is still the case that the Arrow.Table function ignores whether DictEncoded arrays in the Arrow file have ordered categories.
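To illustrate what is lost when the ordering is dropped, here is a small sketch using only CategoricalArrays (the size labels are invented for the example). An ordered CategoricalArray compares values by level order, which a plain pooled or dictionary-encoded string column does not carry.

```julia
using CategoricalArrays

# Invented labels; the level order is set explicitly rather than alphabetically
sizes = categorical(["small", "large", "medium", "small"]; ordered = true)
levels!(sizes, ["small", "medium", "large"])

# Comparison follows the level order, not string order
small_lt_large = sizes[1] < sizes[2]

# Plain strings compare alphabetically, which gives the opposite answer here
string_lt = "small" < "large"
```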

elenev · April 3, 2024, 5:38pm

I think it IS still the case because DataFrame(afn::Arrow.Table) dispatches DataFrame(::Table). This is elegant composition because it means that as long as your package can get data in a tabular form, you don’t need to provide a DataFrame constructor for it. But it does mean that DataFrame won’t be aware of any format-specific metadata, such as Arrow’s DictEncoded. I think?

As to your first question, I’m using CategoricalArray because order matters for some applications, and because some of the statistical analysis I plan on doing with the data needs them to be categorical. But I suppose your implication is correct – the choice of formats for storing data should be mainly driven by performance and size considerations. Then once I start doing analysis, I can make whatever in-memory conversions are appropriate.

dmbates · April 3, 2024, 5:43pm

If you use the formula/data syntax provided by StatsModels.jl and specify the contrasts to be used for your categorical columns, you can impose the order there. Often you just need to specify the base level in the contrasts, e.g.

contrasts = Dict(:c => EffectsCoding(base="y"))

That formula/data syntax is used by GLM.jl and MixedModels.jl
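As a rough sketch of what such a contrasts hint does, the following uses StatsModels.jl directly on a plain string column, without fitting a model. The toy data are made up; in GLM.jl or MixedModels.jl the same Dict would normally be passed through the contrasts keyword.

```julia
using StatsModels

# Toy column table with invented values; `c` is an ordinary String column
data = (y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8],
        c = ["x", "x", "y", "y", "z", "z"])

f = @formula(y ~ 1 + c)
hints = Dict(:c => EffectsCoding(base = "y"))   # "y" becomes the base level

# Resolve the formula against the data, then build the response and model matrix
f_applied = apply_schema(f, schema(f, data, hints))
y, X = modelcols(f_applied, data)
cnames = coefnames(f_applied.rhs)
```

The rows of X that belong to the base level are coded -1 in both contrast columns, so the choice of base level is made in the model specification and does not depend on the column being a CategoricalArray.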

nalimilan · April 3, 2024, 8:26pm

If you want to save and reload data frames without losing any type information, you can use JLD2.jl. Otherwise you’ll have to recreate CategoricalArrays manually like this:

using DataAPI
mapcols!(df1) do col
     col isa Arrow.DictEncoded ? levels!(categorical(col), DataAPI.refpool(col)) : col
end
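Put together as a self-contained round trip, that could look like the sketch below. The scratch file and the non-alphabetical level order are chosen only for illustration, and the column contains no missing values (those need extra care, since missing can end up in the dictionary).

```julia
using Arrow, CategoricalArrays, DataAPI, DataFrames

# Levels deliberately not in alphabetical order, to check that the order survives
df = DataFrame(a = 1:4,
               c = categorical(["x", "x", "y", "y"]; levels = ["y", "x"]))

path = joinpath(mktempdir(), "df.arrow")   # hypothetical scratch file
Arrow.write(path, df)

df1 = DataFrame(Arrow.Table(path))
mapcols!(df1) do col
    # Rebuild a CategoricalArray and restore the level order stored in the dictionary
    col isa Arrow.DictEncoded ? levels!(categorical(col), DataAPI.refpool(col)) : col
end
```

Note that this restores the levels and their order, but not the ordered flag, which would have to be set again with ordered! if it matters.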

elenev · April 3, 2024, 8:27pm

Is JLD2 serialization compatible across (minor) version changes of Julia or the JLD2 package?

nalimilan · April 3, 2024, 9:23pm

Yes, it should. Though it could break if the layout of the DataFrame, CategoricalArray or another type used in the data changes.
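A minimal sketch of the JLD2 route, assuming the save_object/load_object convenience functions and a made-up scratch file:

```julia
using CategoricalArrays, DataFrames, JLD2

df = DataFrame(c = categorical(["x", "y", "x"]; ordered = true))

path = joinpath(mktempdir(), "df.jld2")   # hypothetical scratch file
save_object(path, df)                     # stores the DataFrame with its Julia types
df2 = load_object(path)                   # reads it back as the same Julia types
```

The column should come back as a CategoricalArray with its levels and ordered flag, subject to the caveat above about type layouts changing between package versions.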

It’s kind of unfortunate that reloading data from Arrow is tricky for CategoricalArray, as Arrow is a stable format with all the necessary functionality under the hood. Maybe we could have an argument that makes this a bit easier when reloading an Arrow file.

palday · April 4, 2024, 7:47pm

This feels like a great case for using ArrowTypes and a package extension to CategoricalArrays.

elenev · April 4, 2024, 10:27pm

@palday Can you explain this in a bit more detail?

nalimilan · April 5, 2024, 7:55am

Indeed, thanks for the pointer. Though I’ve experimented a bit and I couldn’t get Arrow to generate a CategoricalArray when loading, only a Vector{<:CategoricalValue}. The problem is that ArrowTypes methods are defined for scalars (CategoricalValue here), but I couldn’t find a way to have it operate at the array level. Any ideas?

nalimilan · April 7, 2024, 2:09pm

After some additional investigation, I’ve found code that gives the intended result. But it’s a bit hacky as I need to override the DictEncoding constructor so that it stores CategoricalValue objects. Otherwise they would have to be created on the fly for each element and they couldn’t share the same pool. Maybe a more generic API could be added to Arrow to make this cleaner @quinnj?

using Arrow, ArrowTypes, CategoricalArrays, DataFrames

ArrowTypes.ArrowType(::Type{<:CategoricalValue}) = Arrow.DictEncoded

ArrowTypes.arrowname(::Type{<:CategoricalValue}) = Symbol("JuliaLang.CategoricalArray")
ArrowTypes.arrowmetadata(::Type{CategoricalValue{T, R}}) where {T, R} = string(R)

const REFTYPES = Dict(string(T) => T for T in (Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8))
function ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.CategoricalArray")}, ::Type{S}, meta::String) where S
    R = REFTYPES[meta]
    return CategoricalValue{S, R}
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.List{U, O, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, O, A, B, U}
    newdata = Arrow.List{T, O, B}(data.arrow, data.validity, data.offsets, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.Primitive{U, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, A, B, U}
    newdata = Arrow.Primitive{T, B}(data.arrow, data.validity, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Base.copy(x::Arrow.DictEncoded{V}) where {T, R, V<:CategoricalValue{T, R}}
    pool = CategoricalArrays.CategoricalPool{T, R}(x.encoding.data)
    inds = x.indices
    refs = similar(inds, R)
    refs .= inds .+ one(R)
    return CategoricalVector{T}(refs, pool)
end

palday · April 10, 2024, 5:14pm

This is pretty much exactly what I was expecting.

nalimilan · April 17, 2024, 12:36pm

And do you think it’s clear enough to add to CategoricalArrays?

nikolays · February 7, 2025, 2:15pm

Awesome! Thank you very much. It worked as needed, and it should be pushed upstream to Arrow.jl or elsewhere appropriate.

The file written by Arrow.jl with @nalimilan’s code was successfully loaded in R with the arrow::read_ipc_file function. In R, the categorical column became a factor column out of the box, with the same level order.

Even if the categories are not truly ordinal, the order can be important for printing purposes.

nikolays · February 7, 2025, 3:36pm

Need to add handling of missing data to make it perfect. Here is the test code, which is not working:

# nalimilan addition
using Arrow, ArrowTypes, CategoricalArrays, DataFrames

ArrowTypes.ArrowType(::Type{<:CategoricalValue}) = Arrow.DictEncoded

ArrowTypes.arrowname(::Type{<:CategoricalValue}) = Symbol("JuliaLang.CategoricalArray")
ArrowTypes.arrowmetadata(::Type{CategoricalValue{T, R}}) where {T, R} = string(R)

const REFTYPES = Dict(string(T) => T for T in (Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8))
function ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.CategoricalArray")}, ::Type{S}, meta::String) where S
    R = REFTYPES[meta]
    return CategoricalValue{S, R}
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.List{U, O, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, O, A, B, U}
    newdata = Arrow.List{T, O, B}(data.arrow, data.validity, data.offsets, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.Primitive{U, B}, isOrdered, metadata) where {T, V<:CategoricalValue{T}, S, A, B, U}
    newdata = Arrow.Primitive{T, B}(data.arrow, data.validity, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Base.copy(x::Arrow.DictEncoded{V}) where {T, R, V<:CategoricalValue{T, R}}
    pool = CategoricalArrays.CategoricalPool{T, R}(x.encoding.data)
    inds = x.indices
    refs = similar(inds, R)
    refs .= inds .+ one(R)
    return CategoricalVector{T}(refs, pool)
end

function Base.copy(x::Arrow.DictEncoded{Union{Missing, V}}) where {T, R, V<:CategoricalValue{T, R}}
    @info "lets try"
    pool = CategoricalArrays.CategoricalPool{T, R}(x.encoding.data)
    inds = x.indices
    refs = similar(inds, R)
    refs .= inds .+ one(R)
    return CategoricalVector{T}(refs, pool)
end

# test code
df = DataFrame(
    col1 = categorical(["A","B","C","A","A","C", "C"], ordered=false,compress=true),
    col2 = categorical(["A","B","C","A","A",missing, "C"], ordered=false,compress=true)
)

Arrow.write("df.arrow", df, compress=:zstd)
atab = Arrow.Table("df.arrow")
df2 = DataFrame(atab; copycols=true) # failed here on copy of col2

nalimilan · February 7, 2025, 10:43pm

This code seems to work, but I haven’t checked it carefully yet and it’s not complete. I should really get this into CategoricalArrays or Arrow.jl.

const CATARRAY_ARROWNAME = Symbol("JuliaLang.CategoricalArray")
ArrowTypes.arrowname(::Type{<:CategoricalValue}) = CATARRAY_ARROWNAME
ArrowTypes.arrowmetadata(::Type{CategoricalValue{T, R}}) where {T, R} = string(R)

ArrowTypes.arrowname(::Type{Union{<:CategoricalValue, Missing}}) = CATARRAY_ARROWNAME
ArrowTypes.arrowmetadata(::Type{Union{CategoricalValue{T, R}, Missing}}) where {T, R} = string(R)

const REFTYPES = Dict(string(T) => T for T in (Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8))
function ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.CategoricalArray")}, ::Type{S}, meta::String) where S
    R = REFTYPES[meta]
    return CategoricalValue{S, R}
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.List{U, O, B}, isOrdered, metadata) where {T, R, V<:CategoricalValue{T,R}, S, O, A, B, U}
    newdata = Arrow.List{T, O, B}(data.arrow, data.validity, data.offsets, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T,R}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Arrow.DictEncoding{V,S,A}(id, data::Arrow.Primitive{U, B}, isOrdered, metadata) where {T, R, V<:CategoricalValue{T,R}, S, A, B, U}
    newdata = Arrow.Primitive{T, B}(data.arrow, data.validity, data.data, data.ℓ, data.metadata)
    catdata = CategoricalVector{T,R}(newdata, levels=newdata)
    return Arrow.DictEncoding{V,S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Arrow.DictEncoding{Union{Missing,V},S,A}(id, data::Arrow.List{U, O, B}, isOrdered, metadata) where {T, R, V<:CategoricalValue{T,R}, S, O, A, B, U}
    newdata = Arrow.List{Union{Missing,T}, O, B}(data.arrow, data.validity, data.offsets, data.data, data.ℓ, data.metadata)
    levels = collect(skipmissing(newdata))
    catdata = CategoricalVector{Union{Missing,T},R}(newdata, levels=levels)
    return Arrow.DictEncoding{Union{Missing,V},S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end

function Arrow.DictEncoding{Union{Missing,V},S,A}(id, data::Arrow.Primitive{U, B}, isOrdered, metadata) where {T, R, V<:CategoricalValue{T,R}, S, A, B, U}
    newdata = Arrow.Primitive{Union{Missing,T}, B}(data.arrow, data.validity, data.data, data.ℓ, data.metadata)
    levels = collect(skipmissing(newdata))
    catdata = CategoricalVector{Union{Missing,T},R}(newdata, levels=levels)
    return Arrow.DictEncoding{Union{Missing,V},S,typeof(catdata)}(id, catdata, isOrdered, metadata)
end


function Base.copy(x::Arrow.DictEncoded{V}) where {T, R, V<:CategoricalValue{T, R}}
    pool = CategoricalArrays.CategoricalPool{T, R}(x.encoding.data)
    inds = x.indices
    refs = similar(inds, R)
    refs .= inds .+ one(R)
    return CategoricalVector{T}(refs, pool)
end

function Base.copy(x::Arrow.DictEncoded{Union{Missing, V}}) where {T, R, V<:CategoricalValue{T, R}}
    levels = collect(skipmissing(x.encoding.data))
    pool = CategoricalArrays.CategoricalPool{T, R}(levels)
    inds = x.indices
    refs = similar(inds, R)
    if ismissing(x.encoding.data[1])
        refs .= inds
    elseif ismissing(x.encoding.data[end])
        n = length(x.encoding.data) - 1
        refs .= ifelse.(inds .== n, zero(R), inds .+ one(R))
    else
        throw(ErrorException("not implemented"))
    end
    return CategoricalVector{Union{Missing,T}}(refs, pool)
end

EDIT: I’ve improved the implementation a bit

nikolays · February 10, 2025, 5:33pm

That is excellent and is working within Julia as expected. And R read it as well.

However, during the attempt to read it in Python:

import pyarrow
import pyarrow.ipc
import pandas as pd

# following is an error ("Categorical categories cannot be null")
with pyarrow.ipc.open_file("df.arrow") as reader:
    df2 = reader.read_pandas()

# can read plain arrow format
with pyarrow.ipc.open_file("df.arrow") as reader:
    batches = reader.get_batch(0)

I got “Categorical categories cannot be null”, which is particularly funny as pandas docs said “In contrast to R’s factor function, categorical data is not converting input values to strings”.

A little experimentation with pyarrow shows that for seamless pandas integration the arrow dict should be:

pyarrow.RecordBatch
A: dictionary<values=string, indices=int8, ordered=0>
----
A: -- dictionary:
["AA","Bb"]-- indices:
[null,null,1,null,0]

and not

pyarrow.RecordBatch
col1: dictionary<values=string, indices=int8, ordered=0> not null
col2: dictionary<values=string, indices=int8, ordered=0>
----
col1: -- dictionary:
["A","B","C"]-- indices:
[0,1,2,0,0,2,2]
col2: -- dictionary:
[null,"A","B","C"]-- indices:
[1,2,3,1,1,null,3]

That is, null should be kept out of the dictionary and marked in the indices instead.

nalimilan · February 11, 2025, 2:30pm

Interesting. AFAICT both representations are allowed (and pyarrow allows choosing which one you want). I don’t think Arrow.jl allows this currently though, it even has a hack to add missing to the pool for CategoricalArrays:

github.com/apache/arrow-julia, src/arraytypes/dictencoding.jl at c12899b97:

              # need to encode ourselves
              x = PooledArray(x; signed=true, compress=true)
              inds = DataAPI.refarray(x)
              pool = DataAPI.refpool(x)
          else
              pool = DataAPI.refpool(x)
              refa = DataAPI.refarray(x)
              inds = copyto!(similar(Vector{signedtype(length(pool))}, length(refa)), refa)
          end
          # horrible hack? yes. better than taking CategoricalArrays dependency? also yes.
          if typeof(pool).name.name == :CategoricalRefPool
              if eltype(x) >: Missing
                  pool = vcat(missing, DataAPI.levels(x))
              else
                  pool = DataAPI.levels(x)
                  for i = 1:length(inds)
                      @inbounds inds[i] -= 1
                  end
              end
          else
              # adjust to "offset" instead of index

AFAIK having missing in the dictionary is more efficient, at least in Julia it avoids allocating additional space to track missing values.

Could be worth filing an issue against pandas.
