Does Arrow.jl support enums?

I’m surprised that the code below doesn’t work.

using Arrow

@enum MyEnum LALA1 LALA2 LALA3
x = [LALA1, LALA1, LALA3, LALA2]
table = (cols1=x,)
io = IOBuffer()
Arrow.write(io, table)
ERROR: MethodError: no method matching arrowtype(::Arrow.FlatBuffers.Builder, ::Type{MyEnum})
Closest candidates are:
  arrowtype(::Any, ::Union{Arrow.DenseUnion{S, Arrow.UnionT{T, typeIds, U}}, Arrow.SparseUnion{S, Arrow.UnionT{T, typeIds, U}}}) where {S, T, typeIds, U} at ~/.julia/packages/Arrow/ZlMFU/src/eltypes.jl:484

It’s possible that enums by default require some interface method defined on them, but searching the docs for “enum” returns nothing.

Ok so this does something

ArrowTypes.ArrowKind(::Type{MyEnum}) = ArrowTypes.DictEncodedKind
Arrow.write(io, table)  # IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=256, maxsize=Inf, ptr=257, mark=-1)

However, I still can’t read it back

seekstart(io)
table2 = Arrow.Table(io)  # ERROR: type Nothing has no field fields

I also can’t make sense of the data that was actually written to the buffer

reinterpret(Int8, io.data)

I expected that the first 4 elements of this would be [0, 0, 2, 1], but it’s all -1.
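One likely explanation for the leading -1 values, assuming `Arrow.write` to an `IO` produces the Arrow IPC streaming format (its default when no `file=true` is passed): each message in that format begins with a 4-byte continuation marker `0xFFFFFFFF`, followed by metadata, and the column values only appear later in a body buffer. Reinterpreted as `Int8`, those marker bytes read as -1. The sketch below uses a plain `Int32` column rather than the enum so that it does not depend on any custom `ArrowTypes` methods.

```julia
using Arrow

io = IOBuffer()
Arrow.write(io, (cols1 = Int32[0, 0, 2, 1],))

# take! returns only the bytes that were written; io.data can be larger than that.
bytes = take!(io)

# The stream starts with the continuation marker, not with the column data.
first4 = bytes[1:4]
as_int8 = collect(reinterpret(Int8, first4))

println(first4)
println(as_int8)
```

The position of the actual values depends on the length of the schema and record batch metadata, so reading raw bytes at a fixed offset is not a reliable way to check what was written.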

Enums are just integers, so just use the values:

julia> @enum Fruit a b c

julia> Integer.([a,b,c])
3-element Vector{Int32}:
 0
 1
 2

# when you read back from Arrow, re-make the enums
julia> Fruit.(Integer.([a,b,c]))
3-element Vector{Fruit}:
 a::Fruit = 0
 b::Fruit = 1
 c::Fruit = 2
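A minimal sketch of that manual round trip through Arrow, assuming the column is stored as plain `Int32` codes and converted back by hand after reading. The column name `fruit` and the variable names are made up for illustration.

```julia
using Arrow

@enum Fruit a b c

fruits = [a, a, c, b]

# Write the underlying integer codes instead of the enum values.
io = IOBuffer()
Arrow.write(io, (fruit = Integer.(fruits),))

# Read the table back; the column comes back as integers.
seekstart(io)
tbl = Arrow.Table(io)

# Rebuild the enum values from the integer codes.
restored = Fruit.(tbl.fruit)
```

The drawback is that nothing in the Arrow data records that the integers stand for `Fruit` values, so every reader has to know to apply the conversion.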

I understand that, but I don’t want to do that. I want to do Arrow.Table(io) and get my objects properly deserialized. Of course I can cast manually, but that’s not what I want.

Here’s another attempt

@enum MyEnum LALA1 LALA2 LALA3
ArrowTypes.arrowname(::Type{MyEnum}) = :MyEnum
ArrowTypes.JuliaType(::Val{:MyEnum}) = MyEnum
ArrowTypes.ArrowKind(::Type{MyEnum}) = ArrowTypes.PrimitiveKind
ArrowTypes.toarrow(::Type{MyEnum}) = Int
ArrowTypes.fromarrow(::Type{MyEnum}, x) = MyEnum(x)  # shouldn't be necessary as this is the default
x = [LALA1, LALA1, LALA3, LALA2]
table = (cols1=x,)
io = IOBuffer()
Arrow.write(io, table)
seekstart(io)
table2 = Arrow.Table(io)  # correctly reports 4 rows
table2.cols1

The last line returns

ERROR: MethodError: no method matching MyEnum()
Closest candidates are:
  MyEnum(::Integer) at Enums.jl:197

I don’t understand why it’s calling MyEnum(). I expected it to call MyEnum(x), where x is an Int it reads from the buffer.

Then you get what you get.

How do you propose that Arrow.jl will know that the file contains integers that refer to a particular Enum and not just integers that refer to numbers?

https://arrow.apache.org/docs/python/api/datatypes.html

seems to be the data types Arrow understands.

using Arrow

@enum MyEnum LALA1 LALA2 LALA3

ArrowTypes.arrowname(::Type{MyEnum}) = :MyEnum
ArrowTypes.JuliaType(::Val{:MyEnum}) = MyEnum
ArrowTypes.ArrowKind(::Type{MyEnum}) = ArrowTypes.PrimitiveKind
ArrowTypes.ArrowType(::Type{MyEnum}) = Int32

x = [LALA1, LALA1, LALA3, LALA2]
table = (cols1=x,)
io = IOBuffer()
Arrow.write(io, table)
seekstart(io)
table2 = Arrow.Table(io)  # correctly reports 4 rows
table2.cols1

ArrowTypes.ArrowType seems to have been the missing piece; the above works, returning:

4-element Arrow.Primitive{Main.AE.MyEnum, Vector{Int32}}:
 LALA1::MyEnum = 0
 LALA1::MyEnum = 0
 LALA3::MyEnum = 2
 LALA2::MyEnum = 1

On the whole, the interface is not super intuitive. The docs don’t say what valid ArrowTypes.ArrowType values are, among other things. They mention "natively supported arrow type"s, but from the linked Apache Arrow documentation it’s not easy to find what those are either.

They do say

This stuff can definitely make your eyes glaze over if you stare at it long enough. As always, don’t hesitate to reach out for quick questions on the #data slack channel, or open a new issue detailing what you’re trying to do.

so it seems a good idea to actually open an issue to ask about the best way to serialize Julia Enums with Arrow (just in case the above is missing something), and to also ask for better clarification of this part of the documentation.

Thank you, this is exactly what I was looking for.

Yes, 100% agree. I found the docs quite dense. And it’s surprising that such a simple case as enums isn’t given as an example, or even implemented by default.

Yeah, we could probably add default support for Enums, which would basically be the solution by @digital_carver, but for any Enum subtype. If someone is up for making a PR with a couple of tests, I’d appreciate it! Otherwise, if someone wants to open an issue, I can try to get to it soon.
