Using Arrow.DictEncode()

jbshannon · June 29, 2021, 10:20pm

How exactly does the Arrow.DictEncode() method work? I’ve tried using it to signal that a DataFrame column should be dictionary encoded when writing to a .arrow file, but I can’t see any indication that it has any effect. I’ve looked through the documentation, but it’s fairly sparse right now. Here’s a MWE:

using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
Arrow.DictEncode(df.b)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")

The resulting table looks like this:

julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
 :a  Int64
 :b  String

and the type does not appear to have changed:

julia> typeof(Arrow.DictEncode(tbl.b))
Arrow.DictEncode{String,Arrow.List{String,Int32,Array{UInt8,1}}}

julia> typeof(tbl.b)
Arrow.List{String,Int32,Array{UInt8,1}}

Any help would be much appreciated!

jbshannon · June 30, 2021, 10:45pm

I was able to work this out: the column itself has to be replaced with its DictEncode-wrapped version before the file is written. Arrow.DictEncode(df.b) returns a new wrapper and leaves df untouched, so my MWE, which never assigned the result back to df.b, wrote a table that was not dict encoded.

julia> typeof(df.b)
Array{String,1}

julia> df.b = Arrow.DictEncode(df.b);

julia> typeof(df.b)
Arrow.DictEncode{String,Array{String,1}}

The dict encoding works now:

julia> Arrow.write("test.arrow", df)
"test.arrow"

julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
 :a  Int64
 :b  String

julia> Arrow.isdictencoded(tbl.b)
true
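If every column should be dict encoded, wrapping each one may not be necessary. As far as I can tell, Arrow.write also accepts a dictencode keyword that applies dictionary encoding to all columns at write time. The sketch below assumes that keyword is available in the installed Arrow.jl version; the file name test_all.arrow is just a placeholder.

```julia
using Arrow, DataFrames

df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)

# Assumption: the installed Arrow.jl supports the `dictencode` keyword,
# which requests dictionary encoding for all columns at once.
Arrow.write("test_all.arrow", df; dictencode=true)

tbl_all = Arrow.Table("test_all.arrow")

# Check the string column in the table that was read back
Arrow.isdictencoded(tbl_all.b)
```

Wrapping individual columns in Arrow.DictEncode is still the way to go when only some columns should be encoded.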

This can easily be extended to multiple columns using the transform! function:

using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
df.c = df.b
toencode = [:b, :c]
transform!(df, toencode .=> Arrow.DictEncode; renamecols=false)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")

Only the chosen columns are DictEncoded:

julia> map(Arrow.isdictencoded, tbl)
3-element Array{Bool,1}:
 0
 1
 1
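To see whether the encoding is worth it for a given column, one option is to write the same data with and without DictEncode and compare the file sizes. This is only a rough sketch: it uses a plain NamedTuple of vectors as the table (so the original data is not mutated), and the file names plain.arrow and encoded.arrow are placeholders.

```julia
using Arrow

# Same data as in the MWE, built as plain vectors
a = repeat(1:10, 10)
b = repeat(["Very long string $i" for i in 1:10], 10)

dir = mktempdir()
plain_path = joinpath(dir, "plain.arrow")
encoded_path = joinpath(dir, "encoded.arrow")

# A NamedTuple of vectors is a valid Tables.jl column table
Arrow.write(plain_path, (a = a, b = b))

# Wrap only the string column; `b` itself is left unchanged
Arrow.write(encoded_path, (a = a, b = Arrow.DictEncode(b)))

plain_size = filesize(plain_path)
encoded_size = filesize(encoded_path)
println("plain: $plain_size bytes, dict encoded: $encoded_size bytes")
```

The difference depends on how many distinct values the column has relative to its length, so it is worth checking on real data rather than on this toy example.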
