Using Arrow.DictEncode()

How exactly does Arrow.DictEncode() work? I’ve tried using it to signal that a DataFrame column should be dictionary encoded when writing to a .arrow file, but I can’t see any indication that it works. I’ve looked through the documentation, but it’s fairly sparse right now. Here’s an MWE:

using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
Arrow.DictEncode(df.b)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")

The resulting table looks like this:

julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
 :a  Int64
 :b  String

and the type does not appear to have changed:

julia> typeof(Arrow.DictEncode(tbl.b))
Arrow.DictEncode{String,Arrow.List{String,Int32,Array{UInt8,1}}}

julia> typeof(tbl.b)
Arrow.List{String,Int32,Array{UInt8,1}}

Any help would be much appreciated!

I was able to work this out: the column itself has to be wrapped in DictEncode before the file is written. My MWE called Arrow.DictEncode(df.b) but discarded the result instead of assigning it back to df.b, so the df passed to Arrow.write still held the plain, unwrapped column and nothing was dict encoded.

julia> typeof(df.b)
Array{String,1}

julia> df.b = Arrow.DictEncode(df.b);

julia> typeof(df.b)
Arrow.DictEncode{String,Array{String,1}}

The dict encoding works now:

julia> Arrow.write("test.arrow", df)
"test.arrow"

julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
 :a  Int64
 :b  String

julia> Arrow.isdictencoded(tbl.b)
true
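
If you would rather not mutate df, the wrapping can also be done only in the table handed to Arrow.write. A minimal sketch, assuming Arrow.write accepts any Tables.jl-compatible source such as a NamedTuple of columns; the file name test_nt.arrow and the variable cols are placeholders I made up for illustration:

```julia
using Arrow, DataFrames

df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)

# Build a column table (NamedTuple of vectors) and wrap only :b.
# df itself is left untouched.
cols = (a = df.a, b = Arrow.DictEncode(df.b))
Arrow.write("test_nt.arrow", cols)

tbl = Arrow.Table("test_nt.arrow")

# Check whether :b was written dictionary encoded
Arrow.isdictencoded(tbl.b)

# The original DataFrame column is still a plain Vector{String}
typeof(df.b)
```

The point is the same as above: what matters is that the column object passed to Arrow.write is the DictEncode wrapper, not where that wrapper is stored.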

This can easily be extended to multiple columns using the transform! function:

using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
df.c = df.b
toencode = [:b, :c]
transform!(df, toencode .=> Arrow.DictEncode; renamecols=false)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")

Only the chosen columns are DictEncoded:

julia> map(Arrow.isdictencoded, tbl)
3-element Array{Bool,1}:
 0
 1
 1
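
For the opposite case, where every column should be dictionary encoded, wrapping each column individually should not be necessary. As far as I can tell from the Arrow.jl documentation, Arrow.write takes a dictencode keyword for this. Treat the sketch below as an assumption to check against the docstring in your installed version; the file name test_all.arrow is a placeholder:

```julia
using Arrow, DataFrames

df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)

# dictencode=true asks Arrow.write to dictionary encode all columns
# (assumed keyword; verify with `?Arrow.write` in your Arrow.jl version)
Arrow.write("test_all.arrow", df; dictencode=true)

tbl = Arrow.Table("test_all.arrow")

# Inspect the columns one by one
(Arrow.isdictencoded(tbl.a), Arrow.isdictencoded(tbl.b))
```

Wrapping individual columns in Arrow.DictEncode remains the way to go when only some columns have enough repeated values to benefit.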