How exactly does Arrow.DictEncode() work? I’ve tried using it to signal that a DataFrame column should be dictionary encoded when writing to a .arrow file, but I can’t see any indication that it has an effect. I’ve looked through the documentation, but it’s fairly sparse right now. Here’s an MWE:
using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
Arrow.DictEncode(df.b)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")
The resulting table looks like this:
julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
:a Int64
:b String
and the type does not appear to have changed:
julia> typeof(Arrow.DictEncode(tbl.b))
Arrow.DictEncode{String,Arrow.List{String,Int32,Array{UInt8,1}}}
julia> typeof(tbl.b)
Arrow.List{String,Int32,Array{UInt8,1}}
Any help would be much appreciated!
I was able to work this out: the column itself has to be wrapped in Arrow.DictEncode before the file is written. Arrow.DictEncode(df.b) returns a new wrapper and leaves df.b unchanged. My MWE never assigned that wrapper back to df.b, so the df used to write the Arrow table was not dict encoded.
julia> typeof(df.b)
Array{String,1}
julia> df.b = Arrow.DictEncode(df.b);
julia> typeof(df.b)
Arrow.DictEncode{String,Array{String,1}}
The dict encoding works now:
julia> Arrow.write("test.arrow", df)
"test.arrow"
julia> tbl = Arrow.Table("test.arrow")
Arrow.Table with 100 rows, 2 columns, and schema:
:a Int64
:b String
julia> Arrow.isdictencoded(tbl.b)
true
This can easily be extended to multiple columns using the transform! function:
using Arrow, DataFrames
df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)
df.c = df.b
toencode = [:b, :c]
transform!(df, toencode .=> Arrow.DictEncode; renamecols=false)
Arrow.write("test.arrow", df)
tbl = Arrow.Table("test.arrow")
Only the chosen columns are dict encoded:
julia> map(Arrow.isdictencoded, tbl)
3-element Array{Bool,1}:
0
1
1
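If every column should be dictionary encoded, wrapping each one individually may not be necessary. The following is a minimal sketch, assuming the installed Arrow.jl version supports the dictencode keyword argument of Arrow.write; the file name test_all.arrow and the temporary directory are only illustrative:

```julia
using Arrow, DataFrames

df = repeat(DataFrame(a=1:10, b=["Very long string $i" for i in 1:10]), 10)

# Assumption: Arrow.write accepts dictencode=true, which requests
# dictionary encoding for all columns instead of wrapping them one by one.
path = joinpath(mktempdir(), "test_all.arrow")
Arrow.write(path, df; dictencode=true)

tbl = Arrow.Table(path)

# Check the string column, which is the one that benefits most here.
b_encoded = Arrow.isdictencoded(tbl.b)
```

Wrapping individual columns in Arrow.DictEncode remains the better fit when only a few repetitive columns should be encoded.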