Suppose I have a vector of MyStruct, the fields of which are all “simple” Julia types, and I want to save it to a parquet file.
using Parquet

struct MyStruct
    a::Float64
    b::Float64
end

x = [MyStruct(x, y) for (x, y) in zip(randn(100), randn(100))]
write_parquet("/home/myuser/test.parquet", x)
This works.
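For completeness, the round trip can be checked by reading the file back (a quick sketch, assuming your Parquet.jl version provides read_parquet and that DataFrames is installed; the column names a and b come through via the Tables interface):

using Parquet, DataFrames
tbl = read_parquet("/home/myuser/test.parquet")  # Tables.jl-compatible source
df = DataFrame(tbl)                              # columns :a and :b, both Float64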
Now suppose I want a slightly trickier example:
using Dates

@enum MyEnum UP DOWN

struct MyStruct2
    a::DateTime
    b::MyEnum
end

x = [MyStruct2(DateTime(2020,6,1,10,10,10), UP), MyStruct2(DateTime(2020,6,1,10,10,11), DOWN)]
write_parquet("/home/myuser/x.parquet", x)
Now I get:
ERROR: "Column whose `eltype` is DateTime is not supported at this stage. \nColumn whose `eltype` is MyEnum is not supported at this stage. \n"
Stacktrace:
[1] write_parquet(::String, ::Array{MyStruct2,1}; compression_codec::String) at /home/myuser/.julia/packages/Parquet/2HfNB/src/writer.jl:479
[2] write_parquet(::String, ::Array{MyStruct2,1}) at /home/myuser/.julia/packages/Parquet/2HfNB/src/writer.jl:464
[3] top-level scope at /home/myuser/scratch.jl:76
DateTime can be converted to Float64, and Enum to Int64. What is the correct way of telling Parquet.jl to perform those conversions? I briefly looked at the code and it seems there's no way of doing this, but it's possible I'm missing something?
One possibility that works is to do the conversions manually:
mapped_x = map(y -> (datetime2unix(y.a), Int64(y.b)), x)
write_parquet("/home/myuser/x.parquet", mapped_x)
One annoyance of this is that it doesn't take advantage of the Tables interface as thoroughly as it could; in particular, it loses the column names.
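If one sticks with this manual approach, the names can be kept by mapping to NamedTuples instead of plain tuples (a sketch, under the assumption that write_parquet accepts a vector of NamedTuples as a Tables.jl row table):

using Dates
# Each element becomes a named row, so the columns are written as `a` and `b`.
mapped_x = map(y -> (a = datetime2unix(y.a), b = Int64(y.b)), x)
write_parquet("/home/myuser/x.parquet", mapped_x)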
So the first question is: does Parquet.jl have an option for type conversion? If not, the second question is: is there a fundamental reason for that, or is it just a matter of someone implementing it? If the latter, I wouldn't mind giving it a go, perhaps with some guidance from xiaodai?
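In the meantime, here is a rough sketch of what such a conversion layer could look like outside the package, written against the Tables interface so the column names survive (write_parquet_converted and to_parquet_friendly are hypothetical names for illustration, and this assumes the input is accepted by Tables.columntable, as it seems to be by write_parquet):

using Dates, Tables, Parquet

# Hypothetical helpers: column-wise conversions for types Parquet.jl cannot write directly.
to_parquet_friendly(v::AbstractVector{DateTime}) = datetime2unix.(v)   # DateTime -> Float64
to_parquet_friendly(v::AbstractVector{<:Enum})   = Int64.(v)           # Enum -> Int64
to_parquet_friendly(v) = v                                             # everything else unchanged

function write_parquet_converted(path, table)
    cols = Tables.columntable(table)        # NamedTuple of column vectors, names preserved
    write_parquet(path, map(to_parquet_friendly, cols))
end

write_parquet_converted("/home/myuser/x.parquet", x)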