Write_parquet with non-standard types

freeman · July 10, 2020, 11:35pm

Suppose I have a vector of MyStruct, the fields of which are all “simple” Julia types, and I want to save it to a parquet file.

using Parquet  # provides write_parquet

struct MyStruct
    a::Float64
    b::Float64
end

x = [MyStruct(x,y) for (x,y) in zip(randn(100),randn(100))]

write_parquet("/home/myuser/test.parquet", x)

This works.

Now suppose I want a slightly trickier example,

using Dates  # DateTime and datetime2unix come from the Dates standard library

@enum MyEnum UP DOWN

struct MyStruct2
    a::DateTime
    b::MyEnum
end

x = [MyStruct2(DateTime(2020,6,1,10,10,10),UP), MyStruct2(DateTime(2020,6,1,10,10,11),DOWN)]

write_parquet("/home/myuser/x.parquet", x)

Now I get:

ERROR: "Column whose `eltype` is DateTime is not supported at this stage. \nColumn whose `eltype` is MyEnum is not supported at this stage. \n"
Stacktrace:
 [1] write_parquet(::String, ::Array{MyStruct2,1}; compression_codec::String) at /home/myuser/.julia/packages/Parquet/2HfNB/src/writer.jl:479
 [2] write_parquet(::String, ::Array{MyStruct2,1}) at /home/myuser/.julia/packages/Parquet/2HfNB/src/writer.jl:464
 [3] top-level scope at /home/myuser/scratch.jl:76

DateTime can be converted to Float64 and Enum to Int64. What is the correct way of telling Parquet.jl to perform those conversions? I briefly looked at the code and it seems there’s no way of doing this, but it’s possible I’m missing something?

One possibility which works is manually doing the conversions:

mapped_x = map(y->(datetime2unix(y.a),Int64(y.b)), x)
write_parquet("/home/myuser/x.parquet", mapped_x)

One annoyance is that this doesn’t take advantage of the Tables interface as thoroughly as it could; in particular, mapping to plain tuples loses the column names.
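A possible way to keep the column names while still converting by hand is to build a NamedTuple of vectors, which Tables.jl treats as a column table. This is only a sketch: it assumes `write_parquet` accepts such a table, so that call is left commented out, and the variable name `cols` is just illustrative.

```julia
using Dates

@enum MyEnum UP DOWN

struct MyStruct2
    a::DateTime
    b::MyEnum
end

x = [MyStruct2(DateTime(2020,6,1,10,10,10), UP),
     MyStruct2(DateTime(2020,6,1,10,10,11), DOWN)]

# Convert each field to a type the writer already supports,
# keeping the original field names as column names.
cols = (
    a = datetime2unix.(getfield.(x, :a)),  # DateTime -> Float64 (seconds since the Unix epoch)
    b = Int64.(getfield.(x, :b)),          # Enum -> Int64
)

# Assumption: write_parquet accepts a NamedTuple-of-vectors column table.
# using Parquet
# write_parquet("/home/myuser/x.parquet", cols)
```

Reading the file back would still give Float64 and Int64 columns, so the reverse conversions (`unix2datetime`, `MyEnum`) would have to be applied manually as well.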

So the first question is: Does Parquet.jl have an option for type conversion? If not, the second question is: is there a fundamental reason for that, or is it just a matter of someone implementing it? If the latter I wouldn’t mind giving it a go, perhaps with some guidance from xiaodai?

xiaodai · July 11, 2020, 1:54am

I didn’t get notified! @xiaodai or @ evalparse would be better.

Anyway, in general I haven’t heard many good things about parquet’s datetime. I think by default it loses the time zone info if any.

In theory you can write a custom encoder/decoder and handle arbitrary types, but the resulting file is unlikely to be readable by others.

We don’t have a good story there, just as Python and R don’t have a good story for saving custom types in Parquet.

If you want to have a crack, I suggest you read up on Parquet’s type system (hard to find, I know). Start with the apache/parquet-format repository on GitHub, and then read https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

You can try to get a writer for it going. It doesn’t sound like it’s for work, but if a company or someone is willing to sponsor, I can prioritise this work too. DM for details.

freeman · July 12, 2020, 9:44am

I thought that this would just be a matter of adding to the Parquet writer a map_logical_types argument, in the same way as the reader does[1], and no need to touch Parquet’s types. Is this not the case?
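To make the idea concrete, here is a rough sketch of what such a per-column mapping could look like on the write side. Everything in it is hypothetical: `convert_columns` and the `converters` dictionary are illustrative names, not part of Parquet.jl, and the final `write_parquet` call assumes the writer accepts a NamedTuple of vectors.

```julia
using Dates

@enum MyEnum UP DOWN

struct MyStruct2
    a::DateTime
    b::MyEnum
end

# Hypothetical helper (not a Parquet.jl function): apply a per-field
# conversion to a vector of structs and return a NamedTuple of column vectors.
# Fields without an entry in `converters` are passed through unchanged.
function convert_columns(rows::AbstractVector{T}, converters::AbstractDict{Symbol}) where {T}
    names = fieldnames(T)
    cols = map(names) do name
        f = get(converters, name, identity)
        [f(getfield(r, name)) for r in rows]
    end
    return NamedTuple{names}(cols)
end

x = [MyStruct2(DateTime(2020,6,1,10,10,10), UP),
     MyStruct2(DateTime(2020,6,1,10,10,11), DOWN)]

table = convert_columns(x, Dict(:a => datetime2unix, :b => Int64))

# Assumption: write_parquet accepts this column table.
# using Parquet
# write_parquet("/home/myuser/x.parquet", table)
```

A keyword argument on the writer could wrap the same logic, but whether that fits Parquet.jl’s design is exactly the open question here.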

[1] https://github.com/JuliaIO/Parquet.jl/blob/d0f8be90e426349c700f7ab20957cd60b5e5d2b6/src/reader.jl#L33
