Writing Parquet files

Late reply, but pyarrow can be used to read/write Parquet files in Python without having Spark at hand. It only needs pandas (at least, that's the combination I've used). Plus, Parquet is quite a nice file format, even for moderate datasets, say a few GB.

That’s controversial.

In what way? It’s a column-based storage format that supports compression and things like explicit nulls. It has its use cases.

What would it take to support Date columns?

julia> df = DataFrame(date = Date(now()), x = 1:1000, y = rand(1000));

julia> write_parquet("test.parquet", df)
ERROR: "Column whose `eltype` is Date is not supported at this stage. \n"
Stacktrace:
 [1] write_parquet(::String, ::DataFrame; compression_codec::String) at /home/tkwong/.julia/packages/Parquet/2HfNB/src/writer.jl:479
 [2] write_parquet(::String, ::DataFrame) at /home/tkwong/.julia/packages/Parquet/2HfNB/src/writer.jl:464

No native categorical support.

Hmm, I think we’d have to revive the 96-byte (INT96) timestamp encoding.

Interestingly, DateTime works.

I could not make this work:

julia> using DataFrames, Dates, Parquet

julia> df = DataFrame(date = DateTime(now()), x = 1:1000, y = rand(1000));

julia> write_parquet("test.parquet", df)
ERROR: "Column whose `eltype` is DateTime is not supported at this stage. \n"

I thought someone merged a DateTime fix. Try the master branch.