How to write in .parquet (or any compressed extension)

I’m running simulations whose output can be around 1 GB to 10 GB each. That’s not very big, but we are probably going to run lots of simulations. Right now we are saving the output as .txt, but I’d like to write it in a better format, and I thought about .parq. I’ve seen Parquet.jl, but I don’t think I understood how to write with it. Here is how I am saving my files today:

using DelimitedFiles

aux1=10
aux2=20
aux3=30
header = [
    "aux1 = "*string(aux1)*"; This is aux1 in units of a.u.1"
    "aux2 = "*string(aux2)*"; This is aux2 in units of a.u.2"
    "aux3 = "*string(aux3)*"; This is aux3 in units of a.u.3"
]

filename = "testing_file"
format = ".txt"
delimiter = ','

output1_vec = collect(1:1:100)
output2_vec = rand(Complex{Float64}, 100)
output3_vec = rand(100)*1e21
output4_vec = rand(100)*1e2
data = [output1_vec output2_vec output3_vec output4_vec] # in matrix form
data_label = ["vec1 [units of au1]", "vec2 [units of au2]", "vec3 [units of au3]", "vec4 [units of au4]"] # in vector form

#-----------------------

open(filename*format; write=true) do f
    # metadata header, one entry per line
    for line in header
        println(f, line)
    end
    println(f, "----------------------------------------")
    # column labels on a single line, separated by the delimiter
    println(f, join(data_label, delimiter))
    writedlm(f, data, delimiter)
end

This is basically a header followed by four columns of complex numbers. Is there any chance to put this in a more compressed format?

Can your data be loaded as a DataFrame?

The parquet writer in Parquet.jl isn’t very good. I know because I wrote it.

But CSV.jl is OK, and JDF.jl is also OK for your data size, I think.

For Parquet.jl, just read the section on writing files. I think it’s something like:

using Parquet

# df is the DataFrame (or other Tables.jl-compatible table) to save
write_parquet("path/tofile.parquet", df)
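For the data in the question, a minimal sketch could look like the following. It assumes the complex column has to be split into real and imaginary parts before writing, since the Parquet format has no native complex type. The column names and the temporary output path are made up for illustration.

```julia
using DataFrames, Parquet

# Same kind of data as in the question
output1_vec = collect(1:100)
output2_vec = rand(ComplexF64, 100)
output3_vec = rand(100) * 1e21
output4_vec = rand(100) * 1e2

# Assumption: store the complex values as two Float64 columns,
# because Parquet has no native complex type
df = DataFrame(
    vec1 = output1_vec,
    vec2_re = real.(output2_vec),
    vec2_im = imag.(output2_vec),
    vec3 = output3_vec,
    vec4 = output4_vec,
)

# Hypothetical output location, only for this example
path = joinpath(mktempdir(), "testing_file.parquet")
write_parquet(path, df)

# Read the file back and rebuild the complex column
df2 = DataFrame(read_parquet(path))
vec2 = complex.(df2.vec2_re, df2.vec2_im)
```

Keeping the real and imaginary parts as separate columns also means other tools that read Parquet can open the file without knowing anything about Julia types.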

After putting my data in a DataFrame, I could save it as .parquet with no problem, thanks a lot! But why isn’t it very good? I’m looking for a compact format; writing speed won’t be a problem for now.
The only question I still have is how to put a header in my file, like the one I showed in the example.

Too lazy to try and understand your example, but maybe look up the rename! function in DataFrames.jl. I suspect that is what you need.
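As a rough illustration of what rename! does, here is a small sketch that replaces the column names of a DataFrame with the labelled names from the question. The tiny example table is made up. Note that this only covers the column labels, not the aux1/aux2/aux3 lines of the original header.

```julia
using DataFrames

# Small made-up table with placeholder column names
df = DataFrame(a = 1:3, b = rand(3), c = rand(3), d = rand(3))

# Labels taken from the question
data_label = ["vec1 [units of au1]", "vec2 [units of au2]",
              "vec3 [units of au3]", "vec4 [units of au4]"]

# rename! with a vector of strings replaces all column names in order
rename!(df, data_label)

println(names(df))
```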

It’s not very optimized, so it could be slow, and it doesn’t support everything, e.g. datetime. I think someone is writing a Parquet2.jl.

Ok. Parquet.jl is good as it has compression. JDF.jl also does compression.

Check out this article for comparisons of various formats: https://www.extremerisk.org/blog/computations/julia-and-saved-files/index.html

If you need long-term stability, Parquet.jl is OK. JDF.jl is not yet stable for the long term, but I use it since it’s quite fast and stable for my use case.
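For comparison, saving and loading with JDF.jl looks roughly like this. It is only a sketch: the table and the temporary output path are made up, and it uses plain Float64 and Int columns rather than the complex data from the question.

```julia
using DataFrames, JDF

# Made-up table with plain numeric columns
df = DataFrame(vec1 = collect(1:100), vec3 = rand(100) * 1e21, vec4 = rand(100) * 1e2)

# Hypothetical output location; JDF writes a folder, not a single file
path = joinpath(mktempdir(), "testing_file.jdf")
JDF.save(path, df)

# Load it back into a DataFrame
df2 = DataFrame(JDF.load(path))
```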