Converting CSV to Parquet in Julia

oo92 · March 16, 2021, 7:13pm

Hi.

I have a simple dataframe that I want to convert to parquet. This is my attempt:

begin
	df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
	prq = Parquet.File(df)
end

But this is the error I’m getting:

MethodError: no method matching Parquet.File(::DataFrames.DataFrame)

Closest candidates are:

Parquet.File(::Any, !Matched::Any, !Matched::Any, !Matched::Any, !Matched::Any) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:54

Parquet.File(!Matched::String, !Matched::IOStream, !Matched::Parquet.PAR2.FileMetaData, !Matched::Parquet.Schema, !Matched::Parquet.PageLRU) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:54

Parquet.File(!Matched::AbstractString; map_logical_types) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:61

How can I open a CSV file as Parquet?

Skoffer · March 16, 2021, 7:31pm

Judging by the Parquet.jl README it should be

using CSV, DataFrames, Parquet

df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
file = tempname() * ".parquet"
write_parquet(file, df)

or any other filename of your choice.

oo92 · March 16, 2021, 7:35pm

And how can I view the file in Pluto like I would with Pandas in Jupyter?

Skoffer · March 16, 2021, 7:40pm

I am not sure I understand the question, sorry. Parquet, CSV, Arrow, and so on are just storage formats. I suppose it is possible to work with the stored representation directly, but that is usually low-level work and not how people normally work with data. Roughly speaking, the common way is to store data in one format or another, load it into memory, and transform it into a representation that is more suitable for data manipulation. After everything is done, you store it again in whatever format you need.

The representations that are convenient for this kind of manipulation are pandas in Python, data frames in R, and DataFrame in Julia. But you already did that in the first step, when you loaded the data from the CSV.

oo92 · March 16, 2021, 7:40pm

Can I view the file in Pluto? Is there a way to do that? That’s what I am curious about.

Skoffer · March 16, 2021, 7:43pm

What is “viewing the file”? You can get its binary representation with read(file). Or you can load it with DataFrame(read_parquet(path)), but that should give you more or less the same DataFrame that you got from the CSV.read step.
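As a rough sketch of that round trip (not tested against your setup): the column names and values below are made up for illustration, and it assumes a Parquet.jl version that provides both write_parquet and read_parquet.

```julia
using DataFrames, Parquet

# Made-up example data standing in for the DataFrame loaded from the CSV.
df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"])

# Write to a scratch directory; any writable path of your choice would do.
path = joinpath(mktempdir(), "example.parquet")
write_parquet(path, df)

# Read the Parquet file back into memory as a DataFrame.
df2 = DataFrame(read_parquet(path))

# Putting `df2` alone in a Pluto cell displays it like any other DataFrame.
println(size(df2))
```

The DataFrame you get back is an in-memory table again, so it looks the same in Pluto whether it originally came from the CSV or the Parquet file.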

oo92 · March 16, 2021, 7:44pm

Can I recreate the CSV file as Parquet in my working directory?

oo92 · March 16, 2021, 8:02pm

This is the error I got

MethodError: no method matching read_parquet(::String, ::DataFrames.DataFrame)

lungben · March 16, 2021, 8:15pm

Just put the name of the DataFrame into a Pluto cell to view it in Pluto:

df

oo92 · March 16, 2021, 8:16pm

Yea but how can I confirm if the output of df is now Parquet and not CSV, as it used to be?

lungben · March 16, 2021, 8:18pm

Viewing the DataFrame and writing to disk are completely separate topics.
CSV and Parquet are disk formats, DataFrames are in memory.
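If the question is whether a file on disk really is Parquet, one quick sanity check is to look at its magic bytes: Parquet files begin and end with the four bytes PAR1. A minimal sketch follows; looks_like_parquet is a hypothetical helper name, not part of Parquet.jl, and it checks only the magic bytes, not whether the rest of the file is valid.

```julia
# Hypothetical helper: checks only the leading and trailing magic bytes,
# not the full Parquet file structure.
function looks_like_parquet(path::AbstractString)
    magic = b"PAR1"
    # A file too short to hold both markers cannot be Parquet.
    filesize(path) < 2 * length(magic) && return false
    open(path, "r") do io
        head = read(io, length(magic))
        # Jump to the last four bytes of the file.
        seek(io, filesize(path) - length(magic))
        tail = read(io, length(magic))
        return head == magic && tail == magic
    end
end
```

Calling looks_like_parquet on the original CSV file should return false, while calling it on a file produced by write_parquet should return true.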

oo92 · March 16, 2021, 8:22pm

Can I write this CSV file also as a Parquet file to my working directory? If so, how can I do that?

lungben · March 16, 2021, 8:24pm

Have you tried the method described by @Skoffer ?

oo92 · March 16, 2021, 8:25pm

Yea. I don’t see a parquet file in my current directory.

Skoffer · March 16, 2021, 8:31pm

Just change file definition to

file = "/home/onur/julia-assignment/temp.parquet"

oo92 · March 16, 2021, 8:33pm

Wait. Just changing the file extension automatically converts to parquet?

Skoffer · March 16, 2021, 8:34pm

Obviously not. Changing the directory from /tmp (where tempname puts the file) to /home/onur/julia-assignment changes the location of the resulting file.
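To see where each variant would put the file, something like the following sketch can help. It uses only Base functions; the name temp.parquet is just the example file name from this thread.

```julia
# tempname() generates a fresh path inside the system temp directory,
# so a file written there will not show up in your project folder.
file = tempname() * ".parquet"
println("temp location:    ", dirname(file))

# Building the path from the working directory instead puts the
# output next to whatever you are currently working on.
target = joinpath(pwd(), "temp.parquet")
println("working location: ", target)
```

Either path can be passed to write_parquet; only the location of the resulting file differs.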

oo92 · March 16, 2021, 8:36pm

I get this

ArgumentError: "/home/onur/julia-assignment/temp.parquet" is not a valid file

Skoffer · March 16, 2021, 8:37pm

using CSV, DataFrames, Parquet

df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
file = "/home/onur/julia-assignment/temp.parquet"
write_parquet(file, df)

Which line exactly is giving you this error? Can you show the complete output?

oo92 · March 16, 2021, 8:39pm

Nvm. I messed up on this line. It was my mistake. Thank you very much.
