Converting CSV to Parquet in Julia

Hi.

I have a simple dataframe that I want to convert to parquet. This is my attempt:

begin
	df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
	prq = Parquet.File(df)
end

But this is the error I’m getting:

MethodError: no method matching Parquet.File(::DataFrames.DataFrame)

Closest candidates are:

Parquet.File(::Any, !Matched::Any, !Matched::Any, !Matched::Any, !Matched::Any) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:54

Parquet.File(!Matched::String, !Matched::IOStream, !Matched::Parquet.PAR2.FileMetaData, !Matched::Parquet.Schema, !Matched::Parquet.PageLRU) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:54

Parquet.File(!Matched::AbstractString; map_logical_types) at /home/onur/.julia/packages/Parquet/h8mm5/src/reader.jl:61

How can I open a CSV file as Parquet?

Judging by the Parquet.jl README it should be

using CSV, DataFrames, Parquet

df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
file = tempname() * ".parquet"
write_parquet(file, df)

or any other filename of your choice.


And how can I view the file in Pluto like I would with Pandas in Jupyter?

I am not sure I understand the question, sorry. Parquet, CSV, Arrow and so on are just storage formats. It is possible to work with the on-disk representation directly, but that is usually low-level work and not how people normally handle data. Roughly speaking, the common workflow is to store data in one format or another, load it into memory, and transform it into a representation better suited to manipulation. When you are done, you store it again in whatever format you need.

A representation that is convenient for this kind of manipulation is a pandas DataFrame in Python, a data frame in R, or a DataFrame in Julia. You already did this in the first step, when you loaded the data from the CSV.


Can I view the file in Pluto? Is there a way to do that? That’s what I am curious about.

What do you mean by “viewing the file”? You can get its binary representation with read(file). Or you can load it with DataFrame(read_parquet(path)), but that should give you more or less the same DataFrame that you got from the CSV.read step.
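
To make the round trip concrete, here is a minimal sketch. The temporary directory and the tiny two-column table are made up for illustration and are not from your data. The stringtype = String keyword is an assumption on my side: recent CSV.jl versions read short strings as inline string types, which write_parquet may not accept, so I ask for plain String columns.

```julia
using CSV, DataFrames, Parquet

# Hypothetical scratch directory and file names, for illustration only.
dir = mktempdir()
csv_path = joinpath(dir, "temp.csv")
parquet_path = joinpath(dir, "temp.parquet")

# Create a small CSV so the example is self-contained.
CSV.write(csv_path, DataFrame(id = [1, 2, 3], name = ["a", "b", "c"]))

# CSV on disk -> DataFrame in memory (plain String columns, see note above).
df = CSV.read(csv_path, DataFrame; stringtype = String)

# DataFrame in memory -> Parquet on disk. The path comes first, then the table.
write_parquet(parquet_path, df)

# Parquet on disk -> DataFrame in memory. read_parquet takes only the path.
df2 = DataFrame(read_parquet(parquet_path))
```

Putting df2 in its own Pluto cell should then display the same table as df.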

Can I recreate the CSV file as Parquet in my working directory?

This is the error I got

MethodError: no method matching read_parquet(::String, ::DataFrames.DataFrame)

Just put the name of the DataFrame into a Pluto cell to view it in Pluto:

df

Yea but how can I confirm if the output of df is now Parquet and not CSV, as it used to be?

Viewing the DataFrame and writing to disk are completely separate topics.
CSV and Parquet are disk formats, DataFrames are in memory.
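
If the goal is to confirm that a file on disk really is Parquet, one rough check is to look at its magic bytes: an unencrypted Parquet file starts and ends with the four bytes PAR1. The helper name looks_like_parquet below is hypothetical, not part of Parquet.jl, and this is only a sanity check, not a full validation of the file.

```julia
# Hypothetical helper: true when the file starts and ends with the
# 4-byte Parquet magic "PAR1". Uses only Base, no packages.
function looks_like_parquet(path::AbstractString)
    magic = Vector{UInt8}("PAR1")
    isfile(path) || return false
    filesize(path) >= 8 || return false      # too small to hold both markers
    open(path, "r") do io
        head = read(io, 4)                   # first four bytes
        seek(io, filesize(path) - 4)
        tail = read(io, 4)                   # last four bytes
        head == magic && tail == magic
    end
end

looks_like_parquet("/home/onur/julia-assignment/temp.parquet")
```

A CSV file is plain text, so the same call on temp.csv should return false.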

Can I write this CSV file also as a Parquet file to my working directory? If so, how can I do that?

Have you tried the method described by @Skoffer ?

Yea. I don’t see a parquet file in my current directory.

Just change the definition of file to

file = "/home/onur/julia-assignment/temp.parquet"

Wait. Just changing the file extension automatically converts to parquet?

Obviously not. Changing the directory from /tmp (where tempname puts the file) to /home/onur/julia-assignment changes the location of the resulting file.
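
A small sketch of the difference, using only Base functions. The variable names tmp_file and cwd_file are just for illustration. I am assuming that pwd() is the directory you care about; in Pluto the working directory may differ from the notebook's folder, so printing it first is a good idea.

```julia
# tempname() returns a fresh path inside the system temp directory
# (typically /tmp on Linux), so the file will not appear in your project folder.
tmp_file = tempname() * ".parquet"

# Building the path from pwd() puts the file in the current working directory.
cwd_file = joinpath(pwd(), "temp.parquet")

println(dirname(tmp_file))   # the temp directory
println(dirname(cwd_file))   # the working directory
```

Either path can be passed as the first argument to write_parquet. As far as I know, on recent Julia versions paths from tempname() are also cleaned up when the process exits, which is another reason to use an explicit path for a file you want to keep.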

I get this

ArgumentError: "/home/onur/julia-assignment/temp.parquet" is not a valid file

using CSV, Parquet

df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
file = "/home/onur/julia-assignment/temp.parquet"
write_parquet(file, df)

Which line exactly is giving you this error? Can you show the complete output?

Nvm. I messed up on this line. It was my mistake. Thank you very much.
