This is the third time you’ve asked a question stemming from the same confusion between the storage format of data on disk and the in-memory representation of that data once it has been read in, so let me try to clarify again:
When you do `df = CSV.read("file.csv", DataFrame)`, CSV.jl reads the data stored in the file `file.csv` on your hard drive and turns it into a `DataFrame` object, which is stored in the RAM of your computer (and bound to the variable `df`).
When you do `file = "/home/onur/julia-assignment/temp.parquet"`, you are just creating a variable called `file`, which references a string:

```julia
julia> file = "/home/onur/julia-assignment/temp.parquet"
"/home/onur/julia-assignment/temp.parquet"
```
When you then do `Arrow.write(file)`, you’re just calling Arrow’s `write` function on a string, not on any actual data. `write_parquet(file, df)` will actually write your data to the specified file path in Parquet format. However, you then do:

```julia
dates = names(table)[5:end]
```
which doesn’t actually involve your data - you assigned `table = Arrow.write(file)` above, so `table` is actually an anonymous function (though your `MethodError` suggests `table` is a string rather than a function, so maybe you’ve assigned it differently elsewhere?)
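For reference, a minimal Arrow round trip with the data passed explicitly might look like this (using a small toy `DataFrame` and a made-up file name, not your actual data):

```julia
using Arrow, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

file = "temp.arrow"
Arrow.write(file, df)                  # data goes in as the second argument
table = DataFrame(Arrow.Table(file))   # read it back into a DataFrame

cols = names(table)                    # column names as a vector of strings
```

With the data actually written, `names(table)` then does what you expect.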
In any case, the main point remains: you should just perform your statistics and data analysis once you’ve read the data into a `DataFrame`; there is little point in doing what it seems you are suggesting:
```julia
using DataFrames, CSV, Parquet, Arrow
df = CSV.read("myfile.csv", DataFrame)
write_parquet("myfile.parquet", df)
df = DataFrame(read_parquet("myfile.parquet"))  # read_parquet returns a Parquet.Table
Arrow.write("myfile.arrow", df)
df = DataFrame(Arrow.Table("myfile.arrow"))
```
as `df` will be exactly the same at all points - the `DataFrame` will not change, irrespective of whether you read it in from CSV, Arrow, or Parquet format.
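You can check this equality yourself; a minimal sketch with a toy `DataFrame` (standing in for your real data) and an Arrow round trip:

```julia
using DataFrames, Arrow

df0 = DataFrame(x = 1:3, y = ["a", "b", "c"])

Arrow.write("roundtrip.arrow", df0)
df1 = DataFrame(Arrow.Table("roundtrip.arrow"))

# same columns, same values: writing out and reading back changed nothing
@assert df0 == df1
```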
There might be situations where it is beneficial to read a CSV and save it back out as Arrow (for faster reading on subsequent runs), but I can’t see a situation where it would make sense to go CSV → Parquet → Arrow, especially not in the same session when one just wants to analyse a `DataFrame`.
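If faster subsequent loads are the goal, a cache-on-first-read pattern is usually all you need. A sketch (the function name and file paths are hypothetical, not part of any package API):

```julia
using DataFrames, CSV, Arrow

function load_data(csvpath)
    arrowpath = replace(csvpath, r"\.csv$" => ".arrow")
    if isfile(arrowpath)
        return DataFrame(Arrow.Table(arrowpath))  # fast path on later runs
    end
    df = CSV.read(csvpath, DataFrame)             # slow path, first run only
    Arrow.write(arrowpath, df)                    # cache for next time
    return df
end
```

The first call parses the CSV and writes an Arrow copy alongside it; every later call reads the Arrow file instead, with no Parquet detour involved.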