Displaying a parquet file in Arrow

Hi.

I am trying to add a Parquet file into Arrow. I tried to follow the Arrow.jl docs and implement it like this:

begin
	df = CSV.read("/home/onur/julia-assignment/temp.csv", DataFrame)
	file = file = "/home/onur/julia-assignment/temp.parquet"
	table = Arrow.write(file)
	write_parquet(file, df)	
end

I converted a CSV file to parquet and then brought it into Arrow. So when I try to get the dates and countries columns from my parquet file inside Arrow:

begin
	dates = names(table)[5:end]
	countries = unique(table[:, :"Country/Region"])
end

I get a MethodError:

MethodError: no method matching names(::String)

Closest candidates are:

names(!Matched::DataFrames.Index) at /home/onur/.julia/packages/DataFrames/oQ5c7/src/other/index.jl:34

names(!Matched::Module; all, imported) at reflection.jl:98

names(!Matched::DataFrames.SubIndex) at /home/onur/.julia/packages/DataFrames/oQ5c7/src/other/index.jl:425

...

My goal is to convert a CSV to Parquet, bring the Parquet file into Arrow and perform statistics and data analysis.

This is the third time you’ve asked questions that stem from the same confusion between a storage format and the in-memory representation of data read from that format, so let me try to clarify again:

When you do df = CSV.read("file.csv", DataFrame), CSV.jl reads the data stored in the file file.csv on your hard drive and turns it into a DataFrame object, which is stored in the RAM of your computer (and bound to the variable df).

When you do file = "/home/onur/julia-assignment/temp.parquet", you are just creating a variable called file, which references a string:

julia> file = "/home/onur/julia-assignment/temp.parquet"

"/home/onur/julia-assignment/temp.parquet"

When you then do Arrow.write(file), you’re just calling Arrow’s write function on a string, not on any actual data.

write_parquet(file, df) will actually write your data to the specified file path in parquet format. However, you then do:

dates = names(table)[5:end]

which doesn’t actually involve your data - you assigned table = Arrow.write(file) above, so table is actually an anonymous function (your MethodError suggests table is actually a string rather than a function, so maybe you’ve assigned it differently elsewhere?)
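To see the difference for yourself, here is a minimal sketch. The temporary path and the toy data are made up for illustration, and I'm assuming the one-argument form of Arrow.write only builds a function instead of writing anything:

```julia
using Arrow

# Hypothetical temporary path, used only for this illustration
path = joinpath(mktempdir(), "temp.arrow")

# One argument: no data is supplied, so nothing can be written yet
writer = Arrow.write(path)
println(typeof(path))         # the type of the path variable
println(writer isa Function)  # is `writer` a callable rather than a table?

# Two arguments: the data is actually written to disk
data = (a = [1, 2, 3], b = ["x", "y", "z"])  # a NamedTuple of vectors is a valid table
Arrow.write(path, data)

# Reading the file back is what gives you an in-memory table
table = Arrow.Table(path)
println(collect(table.a))
```

The point is that only the last two steps ever touch your data; everything before that is just a file name and a function.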

In any case, the main point remains: you should just perform your statistics and data analysis once you’ve read the data into a DataFrame; there is little point in doing (what it seems you are suggesting):

using DataFrames, CSV, Parquet, Arrow

df = CSV.read("myfile.csv", DataFrame)

write_parquet("myfile.parquet", df)

df = read_parquet("myfile.parquet")

Arrow.write("myfile.arrow", df)

df = DataFrame(Arrow.Table("myfile.arrow"))

as df will be exactly the same at all points - the DataFrame will not change, irrespective of whether you read it in from a csv, arrow, or parquet format.

There might be situations where it is beneficial to read a csv and save it back out as Arrow (for faster reading in on subsequent runs), but I can’t see a situation where it would make sense to go CSV → Parquet → Arrow, especially not in the same session when one just wants to analyse a DataFrame.
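As a rough sketch of what I mean: the toy DataFrame below is only a stand-in for your CSV (I'm guessing at four metadata columns followed by date columns, based on your names(table)[5:end]), and the temporary path is hypothetical:

```julia
using DataFrames, Arrow

# Toy stand-in for the real data; column names and values are assumptions
df = DataFrame(
    "Province/State" => ["p1", "p2", "p3"],
    "Country/Region" => ["A", "B", "A"],
    "Lat" => [1.0, 2.0, 3.0],
    "Long" => [4.0, 5.0, 6.0],
    "1/22/20" => [0, 1, 2],
    "1/23/20" => [3, 4, 5],
)

# Save once as Arrow, then read it back into a DataFrame
path = joinpath(mktempdir(), "temp.arrow")
Arrow.write(path, df)
df2 = DataFrame(Arrow.Table(path))

# The analysis is done on the DataFrame, not on the file path
dates = names(df2)[5:end]
countries = unique(df2[:, "Country/Region"])
println(dates)
println(countries)
```

The same two analysis lines work on df directly, which is why the round trip through Arrow is only worth it if you want faster loading in a later session.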


Your object table is a string.

This is because when you did Arrow.write(file), you wrote the data to a file and then Julia returned the string name of that file.

That is, table does not point to a table object, it’s just a string that is the name of the arrow file.

Are you sure about that?

julia> using Arrow

julia> file = file = "/home/onur/julia-assignment/temp.parquet"
"/home/onur/julia-assignment/temp.parquet"

julia> table = Arrow.write(file)
#97 (generic function with 1 method)

(although it would of course explain the MethodError complaining about String!)

Yeah, I think you aren’t hitting the right method. See here.

But you can’t hit that method unless you’re supplying a second argument, which the OP didn’t do above, right?

Ah, you are correct. But still, that’s probably OP’s issue.

Hi,

I was about to write something similar, as I have seen the same posts.

I believe the end goal for @oo92 is to experiment with the Arrow format with a view to replacing in-memory datasets for his analysis. I’m unclear on the need for the Parquet step, since no (obvious) partitioning attempt is made, and a direct CSV → Arrow conversion is achievable instead.

To that end:

sourceFileLocation = "/home/onur/julia-assignment/temp.csv"
arrowfile  = "/home/onur/julia-assignment/temp.arrow"
Arrow.write(arrowfile , CSV.File(sourceFileLocation, header = true))

This will create an Arrow file from the original CSV.

You can now interact with it as if it were an indexed array, or by column:

tableWithDatesIn =  Arrow.Table(arrowfile)
countries = tableWithDatesIn[Symbol("Country/Region")]  # build the Symbol explicitly, since the name contains a slash

for row in countries
   println(row)
end

You can also now index into the column or row as appropriate, but bear in mind that this format is column-oriented, so accessing across a row is not as performant.

Note: this was written on my phone, so apologies if copy-pasting the above doesn’t just work.

Given your other posts and Parquet comments, are you trying to establish the best binary storage and access strategy for querying in your analytics approach?

Regards,