This is the third time you’ve asked a question stemming from the same confusion between the storage format of data on disk and the in-memory representation of that data once it has been read in, so let me try to clarify again:
When you do `df = CSV.read("file.csv", DataFrame)`, CSV.jl reads the data stored in the file `file.csv` on your hard drive and turns it into a `DataFrame` object, which is stored in the RAM of your computer (and bound to the variable `df`).
When you do `file = "/home/onur/julia-assignment/temp.parquet"`, you are just creating a variable called `file`, which references a string:

```julia
julia> file = "/home/onur/julia-assignment/temp.parquet"
"/home/onur/julia-assignment/temp.parquet"
```
When you then do `Arrow.write(file)`, you’re just calling Arrow’s `write` function on a string, not on any actual data. `write_parquet(file, df)` will actually write your data to the specified file path in Parquet format. However, you then do:

```julia
dates = names(table)[5:end]
```
which doesn’t actually involve your data - you assigned `table = Arrow.write(file)` above, so `table` is actually an anonymous function (though your `MethodError` suggests `table` is a string rather than a function, so maybe you’ve assigned it differently elsewhere?)
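For reference, a minimal Arrow round trip with the data passed explicitly might look like this (using a small toy `DataFrame` and a made-up file name, not your actual data):

```julia
using Arrow, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

file = "temp.arrow"
Arrow.write(file, df)                  # data goes in as the second argument
table = DataFrame(Arrow.Table(file))   # read it back into a DataFrame

cols = names(table)                    # column names as a vector of strings
```

With the data actually written, `names(table)` then does what you expect.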
In any case, the main point remains: you should just perform your statistics and data analysis once you’ve read the data into a `DataFrame`; there is little point in doing what it seems you are suggesting:
```julia
using DataFrames, CSV, Parquet, Arrow
df = CSV.read("myfile.csv", DataFrame)
write_parquet("myfile.parquet", df)
df = DataFrame(read_parquet("myfile.parquet"))  # read_parquet returns a Parquet.Table
Arrow.write("myfile.arrow", df)
df = DataFrame(Arrow.Table("myfile.arrow"))
```
as `df` will be exactly the same at all points - the `DataFrame` will not change, irrespective of whether you read it in from CSV, Arrow, or Parquet format.
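You can check this equality yourself; a minimal sketch with a toy `DataFrame` (standing in for your real data) and an Arrow round trip:

```julia
using DataFrames, Arrow

df0 = DataFrame(x = 1:3, y = ["a", "b", "c"])

Arrow.write("roundtrip.arrow", df0)
df1 = DataFrame(Arrow.Table("roundtrip.arrow"))

# same columns, same values: writing out and reading back changed nothing
@assert df0 == df1
```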
There might be situations where it is beneficial to read a CSV and save it back out as Arrow (for faster reading on subsequent runs), but I can’t see a situation where it would make sense to go CSV → Parquet → Arrow, especially not in the same session when one just wants to analyse a `DataFrame`.
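If faster subsequent loads are the goal, a cache-on-first-read pattern is usually all you need. A sketch (the function name and file paths are hypothetical, not part of any package API):

```julia
using DataFrames, CSV, Arrow

function load_data(csvpath)
    arrowpath = replace(csvpath, r"\.csv$" => ".arrow")
    if isfile(arrowpath)
        return DataFrame(Arrow.Table(arrowpath))  # fast path on later runs
    end
    df = CSV.read(csvpath, DataFrame)             # slow path, first run only
    Arrow.write(arrowpath, df)                    # cache for next time
    return df
end
```

The first call parses the CSV and writes an Arrow copy alongside it; every later call reads the Arrow file instead, with no Parquet detour involved.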