Hello,
I am writing my log files in Apache Arrow format using Arrow.jl (see the Arrow.jl User Manual).
Now I want to read them in Python (see "Feather File Format" in the Apache Arrow v9.0.0 documentation).
But the Python documentation mentions different file formats, like Feather V1, Feather V2 and “Streaming format”, “File or Random Access format” and more …
In which format does Arrow.jl write its files?
And when should I use Arrow.jl, and when Parquet.jl?
Some progress. The following (partially) works:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
print(array.schema)
# time: double not null
# orient: list<: float not null> not null
# child 0, : float not null
# elevation: float not null
# azimuth: float not null
# l_tether: float not null
# v_reelout: float not null
# force: float not null
# depower: float not null
# steering: float not null
# heading: float not null
# course: float not null
# v_app: float not null
# vel_kite: list<: float not null> not null
# child 0, : float not null
# X: list<: float not null> not null
# child 0, : float not null
# Y: list<: float not null> not null
# child 0, : float not null
# Z: list<: float not null> not null
# child 0, : float not null
print(array[0]) # time, works
print(array[1]) # orientation, works
# not working
# table = pa.Table.from_arrays([array], names=["col1"])
# ArrowInvalid: Could not convert <pyarrow.lib.ChunkedArray object at 0x7f4255395eb8>
# [
# [
# 0,
# 0.05,
# 0.1,
# 0.15000000000000002,
# 0.2,
# ...
# 49.849999999999305,
# 49.8999999999993,
# 49.9499999999993
# ]
# ] with type pyarrow.lib.ChunkedArray: did not recognize Python value type when inferring an Arrow data type
I want to rephrase my question: Can I modify my Julia export so that it is easier for Python to process the resulting file?
It works now:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
# print(array.schema)
# time: double not null
# orient: list<: float not null> not null
# child 0, : float not null
# elevation: float not null
# azimuth: float not null
# l_tether: float not null
# v_reelout: float not null
# force: float not null
# depower: float not null
# steering: float not null
# heading: float not null
# course: float not null
# v_app: float not null
# vel_kite: list<: float not null> not null
# child 0, : float not null
# X: list<: float not null> not null
# child 0, : float not null
# Y: list<: float not null> not null
# child 0, : float not null
# Z: list<: float not null> not null
# child 0, : float not null
# print(array[0]) # time, works
# print(array[1]) # orientation, works
# the following works, but needs one table per column
t_time = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
print(t_orient)
print("\n")
But I need to create one table per column, which I find strange. StructArrays in Julia seem to be much more powerful than Python tables or Pandas dataframes…
Even better: We can convert the arrow table to a Pandas dataframe:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
# print(array.schema)
# print(array[0]) # time, works
# print(array[1]) # orientation, works
# the following works, but you need one table per column
t_time = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
# this gives a single pandas DataFrame with all columns:
table = array.to_pandas()
print(table)
The columns that contain vectors have dtype “object”; each cell holds a numpy array… Should be easy to work with…
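One way to work with such an object column is to stack the per-row arrays into a single 2-D numpy array. This is only a sketch: the small DataFrame below is a hypothetical stand-in for the result of `to_pandas()`, and it assumes every row's vector has the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by to_pandas():
# each cell of "orient" holds a small numpy array.
df = pd.DataFrame({
    "time": [0.0, 0.05, 0.1],
    "orient": [np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32) for _ in range(3)],
})

print(df["orient"].dtype)  # object

# Stack the per-row arrays into one array of shape (rows, components).
# This only works if all rows have the same length.
orient = np.stack(df["orient"].to_numpy())
print(orient.shape)  # (3, 4)

# First component over time, e.g. for plotting against df["time"].
q0 = orient[:, 0]
```

After stacking, ordinary numpy slicing applies, which is usually more convenient than looping over the object column.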
To answer your first question, the Arrow files produced by Julia’s Arrow.write are in what is called Feather V2 in the documentation for the Python and R packages. You can access such a file directly as a Pandas data frame using pyarrow.feather.read_feather. I don’t tend to use the Python REPL directly. Using PyCall within Julia it looks like
julia> using PyCall
julia> feather = pyimport("pyarrow.feather");
julia> feather.read_feather("./biofast-data-v1/ex-rna.arrow")
PyObject          chromosome      start       stop
0              chr2  216499331  216501458
1              chr7  101239611  101245071
2             chr19   49487626   49491841
3             chr10   80155590   80169336
4             chr17   76270411   76271290
...             ...        ...        ...
8942864       chr11   59636724   59666963
8942865       chr15   66499314   66503529
8942866        chrX  153785767  153787586
8942867       chr17   81509969   81512767
8942868        chr1  182839364  182887745
[8942869 rows x 3 columns]