Hello,
I am writing my log files in Apache Arrow format using Arrow.jl (see the Arrow.jl User Manual).
Now I want to read them in Python (see "Feather File Format" in the Apache Arrow v9.0.0 documentation).
But the Python documentation mentions different file formats, like Feather V1, Feather V2 and “Streaming format”, “File or Random Access format” and more …
In which format does Arrow.jl write its files?
And when should I use Arrow.jl and when Parquet.jl?
Some progress. The following (partially) works:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
print(array.schema)
# time: double not null
# orient: list<: float not null> not null
# child 0, : float not null
# elevation: float not null
# azimuth: float not null
# l_tether: float not null
# v_reelout: float not null
# force: float not null
# depower: float not null
# steering: float not null
# heading: float not null
# course: float not null
# v_app: float not null
# vel_kite: list<: float not null> not null
# child 0, : float not null
# X: list<: float not null> not null
# child 0, : float not null
# Y: list<: float not null> not null
# child 0, : float not null
# Z: list<: float not null> not null
# child 0, : float not null
print(array[0]) # time, works
print(array[1]) # orientation, works
# not working
# table = pa.Table.from_arrays([array], names=["col1"])
# ArrowInvalid: Could not convert <pyarrow.lib.ChunkedArray object at 0x7f4255395eb8>
# [
# [
# 0,
# 0.05,
# 0.1,
# 0.15000000000000002,
# 0.2,
# ...
# 49.849999999999305,
# 49.8999999999993,
# 49.9499999999993
# ]
# ] with type pyarrow.lib.ChunkedArray: did not recognize Python value type when inferring an Arrow data type
I want to rephrase my question: Can I modify my Julia export so that it is easier for Python to process the resulting file?
It works now:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
# print(array.schema)
# time: double not null
# orient: list<: float not null> not null
# child 0, : float not null
# elevation: float not null
# azimuth: float not null
# l_tether: float not null
# v_reelout: float not null
# force: float not null
# depower: float not null
# steering: float not null
# heading: float not null
# course: float not null
# v_app: float not null
# vel_kite: list<: float not null> not null
# child 0, : float not null
# X: list<: float not null> not null
# child 0, : float not null
# Y: list<: float not null> not null
# child 0, : float not null
# Z: list<: float not null> not null
# child 0, : float not null
# print(array[0]) # time, works
# print(array[1]) # orientation, works
# works, but needs one table per column
t_time = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
print(t_orient)
print("\n")
But I need to create one table per column, which I find strange. StructArrays in Julia seem to be much more powerful than Python tables or Pandas dataframes…
Even better: We can convert the arrow table to a Pandas dataframe:
import pandas as pd
import pyarrow as pa
print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')
with mmap as source:
    array = pa.ipc.open_file(source).read_all()
# print(array.schema)
# print(array[0]) # time, works
# print(array[1]) # orientation, works
# the following works, but you need one table per column
t_time = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
# this gives just one table;
table = array.to_pandas()
print(table)
Columns that contain vectors have dtype “object”; each cell holds a numpy array… Should be easy to work with…
To answer your first question, the Arrow files produced by Julia’s Arrow.write are in what is called Feather V2 in the documentation for the Python and R packages. You can access such a file directly as a Pandas data frame using pyarrow.feather.read_feather. I don’t tend to use the Python REPL directly. Using PyCall within Julia, it looks like this:
julia> using PyCall
julia> feather = pyimport("pyarrow.feather");
julia> feather.read_feather("./biofast-data-v1/ex-rna.arrow")
PyObject chromosome start stop
0 chr2 216499331 216501458
1 chr7 101239611 101245071
2 chr19 49487626 49491841
3 chr10 80155590 80169336
4 chr17 76270411 76271290
... ... ... ...
8942864 chr11 59636724 59666963
8942865 chr15 66499314 66503529
8942866 chrX 153785767 153787586
8942867 chr17 81509969 81512767
8942868 chr1 182839364 182887745
[8942869 rows x 3 columns]