Reading and writing Apache Arrow files

Hello,
I am writing my log files in Apache Arrow format using Arrow.jl (see the Arrow.jl User Manual).

Now I want to read them in Python (see Feather File Format — Apache Arrow v9.0.0).

But the Python documentation mentions several different file formats: Feather V1, Feather V2, the “Streaming format”, the “File or Random Access format”, and more …

In which format does Arrow.jl write its files?

And when should I use Arrow.jl, and when Parquet.jl?

Some progress. The following (partially) works:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

print(array.schema)
    # time: double not null
    # orient: list<: float not null> not null
    #   child 0, : float not null
    # elevation: float not null
    # azimuth: float not null
    # l_tether: float not null
    # v_reelout: float not null
    # force: float not null
    # depower: float not null
    # steering: float not null
    # heading: float not null
    # course: float not null
    # v_app: float not null
    # vel_kite: list<: float not null> not null
    #   child 0, : float not null
    # X: list<: float not null> not null
    #   child 0, : float not null
    # Y: list<: float not null> not null
    #   child 0, : float not null
    # Z: list<: float not null> not null
    #   child 0, : float not null

print(array[0]) # time, works
print(array[1]) # orientation, works

# not working 
# table = pa.Table.from_arrays([array], names=["col1"])
# ArrowInvalid: Could not convert <pyarrow.lib.ChunkedArray object at 0x7f4255395eb8>
# [
#   [
#     0,
#     0.05,
#     0.1,
#     0.15000000000000002,
#     0.2,
#     ...
#     49.849999999999305,
#     49.8999999999993,
#     49.9499999999993
#   ]
# ] with type pyarrow.lib.ChunkedArray: did not recognize Python value type when inferring an Arrow data type

I want to rephrase my question: Can I modify my Julia export such that it is easier for Python to process the resulting file?

It works now:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)   # schema output as in the previous post

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# works: one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
print(t_orient)
print("\n")

But I need to create one table per column, which I find strange. StructArrays in Julia seem to be much more powerful than PyArrow tables or Pandas dataframes…

Even better: We can convert the arrow table to a Pandas dataframe:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# the following works, but you need one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])

# this gives just one table; 
table = array.to_pandas()
print(table)

Columns that contain vectors have dtype “object” and hold NumPy arrays, which should be easy to work with.


To answer your first question, the Arrow files produced by Julia’s Arrow.write are in what the documentation for the Python and R packages calls Feather V2. You can read such a file directly into a Pandas data frame using pyarrow.feather.read_feather. I don’t tend to use the Python REPL directly; using PyCall within Julia it looks like

julia> using PyCall

julia> feather = pyimport("pyarrow.feather");

julia> feather.read_feather("./biofast-data-v1/ex-rna.arrow")
PyObject         chromosome      start       stop
0             chr2  216499331  216501458
1             chr7  101239611  101245071
2            chr19   49487626   49491841
3            chr10   80155590   80169336
4            chr17   76270411   76271290
...            ...        ...        ...
8942864      chr11   59636724   59666963
8942865      chr15   66499314   66503529
8942866       chrX  153785767  153787586
8942867      chr17   81509969   81512767
8942868       chr1  182839364  182887745

[8942869 rows x 3 columns]