Reading and writing Apache Arrow files

Hello,
I am writing my log files in Apache Arrow format using Arrow.jl (see the Arrow.jl User Manual).

Now I want to read them in Python (see Feather File Format — Apache Arrow v9.0.0).

But the Python documentation mentions several different file formats: Feather V1, Feather V2, the “Streaming format”, the “File or Random Access format”, and more …

In which format does Arrow.jl write its files?

And when should I use Arrow.jl, and when Parquet.jl?

Some progress. The following (partially) works:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

print(array.schema)
    # time: double not null
    # orient: list<: float not null> not null
    #   child 0, : float not null
    # elevation: float not null
    # azimuth: float not null
    # l_tether: float not null
    # v_reelout: float not null
    # force: float not null
    # depower: float not null
    # steering: float not null
    # heading: float not null
    # course: float not null
    # v_app: float not null
    # vel_kite: list<: float not null> not null
    #   child 0, : float not null
    # X: list<: float not null> not null
    #   child 0, : float not null
    # Y: list<: float not null> not null
    #   child 0, : float not null
    # Z: list<: float not null> not null
    #   child 0, : float not null

print(array[0]) # time, works
print(array[1]) # orientation, works

# not working 
# table = pa.Table.from_arrays([array], names=["col1"])
# ArrowInvalid: Could not convert <pyarrow.lib.ChunkedArray object at 0x7f4255395eb8>
# [
#   [
#     0,
#     0.05,
#     0.1,
#     0.15000000000000002,
#     0.2,
#     ...
#     49.849999999999305,
#     49.8999999999993,
#     49.9499999999993
#   ]
# ] with type pyarrow.lib.ChunkedArray: did not recognize Python value type when inferring an Arrow data type

I want to rephrase my question: Can I modify my Julia export such that it is easier for Python to process the resulting file?

It works now:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)   # schema output as in the previous post

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# works: one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
print(t_orient)
print("\n")

But I need to create one table per column, which I find strange. StructArrays in Julia seem to be much more powerful than PyArrow tables or Pandas dataframes…

Even better: We can convert the arrow table to a Pandas dataframe:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# the following works, but you need one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])

# this gives just one table; 
table = array.to_pandas()
print(table)

Columns that contain vectors have dtype “object” and hold NumPy arrays, which should be easy to work with.


To answer your first question, the Arrow files produced by Julia’s Arrow.write are in what the documentation for the Python and R packages calls Feather V2. You can read such a file directly into a Pandas data frame using pyarrow.feather.read_feather. I don’t tend to use the Python REPL directly; using PyCall within Julia it looks like

julia> using PyCall

julia> feather = pyimport("pyarrow.feather");

julia> feather.read_feather("./biofast-data-v1/ex-rna.arrow")
PyObject         chromosome      start       stop
0             chr2  216499331  216501458
1             chr7  101239611  101245071
2            chr19   49487626   49491841
3            chr10   80155590   80169336
4            chr17   76270411   76271290
...            ...        ...        ...
8942864      chr11   59636724   59666963
8942865      chr15   66499314   66503529
8942866       chrX  153785767  153787586
8942867      chr17   81509969   81512767
8942868       chr1  182839364  182887745

[8942869 rows x 3 columns]