Reading and writing Apache Arrow files

Hello,
I am writing my log files in Apache Arrow format using Arrow.jl (see User Manual · Arrow.jl).

Now I want to read them in Python, see "Feather File Format" (Apache Arrow v8.0.0 documentation).

But the Python documentation mentions different file formats, like Feather V1, Feather V2 and “Streaming format”, “File or Random Access format” and more …

In which format does Arrow.jl write its files?

And when should I use Arrow.jl and when Parquet.jl?

Some progress. The following (partially) works:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

print(array.schema)
    # time: double not null
    # orient: list<: float not null> not null
    #   child 0, : float not null
    # elevation: float not null
    # azimuth: float not null
    # l_tether: float not null
    # v_reelout: float not null
    # force: float not null
    # depower: float not null
    # steering: float not null
    # heading: float not null
    # course: float not null
    # v_app: float not null
    # vel_kite: list<: float not null> not null
    #   child 0, : float not null
    # X: list<: float not null> not null
    #   child 0, : float not null
    # Y: list<: float not null> not null
    #   child 0, : float not null
    # Z: list<: float not null> not null
    #   child 0, : float not null

print(array[0]) # time, works
print(array[1]) # orientation, works

# not working 
# table = pa.Table.from_arrays([array], names=["col1"])
# ArrowInvalid: Could not convert <pyarrow.lib.ChunkedArray object at 0x7f4255395eb8>
# [
#   [
#     0,
#     0.05,
#     0.1,
#     0.15000000000000002,
#     0.2,
#     ...
#     49.849999999999305,
#     49.8999999999993,
#     49.9499999999993
#   ]
# ] with type pyarrow.lib.ChunkedArray: did not recognize Python value type when inferring an Arrow data type

I want to rephrase my question: Can I modify my Julia export so that it is easier for Python to process the resulting file?

It works now:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)
    # time: double not null
    # orient: list<: float not null> not null
    #   child 0, : float not null
    # elevation: float not null
    # azimuth: float not null
    # l_tether: float not null
    # v_reelout: float not null
    # force: float not null
    # depower: float not null
    # steering: float not null
    # heading: float not null
    # course: float not null
    # v_app: float not null
    # vel_kite: list<: float not null> not null
    #   child 0, : float not null
    # X: list<: float not null> not null
    #   child 0, : float not null
    # Y: list<: float not null> not null
    #   child 0, : float not null
    # Z: list<: float not null> not null
    #   child 0, : float not null

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# works, but needs one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])
print(t_orient)
print("\n")

But I need to create one table per column, which I find strange. StructArrays in Julia seem to be much more powerful than Python tables or Pandas dataframes…

Even better: We can convert the arrow table to a Pandas dataframe:

import pandas as pd
import pyarrow as pa

print("Reading arrow file...")
mmap = pa.memory_map('../data/sim_log_uncompressed.arrow')

with mmap as source:
    array = pa.ipc.open_file(source).read_all()

# print(array.schema)

# print(array[0]) # time, works
# print(array[1]) # orientation, works

# the following works, but you need one table per column
t_time   = pa.Table.from_arrays([array[0]], names=["time"])
t_orient = pa.Table.from_arrays([array[1]], names=["orient"])

# this gives a single pandas DataFrame containing all columns
df = array.to_pandas()
print(df)

Columns that contain vectors get the dtype “object”; each cell holds a NumPy array, so they should be easy to work with.

To answer your first question: the Arrow files produced by Julia’s Arrow.write are in the format that the documentation for the Python and R packages calls Feather V2. You can read such a file directly into a Pandas data frame with pyarrow.feather.read_feather. I don’t tend to use the Python REPL directly; using PyCall within Julia, it looks like this:

julia> using PyCall

julia> feather = pyimport("pyarrow.feather");

julia> feather.read_feather("./biofast-data-v1/ex-rna.arrow")
PyObject         chromosome      start       stop
0             chr2  216499331  216501458
1             chr7  101239611  101245071
2            chr19   49487626   49491841
3            chr10   80155590   80169336
4            chr17   76270411   76271290
...            ...        ...        ...
8942864      chr11   59636724   59666963
8942865      chr15   66499314   66503529
8942866       chrX  153785767  153787586
8942867      chr17   81509969   81512767
8942868       chr1  182839364  182887745

[8942869 rows x 3 columns]