An example of an Apache Arrow file?

Hello,

I would like to test Julia’s support for Apache Arrow with a big file that doesn’t fit in memory (RAM). I can’t find an example of such a file anywhere; could you help me with this, please?

I guess I need a file in the Arrow IPC format (the Feather file format), version 2 (see, for example, Feather File Format — Apache Arrow v9.0.0).

I would like to read the file into a dataframe like this: df = DataFrame(Arrow.Table(the_big_file)), as exemplified here: User Manual · Arrow.jl
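
For reference, a minimal sketch of what I have in mind (the file name the_big_file.arrow is just a placeholder):

using Arrow, DataFrames

# Arrow.Table memory-maps the file by default, so opening it should not
# require reading the whole file into RAM.
tbl = Arrow.Table("the_big_file.arrow")

# copycols=false keeps the DataFrame backed by the memory-mapped Arrow columns
# instead of copying them into regular Julia arrays.
df = DataFrame(tbl; copycols=false)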

Thanks for creating Julia :+1:

I’d just find a large dataset and write it out; the NYC taxi dataset is a common example.

You’d download a couple of files, then do something like:

Arrow.write("taxi.arrow",
    Tables.partitioner(csv_files) do file
        CSV.File(file)
    end
)

Now you have a local Arrow file called "taxi.arrow". Hope that helps.

I believe the easiest place to obtain a lot of that data would be New York City Taxi and Limousine Commission (TLC) Trip Record Data - Registry of Open Data on AWS, or possibly https://opendata.cityofnewyork.us/

A few months ago I tried something similar that didn’t work, but I used parquet files as input, which I got from an R package.

Thanks for the super fast replies, @quinnj and @StatisticalMouse. I did something that combined your hints: I downloaded parquet files with R (by following Working with Arrow Datasets and dplyr • Arrow R Package) and then tried to combine them into a big Arrow file with using Parquet; Arrow.write("taxi.arrow", Tables.partitioner(read_parquet("."))). These lines seem to crash Julia; I just get a “Killed” message. I am on Julia 1.6.0.

Do you have any idea why Julia crashes on this?

Hmmm, not sure; it may be running out of memory. I don’t think Parquet.jl currently supports partitioned datasets, so it may be materializing the full parquet dataset in memory and then trying to write it all out to Arrow memory.
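
If each individual parquet file fits in memory on its own, a per-file variant of the CSV example above might avoid that, since only one file gets materialized at a time. A rough, untested sketch (it assumes read_parquet can read the files one by one and that they sit in the current directory):

using Arrow, Parquet, Tables

# hypothetical: collect the downloaded parquet files from the current directory
parquet_files = filter(endswith(".parquet"), readdir("."; join=true))

# read and convert one parquet file at a time instead of the whole dataset at once
Arrow.write("taxi.arrow", Tables.partitioner(read_parquet, parquet_files))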

That’s exactly what I tried earlier; it doesn’t work.

I also have a recollection that there were two parquet reading packages.

I ended up creating the big Arrow file with pyarrow, using the lines below:

import glob
import pyarrow as pa

# assumption: all input files share the same schema, so take it from the first one
with pa.ipc.open_file(sorted(glob.glob("path/to/files/*.arrow"))[0]) as reader:
    schema = reader.schema

with pa.output_stream("path/big.arrow") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for arrowfile in glob.glob("path/to/files/*.arrow", recursive=False):
            with pa.input_stream(arrowfile) as source:
                with pa.ipc.open_file(source) as reader:
                    for i in range(reader.num_record_batches):
                        writer.write_batch(reader.get_batch(i))

That led to another issue: How well Apache Arrow’s zero copy methodology is supported?