An example of an Apache Arrow file?

Hello,

I would like to test Julia’s support for Apache Arrow with a big file that doesn’t fit in memory (RAM). I can’t find an example of such a file anywhere; could you help me with this, please?

I guess I need a file in the Arrow IPC format (the Feather file format), version 2 (see, for example, Feather File Format — Apache Arrow v9.0.0).

I would like to read the file into a dataframe like this: df = DataFrame(Arrow.Table(the_big_file)), as exemplified here: User Manual · Arrow.jl
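
For reference, a minimal sketch of what I have in mind (the file name the_big_file.arrow is just a placeholder):

using Arrow, DataFrames

# Arrow.Table memory-maps the file by default, so opening it should not
# require reading the whole file into RAM.
tbl = Arrow.Table("the_big_file.arrow")

# copycols=false keeps the DataFrame backed by the memory-mapped Arrow columns
# instead of copying them into regular Julia arrays.
df = DataFrame(tbl; copycols=false)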

Thanks for creating Julia :+1:

I’d just find a large dataset and write it out; the NYC taxi dataset is a common example.

You’d download a couple of files, then do something like:

Arrow.write("taxi.arrow",
    Tables.partitioner(csv_files) do file
        CSV.File(file)
    end
)

Now you have a local Arrow file called "taxi.arrow". Hope that helps.

I believe the easiest place to obtain a lot of that data would be New York City Taxi and Limousine Commission (TLC) Trip Record Data - Registry of Open Data on AWS, or possibly https://opendata.cityofnewyork.us/

A few months ago I tried something similar that didn’t work, but I used parquet files as input, which I got from an R package.

Thanks for the super fast replies, @quinnj and @StatisticalMouse. I did something that combined your hints: I downloaded parquet files with R (by following Working with Arrow Datasets and dplyr • Arrow R Package) and then tried to combine them into a big Arrow file with using Parquet; Arrow.write("taxi.arrow", Tables.partitioner(read_parquet("."))). These lines seem to crash Julia; I just get a “Killed” message. I am on Julia 1.6.0.

Do you have any idea why Julia crashes on this?

Hmmm, not sure; it may be running out of memory. I don’t think Parquet.jl currently supports partitioned datasets, so it may be materializing the full parquet dataset in memory and then trying to write it all out to Arrow memory.
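
If each individual parquet file fits in memory on its own, a per-file variant of the CSV example above might avoid that, since only one file gets materialized at a time. A rough, untested sketch (it assumes read_parquet can read the files one by one and that they sit in the current directory):

using Arrow, Parquet, Tables

# hypothetical: collect the downloaded parquet files from the current directory
parquet_files = filter(endswith(".parquet"), readdir("."; join=true))

# read and convert one parquet file at a time instead of the whole dataset at once
Arrow.write("taxi.arrow", Tables.partitioner(read_parquet, parquet_files))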

That’s exactly what I tried earlier; it doesn’t work.

I also have a recollection that there were two parquet reading packages.

I ended up creating the big Arrow file with pyarrow, using the lines below:

import glob
import pyarrow as pa

# assumption: all input files share the same schema, so take it from the first one
with pa.ipc.open_file(sorted(glob.glob("path/to/files/*.arrow"))[0]) as reader:
    schema = reader.schema

with pa.output_stream("path/big.arrow") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for arrowfile in glob.glob("path/to/files/*.arrow", recursive=False):
            with pa.input_stream(arrowfile) as source:
                with pa.ipc.open_file(source) as reader:
                    for i in range(reader.num_record_batches):
                        writer.write_batch(reader.get_batch(i))

That led to another issue: How well Apache Arrow’s zero copy methodology is supported?