Maybe my head is stuck in a loop right now but I have a question about dealing with arrow data.
I have some data stored in a few large arrow files. I currently read the data via Arrow.Table("file.arrow") |> DataFrame and then filter the resulting DataFrame to reduce it to the subset I want to work with. However, this feels very inefficient: first read all the data, only to immediately throw most of it away.
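Roughly, this is what I do today (the :id == 42 condition is just a stand-in for my real predicate):

```julia
using Arrow, DataFrames

# read *all* the data into a DataFrame first ...
df = Arrow.Table("file.arrow") |> DataFrame

# ... and only then cut it down to the rows I actually want
wanted = filter(row -> row.id == 42, df)
```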
How can I search through the Arrow file while accessing it and directly filter the data stream? Thanks for any pointers, I definitely have the feeling that I’m missing something crucial here.
Hi @fbanning ,
the documentation on Arrow.Table actually has quite a bit of information on this.
In particular:
Arrow.Table first “mmapped” the data.arrow file, which is an important technique for dealing with data larger than available RAM on a system. By “mmapping” a file, the OS doesn’t actually load the entire file contents into RAM at the same time, but file contents are “swapped” into RAM as different regions of a file are requested.
and:
Tables.datavaluerows(Arrow.Table(file)) |> @map(...) |> @filter(...) |> DataFrame: use Query.jl's row-processing utilities to map, group, filter, mutate, etc. directly over arrow data.
(I haven’t tried it myself, but the way I interpret that example is that the data is only loaded into memory one row at a time.)
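An untested sketch of what that pattern would look like (the column names :id and :score are made up):

```julia
using Arrow, DataFrames, Query, Tables

file = "data.arrow"

# rows are processed one at a time, so only matching rows
# ever get materialized into the final DataFrame
df = Tables.datavaluerows(Arrow.Table(file)) |>
     @filter(_.id == 42) |>
     @map({_.id, doubled = _.score * 2}) |>
     DataFrame
```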
Note that DataFrame(Arrow.Table(...)) creates a DataFrame whose columns are special Arrow types which already benefit from mmapping, without making any copy. So AFAIK calling filter on the result is already the fastest and most memory-efficient solution to extract a subset of data. Query.jl is only needed if you want to apply an intermediate processing step via @map before taking a subset, and without making an additional copy.
The key point is that Arrow is designed to be read efficiently from disk without any copy, contrary to e.g. CSV.
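If you want to convince yourself of that, a small sketch (the file name is a placeholder; note that depending on your DataFrames.jl version you may need to pass copycols=false explicitly to keep the Arrow column types):

```julia
using Arrow, DataFrames

tbl = Arrow.Table("data.arrow")       # mmaps the file; cheap even for huge files
df = DataFrame(tbl; copycols=false)   # wrap the Arrow columns without copying

typeof(df[!, 1])                      # an Arrow array type, not a plain Vector
```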
I think this is the part that I have missed. Will look into how to work with this, thanks.
For comparison, this is what I’m currently doing: Arrow.Table(file) |> DataFrame |> filter(...). I just didn’t know that I could simply switch around the pieces of the pipeline - very nice if that’s how it works.
So there’s no need to do as described above?
Edit: Ah, so the intermediate steps are good for avoiding an additional copy when filtering, is that correct?
If by “intermediate steps” you mean @map, then no, it’s not really needed to avoid copies. Generally you can just call filter first and then apply any transformations you need to the (smaller) resulting data set. Or if you want to avoid doing any copy at all, you can do filter(..., view=true). Query.jl is nice but I don’t think it’s really more efficient here.
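A minimal sketch of that filter-first workflow (column names :id and :score are placeholders, and the transformation is arbitrary):

```julia
using Arrow, DataFrames

df = DataFrame(Arrow.Table("data.arrow"))

# filter first, then transform only the (much smaller) matching subset
small = filter(:id => ==(42), df)
transform!(small, :score => ByRow(sqrt) => :score_sqrt)

# or avoid the copy made by filter altogether:
small_view = filter(:id => ==(42), df; view=true)   # a SubDataFrame
```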
In general, what you should keep in mind is that all DataFrames.jl operations are eager. They are materialized immediately (unless you ask for a view, in which case no copying happens).
So for example if you do filter in DataFrames.jl without view=true it will copy data to a new data frame.
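A quick way to see the difference (:x is a placeholder column):

```julia
using Arrow, DataFrames

df = DataFrame(Arrow.Table("data.arrow"))

sub = filter(:x => >(0), df)        # eager: allocates a new data frame
sub.x === df.x                      # false -- the column data was copied

subv = filter(:x => >(0), df; view=true)
parent(subv) === df                 # true -- just a view, nothing copied
```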
I appreciate all the replies above and they are mostly spot-on.
I imagine part of the confusion here is probably having worked with arrow data in other language implementations. Most other language implementations I’m aware of include the ability to pass in various “table operations” when reading the arrow data directly, e.g. filter, aggregate, etc. And those operations are “pushed down” and applied while reading the data so that the “table” object that is returned respects those operations.
In the Julia implementation, we could potentially support something like that, but as mentioned above, the current “read” operation is essentially just an mmap of the raw data along with metadata processing, which is extremely cheap. You can even wrap these “arrow arrays/columns” in a DataFrame and get access to the whole family of operations supported there very easily. And this isn’t limited to DataFrames; as also mentioned, you can use the Query.jl row-processing framework, or the operations in DataFramesMeta; TypedTables or IndexedTables are two other “table” frameworks. There’s also the TableOperations.jl package, which supports some common operations like column selection, mapping, and filtering.
All these packages support operating on any generic “table” input, including Arrow.Table. And due to the immutable nature of arrow data, unnecessary copies of the data can be avoided by default.
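For example, something like this should work with TableOperations.jl (untested; :id and :score are made-up column names):

```julia
using Arrow, DataFrames, Tables, TableOperations

# lazily filter and select over the mmapped arrow data,
# materializing only the surviving rows at the very end
df = Arrow.Table("data.arrow") |>
     TableOperations.filter(row -> Tables.getcolumn(row, :id) == 42) |>
     TableOperations.select(:id, :score) |>
     DataFrame
```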
So essentially we don’t support these kinds of operations while reading in Arrow.jl because there are quite a few options for processing the data efficiently post-read. In this way, all the code in the Arrow.jl package stays purely focused on reading/writing the arrow format, and it ends up being a significantly smaller/simpler implementation compared to other language implementations with similar format coverage.