Arrow.jl returns an `Arrow.Table`, which is made up of concrete subtypes of `ArrowVector`, which implement the `AbstractArray` interface. So, for example, when doing `at = Arrow.Table(file); col1 = at.col1`, `col1` is an object that is a "view" into the raw arrow data in `file`. Indexing like `col1[i]` computes the exact byte offset of the value in the raw arrow data and returns it.
All that is to say, there's nothing automatic in Arrow.jl to utilize multiple cores, but you're completely free and flexible to do parallel/concurrent processing however you'd like. You could spawn multithreaded tasks to operate over an array, you could assign different processes to handle separate arrays, etc.
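For instance, since an Arrow column satisfies the `AbstractArray` interface, one option is to chunk its index range and spawn a task per chunk. Here's a minimal sketch using only Base threading; a plain `Vector` stands in for the Arrow column, but the same pattern applies unchanged to something like `at.col1`:

```julia
using Base.Threads: @spawn

# Split the index range into `n` roughly equal chunks and sum each
# chunk on its own task. Works for any AbstractArray, including an
# Arrow.jl column, since we only rely on indexing and views.
function threaded_sum(col::AbstractArray, n::Int=Threads.nthreads())
    chunks = Iterators.partition(eachindex(col), cld(length(col), n))
    tasks = [@spawn sum(@view col[idxs]) for idxs in chunks]
    return sum(fetch.(tasks))
end

col = collect(1:1_000_000)
threaded_sum(col) == sum(col)  # true
```

Start Julia with `julia -t auto` (or set `JULIA_NUM_THREADS`) so the tasks actually get scheduled across multiple threads.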
Currently, DataFrames.jl defines some operations to process data in parallel using multiple threads when the conditions are right (i.e. when it would actually benefit performance on large datasets), and I believe the Transducers.jl framework has some nice parallelism workflows (cc: @tkf). But yeah, it really depends on your workflow and what you're trying to do.