I have 16 cores on my machine and queries seem to utilize only one of them. I would like my queries to be 16 times faster - so is it possible to run parallel queries on Arrow data (read in with Arrow.jl)?
For example, is there some kind of macro magic (that is how I see macros currently, as a newcomer to the Julia world) that could be used? Like it is done here
Arrow.jl returns an `Arrow.Table`, which is made up of concrete subtypes of `ArrowVector`, which implement the `AbstractArray` interface. So, for example, when doing `at = Arrow.Table(file); col1 = at.col1`, `col1` is an object that is a “view” into the raw arrow data in `file`. Indexing like `col1[1]` computes the exact byte offset of the value in the raw arrow data and returns the data.
All that is to say, there’s nothing automatic in Arrow.jl to utilize multiple cores, but you’re completely free and flexible to do parallel/concurrent processing however you’d like. You could spawn multithreaded tasks to operate over an array; you could assign different processors to process arrays separately, etc.
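For instance, here is a minimal sketch of the task-spawning approach, assuming a file `data.arrow` with a numeric, non-missing column `col1` (both names hypothetical) and Julia started with multiple threads (e.g. `julia -t 16`):

```julia
using Arrow

at = Arrow.Table("data.arrow")   # hypothetical file name
col1 = at.col1                   # zero-copy "view" into the raw arrow data

# Split the index range into one chunk per thread and reduce each chunk
# on its own task; the Arrow data itself is never copied.
chunks = Iterators.partition(eachindex(col1), cld(length(col1), Threads.nthreads()))
tasks = [Threads.@spawn sum(@view col1[idxs]) for idxs in chunks]
total = sum(fetch.(tasks))
```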
Currently, DataFrames.jl defines some operations to process data in parallel using multiple threads when the conditions are right (i.e. when it would actually benefit performance on large datasets), and I believe the Transducers.jl framework has some nice parallelism workflows (cc: @tkf). But yeah, it really depends on your workflow and what you’re trying to do.
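As a hedged sketch of what a Transducers.jl-style parallel reduction might look like (reusing the hypothetical `col1` column from above; `foldxt` is Transducers’ multi-threaded fold):

```julia
using Transducers

# Thread-parallel sum of squares over the column, without materializing
# an intermediate array; foldxt splits the work across available threads.
result = foldxt(+, Map(x -> x^2), col1)
```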
I actually use the arrow format for a reasonably heavy analytics layer and process across both rows and columns in a multithreaded/parallel way.
Recall that you have access to the actual indices of a column (one of the genuinely incredible aspects of Arrow, imo) and can therefore access a column in a partitioned-thread approach, like `Threads.@threads`, to query over a range.
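A minimal sketch of that partitioned-range idea, again on the hypothetical `col1` column and with a made-up predicate, keeping one partial result per chunk so no locking is needed:

```julia
# Count the values satisfying a predicate, one partial count per chunk.
nchunks = Threads.nthreads()
chunks  = collect(Iterators.partition(eachindex(col1), cld(length(col1), nchunks)))
partial = zeros(Int, length(chunks))
Threads.@threads for c in eachindex(chunks)
    partial[c] = count(i -> col1[i] > 0, chunks[c])   # the "query" over a range
end
n_matching = sum(partial)
```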
Thank you @quinnj for explaining Arrow’s inner workings, and thanks @djholiver for confirming that it can be done. However, I am new to Julia and currently only capable of utilizing ready-made packages/APIs (like DataFrames.jl or DataFramesMeta.jl). If there were an easier way to utilize modern hardware to the max, I believe that would help in gaining traction.
I wonder whether there are pointers to the record batches as well, and whether those record batches could be processed in a map-reduce fashion (and also in a zero-copy fashion)? Furthermore, could those pointers be handed to the GPU to let it do the processing?
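For what it’s worth, Arrow.jl’s `Arrow.Stream` iterates a file one record batch at a time (still zero-copy), so a map-reduce over batches might look roughly like this sketch; it assumes the file was written as multiple record batches, reuses the hypothetical `col1` name, and says nothing about the GPU side:

```julia
using Arrow

# Map step: reduce each record batch on its own task.
tasks = [Threads.@spawn sum(batch.col1) for batch in Arrow.Stream("data.arrow")]
# Reduce step: combine the per-batch results.
total = sum(fetch.(tasks))
```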