I find myself having to process a lot of data lately where (a) the rows can be processed independently of one another and (b) processing each row is expensive. I end up writing a version of the following every time:
using DataFrames
using Base.Threads: @spawn, nthreads

myrows = rand(20)

function processmyrow(row) # stands in for some expensive function
    sleep(5)
    return DataFrame(r = row)
end

# One output buffer per chunk; indexing by chunk number avoids relying on
# threadid(), which is unsafe here because tasks can migrate between threads
myparts = collect(Iterators.partition(myrows, cld(length(myrows), nthreads())))
outdfs = [DataFrame() for _ in eachindex(myparts)]
@sync for (i, part) in enumerate(myparts)
    @spawn for row in part
        append!(outdfs[i], processmyrow(row))
    end
end
Is there a better way of doing this? I have started looking into a couple of packages that exploit parallelism, but I find them quite hard to comprehend. Could someone point me in the right direction here?
Edit: I do not care about the order of the result as I will usually just sort it afterwards.
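Since order does not matter, one pattern worth considering (a sketch in plain Base.Threads, with a hypothetical `expensiverow` function standing in for the real per-row work) is to have each task `put!` its result onto a buffered `Channel` and concatenate at the end, which avoids per-thread buffers and `threadid()` indexing entirely:

```julia
using DataFrames
using Base.Threads: @spawn

# Hypothetical stand-in for the expensive per-row function
expensiverow(x) = DataFrame(r = x, r2 = x^2)

function processall(rows)
    # Buffered channel sized to hold every result, so workers never block on put!
    results = Channel{DataFrame}(length(rows))
    @sync for row in rows
        @spawn put!(results, expensiverow(row))
    end
    close(results)
    # Rows arrive in whatever order the tasks finished; sort afterwards if needed
    return reduce(vcat, collect(results))
end

outdf = processall(rand(20))
```

The `@sync` block waits for all spawned tasks, so by the time the channel is closed every result is buffered and `collect` drains them all.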
If the function on a row takes a sufficiently long time (e.g. 1 ms or more), you should be fine spawning a task per row and constructing a DataFrame from the results:
df = DataFrame(a = 1:10, b = 11:20)
processmyrow(row) = (a = row.a, b = row.b, c = row.a^2 + row.b) # returns a NamedTuple
# One task per row; fetch collects the NamedTuples, which the DataFrame
# constructor consumes directly
outdf = DataFrame(fetch.([Threads.@spawn processmyrow(row) for row in eachrow(df)]))
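The same idea also works when each task returns a small DataFrame rather than a NamedTuple (a sketch with a hypothetical `processmyrow_df` variant, not from the original post): concatenate with `reduce(vcat, ...)`. Since `fetch.` collects the tasks in creation order, the output rows match the input order, so no sorting step is needed:

```julia
using DataFrames

df = DataFrame(a = 1:10, b = 11:20)

# Hypothetical variant returning a one-row DataFrame per input row
processmyrow_df(row) = DataFrame(a = row.a, b = row.b, c = row.a^2 + row.b)

# Spawn one task per row, then fetch in creation order and concatenate
tasks = [Threads.@spawn processmyrow_df(row) for row in eachrow(df)]
outdf = reduce(vcat, fetch.(tasks))
```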