How would I parallelize a for loop that's iterating over columns of a dataframe

Hi how’s it going?

I’ve read up on Distributed.jl, addprocs, and @everywhere. I tried parallelizing one of my functions by making sure all the necessary packages, functions, and variables were preceded by @everywhere. I have 9 CPUs. When I ran the code and benchmarked it with @btime, nothing improved; in fact, it got a little worse.

Can someone guide me on how I would parallelize a for loop that’s iterating over a dataframe’s columns? Essentially, if I have 9 columns and 9 CPUs, wouldn’t I be able to run each iteration simultaneously and make the loop 9 times faster?

I’m also new to distributed / parallel computing, so please forgive me if my understanding isn’t quite accurate.

Some example code of what you tried would be helpful. However, I suspect you will get better performance with Threads rather than Distributed: threads share memory within a single process, while Distributed has to copy your data over to the worker processes, and for a job like this that overhead can easily swallow the speedup.

With threads, make sure you start up Julia with JULIA_NUM_THREADS set to 9 or greater. You might want to use 18 if each core can handle 2 threads. If you are using bash, the easiest way is:

JULIA_NUM_THREADS="9" julia

On Windows I think it would be:

set JULIA_NUM_THREADS=9
julia

Or if you are using Julia in Atom, it’s a configuration setting.
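Once Julia has started, you can check that the setting took effect:

julia> Threads.nthreads()
9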

I haven’t used DataFrames but I think this will work:

Threads.@threads for col in [ :a, :b, :c, :d, :e, :f, :g, :h, :i ]
    local data = df[!, col]
    # Code goes here
end

Each iteration of the loop will be executed in a thread in parallel.
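
If each iteration produces a result you want to keep, a common pattern is to preallocate one slot per iteration so that no two threads ever write to the same place. A minimal sketch, using sum as a stand-in for your per-column work and assuming the columns are numeric:

cols = [ :a, :b, :c, :d, :e, :f, :g, :h, :i ]
results = Vector{Float64}(undef, length(cols))
Threads.@threads for i in eachindex(cols)
    local data = df[!, cols[i]]
    results[i] = sum(data)   # each thread writes only to its own slot
end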

Ok, let me give an example of the code, that’ll definitely help. Say I have a dictionary of dataframes and a list of keys I want to pass into the for loop to iterate over each dataframe in the dictionary. I don’t want to index all the dataframes in the dictionary, only certain ones.

user1 = "user1"

usercolumn = Symbol(user1)

df1 = DataFrame()
df2 = DataFrame()

df_dict = Dict([("first_frame", df1), ("second_frame", df2)])

for col in keys(df_dict)
    temp = df_dict[col][!, usercolumn] .== user1
    data = df_dict[col][temp, :]
    # ... and then all the rest of the code here
end

df_dict is 15 elements long and my max thread count is 16. I’m thinking that if I can just allocate each of those dictionary lookups and user filters to its own thread, it should run much faster, right? I could look up one user in 15 dataframes simultaneously instead of one after the other.

What are your thoughts?

Small question: why do you need the local in the body of the for loop? Don’t for loops introduce their own scope?

That’s just my convention. I prefer to explicitly state the scope of my variables. I’m not 100% sure on all the rules for when a variable is global or local. Like, if I didn’t use local and someone for some reason in the future created a global data variable, would my loop suddenly use that? Or what if I declared a data variable in the function my loop is in, would that start being used?

If I declare local when I initialize the variable, I know that the variable will be local to that loop.
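
For example, inside a function an assignment in a for loop body reuses a variable of the same name from the enclosing function scope, and local rules that out. A minimal sketch:

using DataFrames

function process(df::DataFrame)
    data = 0                     # imagine someone adds this later
    for col in names(df)
        local data = df[!, col]  # fresh loop-local variable; never touches the outer data
        # ... work on data ...
    end
    return data                  # still 0, regardless of what the loop did
end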

To expand on this: I’ve been programming for a while, and part of my concern is the maintainability (and readability) of the code. Think of someone (or me) coming in a year later and adding a data variable to the parent function, or a data variable to the global scope. In that situation I would prefer not to have unintended consequences.

That seems like it would work. However, the dict should be declared like:

df_dict = Dict("first_frame" => df1, "second_frame" => df2)

Then the loop needs to be declared like this (keys, not names, gives you a Dict’s keys, and Threads.@threads needs an indexable collection, hence the collect):

Threads.@threads for col in collect(keys(df_dict))

If you don’t need the names “first_frame”, “second_frame”, etc., you could just create a list:

Threads.@threads for df in [ df1, df2 ]

But if you want the names to indicate what the data frame is, you could even do something like this (again with a collect, since Threads.@threads can’t iterate a Dict directly):

Threads.@threads for (name, df) in collect(Dict("first_frame" => df1, "second_frame" => df2))
    temp = df[!, usercolumn] .== user1
    data = df[temp, :]
    # ...
end

This “saves” you the dictionary lookup inside the loop, but with only 16 entries that lookup is fast and adds very little overhead, so it is not really necessary.
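
Putting it all together, here is a minimal self-contained sketch of the whole thing, with toy frames standing in for your real ones and a preallocated results vector so each thread writes only to its own slot:

using DataFrames

user1 = "user1"
usercolumn = Symbol(user1)

# toy frames standing in for your 15 real ones
df1 = DataFrame(user1 = ["user1", "other"], val = [1, 2])
df2 = DataFrame(user1 = ["other", "user1"], val = [3, 4])

df_dict = Dict("first_frame" => df1, "second_frame" => df2)

entries = collect(df_dict)                   # indexable, so Threads.@threads can split it
results = Vector{DataFrame}(undef, length(entries))

Threads.@threads for i in eachindex(entries)
    name, df = entries[i]
    local temp = df[!, usercolumn] .== user1 # rows belonging to this user
    results[i] = df[temp, :]                 # each thread writes only to its own slot
end

With 15 frames and 16 threads each entry gets its own thread, which is the situation you described.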