I’ve read up on Distributed.jl, addprocs, and @everywhere. I tried parallelizing one of my functions, making sure all the necessary packages, functions, and variables were preceded by @everywhere. I have 9 CPUs. When I ran the code and benchmarked it with @btime, nothing improved; in fact, it got a little worse.
Can someone guide me on how to parallelize a for loop that’s iterating over a DataFrame’s columns? Essentially, if I have 9 columns and 9 CPUs, shouldn’t I be able to run each iteration simultaneously and make the loop 9 times faster?
I’m also new to distributed / parallel computing, so please forgive me if my understanding isn’t quite accurate.
Some example code of what you tried would be helpful. However, I suspect you will get better performance with Threads rather than Distributed.
With threads, make sure you start Julia with JULIA_NUM_THREADS set to 9 or greater. You might want 18 if each core can handle 2 threads. If you are using bash, the easiest way is:
JULIA_NUM_THREADS="9" julia
On Windows I think it would be:
set JULIA_NUM_THREADS=9
julia
Or if you are using Julia in Atom, it’s a configuration setting.
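Once Julia is up, you can confirm the setting took effect by checking the thread count:

julia> Threads.nthreads()
9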
I haven’t used DataFrames but I think this will work:
Threads.@threads for col in [:a, :b, :c, :d, :e, :f, :g, :h, :i]
    local data = df[!, col]
    # Code goes here
end
The iterations of the loop will be split across the available threads and run in parallel.
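If each iteration produces a result, a common pattern is to preallocate an output and have each iteration write only to its own slot, so the threads never race on shared state. A minimal sketch, assuming df has numeric columns :a through :i (the sum is just a placeholder for your per-column work):

using DataFrames

cols = [:a, :b, :c, :d, :e, :f, :g, :h, :i]
results = Vector{Float64}(undef, length(cols))  # one slot per column

Threads.@threads for i in eachindex(cols)
    local data = df[!, cols[i]]
    results[i] = sum(data)  # placeholder: each thread writes only to results[i]
end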
OK, let me give an example of the code; that’ll definitely help. Say I have a dictionary of DataFrames and a list of keys I want to pass into the for loop, so I can iterate over each DataFrame in the dictionary. I don’t want to index all the DataFrames in the dictionary, only certain ones:
for col in keys(df_dict)
    temp = df_dict[col][!, usercolumn] .== user1
    data = df_dict[col][temp, :]
    # and then all the rest of the code goes here
end
df_dict is 15 elements long, and my max threads is 16. I’m thinking that if I can allocate each of those dictionary lookups and user filters to its own thread, it should work much faster, right? I could look up one user simultaneously in 15 DataFrames instead of one after the other.
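A minimal sketch of that idea, assuming df_dict maps Symbol keys to DataFrames and usercolumn / user1 are defined as in your snippet (selected_keys is a hypothetical name for the subset of keys you actually want):

using DataFrames

selected_keys = collect(keys(df_dict))  # or a hand-picked list like [:a, :b, :c]

Threads.@threads for key in selected_keys
    local temp = df_dict[key][!, usercolumn] .== user1
    local data = df_dict[key][temp, :]
    # per-DataFrame work goes here; if you need to keep results, write into
    # a preallocated array indexed by position rather than pushing to a
    # shared collection
end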
That’s just my convention. I prefer to explicitly state the scope of my variables. I’m not 100% sure on all the rules for when a variable is global or local. Like, if I didn’t use local and for some reason someone in the future created a global data variable, would my loop suddenly use that? Or what if I declared a data variable in the function my loop is in, would that start being used?
If I declare local when I initialize the variable, I know that the variable will be local to that loop.
To expand on this, I’ve been programming for a while, and part of my concern is maintainability (and readability) of the code, like someone (or me) coming in a year later and adding a data variable to the parent function or to the global scope. In this situation I would prefer not to have unintended consequences.
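A small illustration of that concern at top-level (global) scope; with local, the loop’s variable can never be captured by a later-added global of the same name:

data = "global value"  # imagine someone adds this a year later

for i in 1:1
    local data = "loop value"  # `local` guarantees a fresh variable here
    println(data)              # prints "loop value"
end

println(data)  # still "global value"; the loop never touched it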