I started with Julia about a week ago. So far so good, although I suffered quite a bit before I managed to get the VSCode debugger to work properly. (It would hang indefinitely when reading large DataFrames.)
Now my issues are with parallelization. I want to call pmap from within a function. The idea is to go through all possible parameters in order to optimize another function. The sequential version of the code works fine. Simplifying a bit:
using Distributed   # for addprocs and @everywhere
import SomeModule
function optimize(my_dict)
    addprocs(4)
    @everywhere data = my_dict["Data"]
    @everywhere param_list = my_dict["Parameter List"]
    @everywhere SomeModule.another_function(data, param_list)
end
I found out, one by one, that I needed the @everywhere macros because the corresponding variables were initially not visible to the workers. Each time I added one of these macros, the complaint that the corresponding variable was not defined on some worker went away.
my_dict is defined in the main program file, which is not a module. I was able to make opt_foo available to the workers with an @everywhere macro, but I do not know how to get the dictionary argument my_dict to the workers. The function optimize is also in the main program, which does not have a module statement at the top. Any suggestions would be highly appreciated.
That is a good question. As I said, I am a beginner with Julia, so I am not clear on the answer. I have seen languages where threads do not provide true parallelism and are used only for I/O and similar tasks; in those languages all threads share a single core. I suppose it used to be like this before multicore computers appeared on the scene, since multithreading is quite old. I assume that is not the case with Julia.

My code would surely benefit from using threads, since the threads could share a large DataFrame and no time would be lost passing messages. On the other hand, I believe (though again I am not sure) that one has to take extra care not to let the threads overwrite each other's data. I do not think that would be an issue for my code, since I am using pmap (can that be used with threads?) and the reduction is done after the parallelized code completes its tasks.

So the bottom line is: I believe it can be done with threads, but I was somewhat wary of doing it for the reasons above. In Python and R I always used processes.
Julia has true threads, and a lot of things are thread-safe / friendly out of the box. I don’t know exactly what you’re doing, but for example, if you have an array, reading and writing disjoint indices from different threads is basically fully efficient – this is a good touchstone for what Julia offers.
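For instance, something along these lines is race-free and scales with the thread count (a minimal sketch; the function and the data are just made up for illustration):

using Base.Threads

# Each iteration writes to its own index, so no two tasks ever touch the
# same slot and no locking is needed.
function square_all(xs)
    out = similar(xs)
    @threads for i in eachindex(xs, out)
        out[i] = xs[i]^2
    end
    return out
end

square_all(collect(1.0:8.0))   # same result no matter how many threads Julia was started with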
Uh… I don’t think this is quite true yet, actually. Some things (e.g., Array) work great, but for many others it is hard to reason about thread safety (e.g., sparse matrices). Importantly, there’s no documentation on what is safe when.
Sorry to nitpick, but I’d be careful about such a claim. For some definitions of “true threads”, one can argue Python has “more true” threads than Julia. For example, Python has a more transparent OS thread API than Julia; Julia only has tasks (and that’s kinda the point). But unfortunately for Python programmers, its threading was designed in the 1990s, when threads for parallelism were not a thing (at least not for everyone), so it is useful mainly for I/O or for external code that releases the GIL. On the other hand, Julia has “more true” threads than Python in another sense, if one defines “threads” as a synonym for shared-memory parallelism.
If you have a rough idea of the applicability of process-based parallelism in the problem you have, I think starting from your comfort zone sounds like a good idea.
Going back to the problem in the OP, it’s not a good idea to use @everywhere inside a function. It’s mainly for “static things” like using Package and include("script.jl"). You’d probably want to use remotecall here (and maybe iterate over the worker ids returned from workers()).
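For concreteness, here is roughly what that could look like (just a sketch: I am assuming another_function can be called with the data plus a single parameter, which may not match the real signature):

using Distributed
addprocs(4)
@everywhere import SomeModule    # load the code on every worker, outside any function

function optimize(my_dict)
    data = my_dict["Data"]
    param_list = my_dict["Parameter List"]
    # remotecall ships its arguments to the chosen worker, so no @everywhere
    # is needed for data; just spread the parameters over the worker ids.
    futures = [remotecall(SomeModule.another_function, w, data, p)
               for (w, p) in zip(Iterators.cycle(workers()), param_list)]
    return fetch.(futures)    # wait for and collect all the results
end

Since pmap was mentioned, pmap(p -> SomeModule.another_function(data, p), param_list) should also work: the closure captures data and gets serialized to the workers, and passing a CachingPool(workers()) as the first argument to pmap avoids re-sending a large data on every call.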
I rewrote my code to use threads and it is working fine. Thanks for suggesting that.
I worry about having to define the number of cores (threads) before starting Julia (VSCode in my case). This can be done through the Windows command prompt, PowerShell, or the Julia extension settings (all three work fine; I put a short snippet of how I check it after my questions below). I have a few questions:
Can I change the number of threads in interactive mode? Can I kill threads and go back to single-threaded execution after the code that uses multithreading has finished?
If I cannot change the number of threads, then I have to run my whole program with more than one thread even though only a portion of it needs multithreading. Does this hurt the performance of the other sections of my program?
In order to take advantage of having more than one thread, I had to insert Threads.@threads before a for statement. Is this the only portion of my code that will run multithreaded? In other words, do I have to worry about race conditions and other multithreading-related problems when any portion of my code is running, or only when sections marked with Threads.@threads (or similar constructs) are running?
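For context, here is how I am setting and checking the thread count at the moment (the value 4 is just what I happen to use, and the --threads flag needs Julia 1.5 or newer):

# Before starting Julia: `julia --threads 4`, or set the JULIA_NUM_THREADS
# environment variable, or "julia.NumThreads" in the VSCode extension settings.
using Base.Threads
nthreads()    # reports the size of the thread pool the session was started with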
Thanks again for your help. I am much happier now that I can do multiprocessing with Julia.
1. No; technically, Julia starts with a pool of OS threads that is set at startup.
2. No, you don’t have to worry about worse performance.
3. Yes, you control which parts run in parallel. For finer control, you can use @spawn and/or many great packages, such as FLoops, to control how the parallelism works.
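For what it’s worth, a tiny sketch of that last point (the names and the work being done are made up):

using Base.Threads

function summarize(xs, ys)
    # Only what you explicitly mark runs on other threads: @spawn schedules
    # each call as a task on the thread pool, while the rest of the function
    # keeps running single-threaded on the current thread.
    a = @spawn sum(xs)
    b = @spawn sum(ys)
    serial = maximum(xs)               # ordinary serial work in the meantime
    return serial, fetch(a) + fetch(b)
end

summarize(rand(10^6), rand(10^6))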