Questions about getting started with parallel computing

FujiwaraTakumiEH · June 16, 2019, 6:09am

Background:

I have already written a program in Julia. To make it run faster, I now want to use parallel computing.

OpenMP C++:

When I used C++ before, I sped up my programs with OpenMP. As far as I know, it is very simple to use (I don’t specialize in parallel computing, so it may only look simple to me):

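#include <chrono>
#include <iostream>

// Busy work; volatile keeps the compiler from optimizing the loop away.
void test()
{
    volatile int a = 0;
    for (int i = 0; i < 100000000; i++)
        a = a + 1;
}

int main()
{
    // Wall-clock time; clock() reports CPU time on many platforms.
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 8; i++)
        test();
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "time: " << std::chrono::duration<double>(t2 - t1).count() << " s" << std::endl;
}

###############################################
#                 using OpenMP                #
###############################################

#include <chrono>
#include <iostream>

// Busy work; volatile keeps the compiler from optimizing the loop away.
void test()
{
    volatile int a = 0;
    for (int i = 0; i < 100000000; i++)
        a = a + 1;
}

int main()
{
    // Wall-clock time; clock() would add up the CPU time of all threads on many platforms.
    auto t1 = std::chrono::steady_clock::now();
    // Compile with -fopenmp (GCC/Clang) or /openmp (MSVC) so the pragma takes effect.
    #pragma omp parallel for
    for (int i = 0; i < 8; i++)
        test();
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "time: " << std::chrono::duration<double>(t2 - t1).count() << " s" << std::endl;
}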

Hope to get help:

I’ve read the parallel computing section of the Julia documentation, but I still don’t know how to parallelize my program.

I want to know whether Julia has something like OpenMP to speed up my program. I need to get parallel computing working quickly, which means the code changes should not be too complicated. I don’t need peak performance, just something comparable to OpenMP, where the time saved clearly scales with the number of cores in the computer.

Do you have a good solution (something as simple as OpenMP), recommended reading, or any other suggestions? Comments are welcome. Thank you!

Supplement:

1.

Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)

NumberOfCores:2
NumberOfLogicalProcessors:4

############################################################
#   I want to get this working on my own computer first,   #
#   then move to a more powerful workstation for the       #
#   real parallel runs.                                    #
############################################################

2.

Here is the structure of the main part of my program:

module calculate

using LinearAlgebra
using DelimitedFiles
using DataFrames

include("area.jl")
include("volume.jl")
# My own formulas for calculating area and volume

export f
export g
...

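function f(x1, x2, x3, x4)
    n = size(x1, 1)   # assumption: n is the number of rows of the inputs
    local v           # declared here so it is still visible after the loop
    for i = 1:n
        # index with the loop variable i, not with n
        v = volume(x1[i,1], x2[i,1], x3[i,1], x4[i,1])
        a = area(x1[i,1], x2[i,1], x3[i,1])   # named a so it does not shadow area()
    end
    return v
end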

function g(...)
    ...(Structure is similar to f())
end

...

end
##########################################
#    I will call this module later       #
##########################################

3.

For now I don’t want to use GPUs; I just want multicore parallel computing.

nilshg · June 16, 2019, 6:17am

This will depend heavily on your actual application (e.g. patterns of data transfer and memory access; single-machine multicore CPU vs. GPU vs. cluster).

I don’t think there’s a much better place to start than the docs, which you say you’ve read. If you could post a simple example of the actual Julia code you’re trying to parallelise, along with any errors you’re getting, people might be able to help.

I’m not familiar with C++, but the closest thing to a parallel for loop in other languages like MATLAB is:

using Distributed
addprocs(2)
@distributed for i = 1:10
    print(i)
end
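
As a minimal sketch of the same macro with a reducer, assuming two local worker processes (the count passed to addprocs is an assumption; pick one that matches your cores): with a reducer such as (+), the loop returns the combined value instead of a Task.

```julia
using Distributed

# Assumption: two local worker processes.
addprocs(2)

# With a reducer, @distributed splits the range across the workers,
# combines the value of the last expression of each iteration with (+),
# and blocks until the final result is available.
total = @distributed (+) for i = 1:10
    i^2
end

println(total)   # sum of the squares of 1..10, i.e. 385
```

Without a reducer the macro returns immediately, so side effects such as print may appear later and in any order.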


LaurentPlagne · June 16, 2019, 7:16am

Hi,
the corresponding (multithreaded) recipe in Julia is to put Threads.@threads in front of your outer loop:
Threads.@threads for i = 1:8

Under the conditions that the body of the loop is

parallel (no dependencies)
long enough to amortize the thread machinery
not already memory bound

it should give you some speed-up (if it works with OpenMP, it should work here too).

As usual, make sure that your sequential program is already well optimized (type stable, vectorized, …) before getting into parallelism.
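
A rough sketch of this recipe, mirroring the OpenMP example from the question. The function names work, run_serial and run_threaded are hypothetical stand-ins, and the workload is invented for illustration. Julia has to be started with several threads, for example by setting the JULIA_NUM_THREADS environment variable before launching it.

```julia
# Hypothetical stand-in for the C++ test(): independent, CPU-bound work.
function work(k)
    s = 0.0
    for j = 1:1_000_000
        s += sin(j * k)
    end
    return s
end

# Plain sequential loop, used as the baseline.
function run_serial(n)
    results = zeros(n)
    for i = 1:n
        results[i] = work(i)
    end
    return results
end

# Same loop with Threads.@threads in front of it.
function run_threaded(n)
    results = zeros(n)
    # Each iteration writes only to its own slot, so there is no data race.
    Threads.@threads for i = 1:n
        results[i] = work(i)
    end
    return results
end

run_serial(1); run_threaded(1)   # warm up so compilation is not timed
println("threads available: ", Threads.nthreads())
@time run_serial(8)
@time run_threaded(8)
```

With a single thread both timings should be about the same; any speed-up depends on the three conditions listed above.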

FujiwaraTakumiEH · June 16, 2019, 7:22am

Thank you very much, I added some content to the question.
I have a real problem:

What is the difference between using Distributed and the @threads mentioned in the Julia documentation?

FujiwaraTakumiEH · June 16, 2019, 7:26am

Thank you very much. I would also like to know: what is the difference between using Distributed and @threads?

johnh · June 16, 2019, 7:28am

Hello @FujiwaraTakumiEH, I have worked in high performance computing for many years.
I am going to say something to you respectfully: first think about using the features of Julia to make your code perform faster, then think about threads and parallelism.
For instance, you can use the @simd macro to make use of the vector units on your CPU.
And of course there are GPUs.

But first let us ask some Julia experts here to help you speed up the code.
For instance have you looked at broadcasts?
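
To make these two suggestions concrete, here is a small sketch. The function sumsq and the arrays x and y are made up for illustration and are not taken from the original program.

```julia
# Sum of squares with an explicit loop. @simd allows the compiler to
# reorder the additions so it can use vector instructions; @inbounds
# drops bounds checks, which is safe here because the loop runs over
# eachindex(x).
function sumsq(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i] * x[i]
    end
    return s
end

x = rand(1000)

# Broadcasting ("dot" syntax) applies the operations element-wise and
# fuses the whole right-hand side into a single loop.
y = 2 .* x .+ 1

println(sumsq(x) ≈ sum(abs2, x))
```

Both of these run on a single core; they speed up the sequential code before any threads or processes are involved.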

LaurentPlagne · June 16, 2019, 7:29am

Multithreading (MT) is based on threads, while distributed parallelism involves separate processes. In the former case, all threads access shared memory. MT is well suited to SMP computers (single nodes); it can be faster (and trickier) and consume less memory than distributed parallelism, and it can handle finer-grained parallelism. On the other hand, distributed parallelism can run on several nodes (a cluster) and can tackle larger scales.

johnh · June 16, 2019, 7:31am

I will also give my explanation of threads and Distributed.
Threads are lightweight processes - they share the same memory space as other threads.
So you have to be careful about one thread corrupting the data which another thread is working with.
Threads only work within one compute server - they are scheduled by the operating system.

Distributed processes are more coarse grained. But they can scale across compute servers.
The way I understand it, Julia distributed processes are separate Julia instances, started by an ssh process.
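
As an illustration of the point about threads sharing memory, here is a hedged sketch (the function name atomic_sum is hypothetical). If every thread updated one ordinary variable with total += i, the read-modify-write steps could interleave and lose updates; an atomic counter avoids that.

```julia
# All threads add into one shared counter. Threads.Atomic makes each
# addition indivisible, so no update is lost.
function atomic_sum(n)
    total = Threads.Atomic{Int}(0)
    Threads.@threads for i = 1:n
        Threads.atomic_add!(total, i)
    end
    return total[]   # read the current value of the atomic
end

println(atomic_sum(1000))   # 500500, whatever the number of threads
```

Atomics serialize the updates, so for heavy loops it is usually cheaper to give each iteration its own output slot and combine the results afterwards.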

johnh · June 16, 2019, 7:36am

This is an article about vectorisation in Julia.
I think your loop should vectorise well - each i is independent.
https://juliacomputing.com/blog/2017/09/27/auto-vectorization-in-julia.html

You should also consider the type of the array you use. Is N a fixed size?

LaurentPlagne · June 16, 2019, 7:46am

In addition to John’s advice, you may try the (still experimental) tool by @ffevote:
GFlops.jl
before using any parallelism (it does not work with MT). It lets you measure the actual computational speed of your implementation (in GFlops), so that you can estimate its efficiency.

FujiwaraTakumiEH · June 16, 2019, 7:52am

Thank you very much for your reply. I have used broadcasting in some places, as well as the map function. In the end, though, the main part of this program can only be written as a for loop. I have two specific questions:

1. Does using broadcast or map mean that a function is already parallel?
2. In my use case, I can only use multiple cores on one computer. If everything is set up correctly, should @threads be more efficient than Distributed?

LaurentPlagne · June 16, 2019, 7:58am

No, but yes: broadcast and map are not parallel, but they may compile to vectorized (SIMD) code, which is a special form of parallelism.
Yes, in principle: if you don’t need to scale across several nodes/machines, the overhead of @threads is much lower than that of Distributed, so it should be faster.

pszufe · June 16, 2019, 10:25am

Threading support in Julia is experimental, which in practice means that writing production parallel code is difficult because of various glitches. The simplest piece of code that crashes multi-threaded Julia is the following: Threads.@threads for i in 1:10 sleep(1) end. There are also hard-to-manage compiler race issues (two threads starting to compile the same function at the same time) that occasionally result in a crash.

However, support for multi-processing and distributed computing in Julia is brilliant. Depending on your scenario, look at the following when learning:

Start with the docs: Parallel Computing · The Julia Language. Learn the green-threading macros carefully, because you need them to control your distributed processes; you can skip multi-threading for the reasons above.
The two main building blocks of any distributed Julia code are @distributed and pmap; learn them carefully.
Have a look at ParallelDataTransfer.jl, and learn it along with the @spawnat macro and the fetch function. @spawnat is usually used together with the green threading mentioned above.
Now (depending on your needs) have a look at the following packages: SharedArrays.jl (many processes sharing data on a local machine) and DistributedArrays.jl (an array distributed over many local or remote processes).
For some data analytics jobs you might also consider tools such as JuliaDB.
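
A small sketch of two of the building blocks named above, pmap and @spawnat with fetch. It assumes two local workers, and slow_square is a hypothetical workload defined only for this example.

```julia
using Distributed
addprocs(2)   # assumption: two local workers

# Functions called on workers must be defined on every process.
@everywhere slow_square(x) = (sleep(0.1); x^2)   # hypothetical workload

# pmap hands one item at a time to whichever worker is free; it suits
# tasks that are few but individually expensive.
squares = pmap(slow_square, 1:8)

# @spawnat runs an expression on a chosen worker and returns a Future;
# fetch waits for the value and brings it back to the calling process.
w = first(workers())
fut = @spawnat w slow_square(12)

println(squares, " ", fetch(fut))
```

pmap returns its results in the order of the inputs, so no extra bookkeeping is needed to match them up.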

Hope this guide helps!

LaurentPlagne · June 16, 2019, 10:37am

BTW, what is the status of PARTR? Composable and safe nested parallelism would make Julia even more attractive (is that possible?).

pszufe · June 16, 2019, 12:13pm

Regarding PARTR, this is the post to track.

StefanKarpinski · June 16, 2019, 2:08pm

Implemented on master, lacking only an official API.

johnh · June 16, 2019, 8:09pm

I am really interested in GFlops.jl.
With Intel CPUs you have counters which return the exact number of floating point instructions issued and retired. These differ because of speculative execution, which brings in discussions of the Spectre and Meltdown exploits.
However, those counters are specific to Intel - we live in a world with AMD, ARM and other CPUs, of course.

carstenbauer · June 17, 2019, 7:51am

FWIW, you could take a look at https://github.com/crstnbr/julia-workshop/blob/master/5%20Parallel%20computing/parallel-computing.ipynb (an overview/tutorial I once prepared).

FujiwaraTakumiEH · June 22, 2019, 2:01pm

Hello, after reading your link, I would like to ask the following questions.
In your ipynb file you have the following:

Distributed loop, but no reduction
The following example might not be doing what you'd expect it to. Why?

a = zeros(10)
@distributed for i = 1:10
    a[i] = i
end

Note that @distributed without a reduction function returns a Task. It is basically a distributed version of @spawn for all the iterations.

I don’t know how to solve this problem.

Do you mean using SharedArrays to solve this problem?
If I want to add elements to an array whose size is not known in advance, what should I do? I tried the following code, but it didn’t work.

Serial version:

function f(n)
    ax = Vector{Float64}()
    for i = 1:n
        append!(ax,i)
    end
    return ax
end

f(5)
5-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0

Parallel version:

function f(n)
    ax = SharedArray{Float64}()
    @distributed for i = 1:n
        append!(ax,i)
    end
    return ax
end

f(5)
0-dimensional SharedArray{Float64,0}:
0.0

I think a SharedArray has to be allocated with a fixed size up front, so append! may not be available for SharedArrays. But if I want to achieve my goal (adding elements to an array of unknown size, in parallel), what should I do?
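
One possible way to approach both questions, as a sketch rather than a definitive answer. It assumes two local workers, and the function names fill_shared and collect_evens are hypothetical. For a known size, a preallocated SharedArray plus @sync makes the loop finish before the array is read. For an unknown size, each iteration can return a small (possibly empty) vector, with vcat as the reducer.

```julia
using Distributed
addprocs(2)                      # assumption: two local workers
@everywhere using SharedArrays

# Known size: preallocate a SharedArray and wait with @sync, because
# @distributed without a reducer returns before the work is done.
function fill_shared(n)
    a = SharedArray{Float64}(n)
    @sync @distributed for i = 1:n
        a[i] = i
    end
    return a
end

# Unknown size: every iteration yields a vector with zero or more
# elements, and vcat concatenates them into one result, so no shared
# storage has to grow.
function collect_evens(n)
    return @distributed (vcat) for i = 1:n
        iseven(i) ? [Float64(i)] : Float64[]
    end
end

println(fill_shared(5))
println(collect_evens(10))
```

The condition iseven(i) only stands in for whatever decides, per iteration, how many elements are produced.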
