Questions about getting started with parallel computing

Background:

I have already written a program in Julia. To make it faster, I now want to solve the problem with parallel computing.

OpenMP C++:

I know that when I used C++ before, I used OpenMP to speed up my programs. As far as I know, it is very simple to use (I don't specialize in parallel computing, so it may only look simple to me):

#include <iostream>
#include <time.h>
void test()
{
    int a = 0;
    for (int i=0;i<100000000;i++)
        a++;
}
int main()
{
    clock_t t1 = clock();
    for (int i=0;i<8;i++)
        test();
    clock_t t2 = clock();
    std::cout<<"time: "<<t2-t1<<std::endl;
}

###############################################
#                 using OpenMP                #                           
###############################################

#include <iostream>
#include <time.h>
void test()
{
    int a = 0;
    for (int i=0;i<100000000;i++)
        a++;
}
int main()
{
    clock_t t1 = clock();
    #pragma omp parallel for
    for (int i=0;i<8;i++)
        test();
    clock_t t2 = clock();
    std::cout<<"time: "<<t2-t1<<std::endl;
}

What I hope to get help with:

I’ve read the parallel computing section of the Julia documentation, but I still don’t know how to make my program parallel.

I want to know whether Julia has a method like OpenMP to speed up my program. I need to implement parallel computing in a short period of time (which means the changes to the code must not be too complicated). I don’t require very high performance; just like with OpenMP, the time saved should clearly scale with the number of cores in the computer.

If you have any good solutions (similar to the simple OpenMP approach), recommended reading materials, or other suggestions for me, you are welcome to comment below. Thank you! :blush:

Supplement:

1.

Julia Version 1.1.1
Commit 55e36cc308 (2019-05-16 04:10 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, ivybridge)

NumberOfCores:2
NumberOfLogicalProcessors:4

############################################################
#      I want to implement it on my own computer first,    #
#      then have a better performance workstation for      #
#      formal parallel work.                               #
############################################################

2.

Here is the structure of the main part of my program:

module calculate

using LinearAlgebra
using DelimitedFiles
using DataFrames

include("area.jl")
include("volume.jl")
# My own formulas for calculating area and volume

export f
export g
...

function f(x1, x2, x3, x4)
    v = 0.0
    for i = 1:n   # n = number of rows, defined elsewhere
        v = volume(x1[i,1], x2[i,1], x3[i,1], x4[i,1])
        a = area(x1[i,1], x2[i,1], x3[i,1])   # renamed so it does not shadow area()
    end
    return v
end

function g(...)
    ...(Structure is similar to f())
end

...

end
##########################################
#    I will call this module later       #
##########################################

3.

Currently I don’t want to get into parallel computing on GPUs; I just want to implement multicore parallelism.

2 Likes

This will heavily depend on your actual application (e.g. patterns of data transfer and memory access, single-machine multi-core CPU vs GPU vs cluster).

I don’t think there’s a much better place to start than the docs, which you say you’ve read. If you could post a simple example of the actual bit of Julia code you’re trying to parallelise, along with any errors you’re getting, people might be able to help.

I’m not familiar with C++, but the thing most closely related to a parallel for in other languages like MATLAB is using Distributed; addprocs(2); @distributed for i = 1:10 print(i) end
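Spelled out a little more (a minimal sketch, assuming you want two local worker processes; the work function is just a made-up example):

using Distributed
addprocs(2)                  # start two local worker processes

@everywhere work(i) = i^2    # define the function on every worker

# @distributed with a reduction operator combines the per-iteration results
total = @distributed (+) for i = 1:10
    work(i)
end
println(total)               # 385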


2 Likes

Hi,
the corresponding (multithreaded) recipe in Julia is to put Threads.@threads in front of your outer loop:
Threads.@threads for i = 1:8

Under the conditions that the body of the loop is

  • parallel (no dependencies)
  • long enough to amortize the thread machinery
  • not already memory bound

it should give you some speed-up :wink: (if it works with OpenMP, it should work here too)

As usual advice, make sure that your sequential program is already well optimized (type-stable, vectorized, …) before entering the parallel (//) business.
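For illustration, a minimal sketch of what that looks like for the C++ example above (assumptions: Julia is started with e.g. JULIA_NUM_THREADS=4, and the empty counting loop is replaced by a small floating-point sum so the compiler cannot optimize the work away):

# check the number of threads with Threads.nthreads()
function test()
    s = 0.0
    for i in 1:10_000_000
        s += sqrt(i)          # some real work
    end
    return s
end

function run_serial()
    for i in 1:8
        test()
    end
end

function run_threaded()
    Threads.@threads for i in 1:8   # iterations are split across the threads
        test()
    end
end

run_serial(); run_threaded()    # warm up (compile) both versions first
@time run_serial()
@time run_threaded()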

2 Likes

Thank you very much, I have added some content to the question.
I also have a concrete question:

What is the difference between using Distributed and the @threads mentioned in the Julia documentation?

1 Like

Thank you very much :smiley:. I would also like to know: what is the difference between using Distributed and @threads?

1 Like

Hello @FujiwaraTakumiEH, I have worked in high-performance computing for many years.
I am going to say something to you respectfully: first think about using the features of Julia to make your code perform faster, then think about threads and parallelism.
For instance, you can use the @simd macro to make use of the vector units on your CPU.
And of course there are GPUs.

But first let us ask some Julia experts here to help you speed up the code.
For instance, have you looked at broadcasting?
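For illustration, a minimal sketch of both ideas (the array x and the loop here are made-up examples, not your code):

x = rand(10_000)

# broadcasting: apply a scalar expression elementwise with the dot syntax
y = sqrt.(x) .+ 2 .* x

# @simd: hint to the compiler that the loop iterations may be vectorized
function simd_sum(v)
    s = 0.0
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end

simd_sum(x)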

Multithreading (MT) is based on threads, while distributed parallelism involves separate processes. In the former case all threads access shared memory. MT is well adapted to SMP computers (single nodes), can be faster (and trickier), and consumes less memory than distributed parallelism. It can also handle finer-grained parallelism. On the other hand, distributed parallelism can run on several nodes (a cluster) and can tackle larger scales.

I will also give my explanation of threads and Distributed.
Threads are lightweight processes: they share the same memory space as other threads.
So you have to be careful about one thread corrupting the data that another thread is working with.
Threads only work within one compute server; they are scheduled by the operating system.

Distributed processes are more coarse-grained, but they can scale across compute servers.
The way I understand it, Julia's distributed processes are separate Julia instances, started locally or (for remote machines) via an ssh process.
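A minimal sketch of that distinction ("host1" and "host2" are placeholder machine names, and the remote variant assumes passwordless ssh with Julia installed on those machines):

using Distributed

addprocs(4)                       # four worker processes on the local machine
# addprocs(["host1", "host2"])    # or: workers on remote machines, launched via ssh

@everywhere hello() = "hello from worker $(myid())"

for p in workers()
    println(fetch(@spawnat p hello()))
end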

1 Like

This is an article concerning vectorisation in Julia.
I think your loop should vectorise well - each i is independent.
https://juliacomputing.com/blog/2017/09/27/auto-vectorization-in-julia.html

You should also consider the type of the array you use. Is n a fixed size?

1 Like

In addition to John’s advice you may try the (still experimental) tool by @ffevote:
GFlops.jl
before using any parallelism (it does not work with MT). It lets you measure the actual computational speed of your implementation (in GFlops) so that you can estimate its efficiency.
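If I remember the package's README correctly, usage looks roughly like this (the @count_ops and @gflops macros below are my recollection of its API, so please check the GFlops.jl README for the authoritative version):

using GFlops            # ] add GFlops

x = rand(1000)

@count_ops sum(x)       # count the floating-point operations in the call
@gflops sum($x)         # time the call and report an estimated GFlops rate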

1 Like

Thank you very much for your reply. :grin: I have used some broadcasting in the program, as well as the map function, but in the end the main part of the program can only be written as a for loop. I have two specific questions:

  1. Does using broadcast or map mean that the function is already parallel?
  2. In my use case I can only use multiple cores on one computer for parallel computing. If everything is set up correctly, should @threads then be more efficient than Distributed?

  1. No, but in a sense yes: broadcast and map are not parallel by themselves, but they may produce a vectorized binary (SIMD), which is a special form of parallelism.
  2. Yes. In principle, if you don’t need to scale across several nodes/machines, the @threads overhead is much lower than Distributed's and it should be faster.

Threading support in Julia is experimental, and in practice that means writing production parallel code is difficult due to various glitches. The easiest piece of code that crashes multi-threaded Julia is the following: Threads.@threads for i in 1:10 sleep(1) end. There are also hard-to-manage compiler race issues (two threads starting to compile the same function at the same time) that occasionally result in a crash as well.

However, support for multi-processing and distributed computing in Julia is brilliant. Depending on your scenario, look at the following when learning:

  • Start with the docs: Parallel Computing · The Julia Language. Learn the green-threading (task) macros carefully, because you need them to control your distributed processes; you can skip multi-threading for the reasons above.
  • The two main building blocks for any distributed Julia code are @distributed and pmap - learn them carefully (see the sketch after this list).
  • Have a look at ParallelDataTransfer.jl - and learn it along with the @spawnat macro and the fetch function. @spawnat is usually used together with the green threading mentioned above.
  • Now (depending on your needs) have a look at the following packages: SharedArrays.jl (many processes sharing data on a local machine) and DistributedArrays.jl (arrays distributed over many local or remote processes).
  • For some data-analytics jobs you might also consider tools such as JuliaDB.
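To make those building blocks concrete, a minimal sketch (assuming two local workers; slow_square is a throwaway example function):

using Distributed
addprocs(2)

@everywhere slow_square(i) = (sleep(0.1); i^2)

# pmap distributes the calls over the available workers
squares = pmap(slow_square, 1:10)

# @spawnat runs an expression on a given worker and returns a Future;
# fetch waits for it and brings the result back
fut = @spawnat 2 slow_square(7)
println(fetch(fut))    # 49
println(squares)       # [1, 4, 9, ..., 100]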

Hope this guide helps!

1 Like

BTW, what is the status of PARTR? Composable and safe nested parallelism would make Julia even more (is that possible?) attractive :wink:

1 Like

Regarding PARTR, this is the post to track :slight_smile:

Implemented on master, lacking only an official API.

5 Likes

I am really interested in GFlops.jl.
With Intel CPUs you have hardware counters which return the exact number of floating-point instructions issued and retired. These differ due to speculative execution, which brings in the discussions around the Spectre and Meltdown exploits.
However, those counters are specific to Intel - we live in a world with AMD, ARM and other CPUs, of course.

2 Likes

FWIW, you could take a look at https://github.com/crstnbr/julia-workshop/blob/master/5%20Parallel%20computing/parallel-computing.ipynb (an overview/tutorial I once prepared).

5 Likes

Hello :smile:, after reading your link, I would like to ask the following questions:
In your ipynb file you have the following:

Distributed loop, but no reduction
The following example might not be doing what you'd expect it to. Why?
a = zeros(10)
@distributed for i = 1:10
    a[i] = i
end

Note that @distributed without a reduction function returns a Task. It is basically a distributed version of @spawn for all the iterations.


I don’t know how to solve this problem. :sweat_smile:

  1. Do you mean using SharedArrays to solve this problem?
  2. If I want to push elements onto an array whose final size I don’t know, what should I do? I tried the following code but it didn’t work.

Serial version:

function f(n)
    ax = Vector{Float64}()
    for i = 1:n
        append!(ax,i)
    end
    return ax
end

f(5)
5-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0

Parallel version:

function f(n)
    ax = SharedArray{Float64}()
    @distributed for i = 1:n
        append!(ax,i)
    end
    return ax
end

f(5)
0-dimensional SharedArray{Float64,0}:
0.0

I think SharedArrays must pre-allocate a memory area of a specified size, so there may be no append! for SharedArrays. But if I want to achieve my goal (adding elements to an array of unknown size, in parallel), what should I do?
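A minimal sketch of the two usual workarounds (SharedArrays are indeed fixed-size, so there is no append! for them; the first variant assumes the final length n is known up front, the second collects per-iteration chunks with pmap and concatenates them afterwards):

using Distributed
addprocs(2)
@everywhere using SharedArrays   # make SharedArrays available on every process

# (a) preallocate when the final length is known; @sync waits until
#     all distributed iterations have finished before using the result
function f_shared(n)
    ax = SharedArray{Float64}(n)
    @sync @distributed for i = 1:n
        ax[i] = i
    end
    return ax
end

# (b) when the final length is unknown, let each call return its own
#     (possibly empty) chunk and concatenate the chunks afterwards
function f_collect(n)
    parts = pmap(1:n) do i
        iseven(i) ? [Float64(i)] : Float64[]   # variable number of elements per i
    end
    return reduce(vcat, parts)
end

f_shared(5)     # 5-element SharedArray: 1.0 … 5.0
f_collect(5)    # [2.0, 4.0]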