Tutorials or examples on multithreading and distributed processing for numerical PDEs

Hello everyone,

I migrated to Julia from Fortran + Matlab less than a month ago. I started learning C++ last year to write MPI codes but found it too cumbersome.

In Julia, my main goal is to write parallel codes for numerical PDEs, either with multithreading (@threads, @floop, etc.) or with multiprocessing (@distributed, DistributedArrays.jl, etc.). I understand the toy examples in the documentation, but I am not making much progress towards a real implementation.

I need to see a few PDE examples, something as simple as the Laplace or Poisson equation, solved with finite/boundary elements, finite differences, or finite volumes, with the solver implemented in parallel. Preferably something explained in a pedagogical style: for example, a code that solves the problem in serial, then in parallel, and then shows a plot of time versus number of threads/processes.
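To be concrete, here is roughly the kind of comparison I am hoping to see, written out as a quick sketch of my own (a single Jacobi sweep for the 2D Laplace equation on a square grid, serial and with Threads.@threads; the function names are just placeholders):

using BenchmarkTools

# One Jacobi sweep for the 2D Laplace equation.
# Boundary values of u stay fixed; interior points are replaced by the
# average of their four neighbours.
function jacobi_sweep!(unew, u)
    n, m = size(u)
    for j = 2:m-1, i = 2:n-1
        unew[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
    end
    return unew
end

# Same sweep with the column loop split across threads. Each thread writes
# a disjoint set of columns of unew, so there is no data race.
function jacobi_sweep_threaded!(unew, u)
    n, m = size(u)
    Threads.@threads for j = 2:m-1
        for i = 2:n-1
            unew[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
        end
    end
    return unew
end

n = 2048
u = zeros(n, n); u[:, end] .= 1.0   # simple Dirichlet data
unew = copy(u)

@btime jacobi_sweep!($unew, $u)
@btime jacobi_sweep_threaded!($unew, $u)   # start Julia with e.g. julia -t 4

What I would like to find is this kind of code for a complete solver, together with the time-versus-threads (or processes) plot.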

In my own research I use the boundary element method (BEM), but any PDE example will do. I first wrote a BEM code in serial, which runs much faster than Matlab and about as fast as Fortran. I then finished a version using DistributedArrays, but it is really slow. I have also tried multithreading, but I run into data-race issues. I am happy to show the code, but I thought it would be better to learn from some good examples first.
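The races I run into are of the usual kind: several threads accumulating into the same entries of a shared array. A minimal illustration of the pattern (not my actual code; accumulate_racy and accumulate_safe are made-up names):

using Base.Threads

# Racy: every task does a read-modify-write on the same entries of acc,
# so updates can be lost.
function accumulate_racy(n)
    acc = zeros(4)
    @threads for k = 1:n
        acc[mod1(k, 4)] += 1.0   # data race
    end
    return acc
end

# Race-free: one private buffer per chunk of the iteration space,
# combined only after all threads have finished.
function accumulate_safe(n)
    nchunks = nthreads()
    buffers = [zeros(4) for _ = 1:nchunks]
    @threads for c = 1:nchunks
        for k = c:nchunks:n
            buffers[c][mod1(k, 4)] += 1.0
        end
    end
    return reduce(+, buffers)
end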

So far, I have watched the JuliaAcademy videos on parallel computing and the MIT computational thinking course (18.S191, MIT Fall 2020) on YouTube.

Thank you very much in advance.


You’re already doing it in parallel by default, because the dense linear algebra goes through BLAS, which is multithreaded. For distributed-memory (MPI) linear algebra there is Elemental.jl. There’s really not much to show.
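For example, you can check how many threads BLAS already uses and time a dense solve with and without them (a rough illustration; the numbers depend on your machine):

using LinearAlgebra, BenchmarkTools

A = rand(4000, 4000); b = rand(4000)

BLAS.get_num_threads()    # threads OpenBLAS is using, no Julia flags needed
@btime $A \ $b            # multithreaded dense solve

BLAS.set_num_threads(1)
@btime $A \ $b            # same solve restricted to a single thread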


Thank you very much for the prompt response. I have written a minimal boundary element method code for a 2D sine-shaped filament. It solves the Stokes equations using a boundary integral formulation.

Are you saying I will not benefit from parallelising the for-loops, or from storing the matrix RM on different processors when it gets large, say 40,000 × 40,000 or so?

julia> include("filamentstokesserial.jl"); @btime filamentefficiency(2^8,20,[1.00, -0.01, 0.01, 0.005, 0.015, -0.008],0)
  42.376 ms (56 allocations: 4.07 MiB)
0.009152331502935747
using FastGaussQuadrature,
    Plots, LinearAlgebra, BenchmarkTools, TimerOutputs, StaticArrays

function sgf_3d_fs(x, y, z, x0, y0, z0)
    dx = x - x0
    dy = y - y0
    dz = z - z0
    dxx = dx^2
    dxy = dx * dy
    dxz = dx * dz
    dyy = dy^2
    dyz = dy * dz
    dzz = dz^2

    r = sqrt(dxx + dyy + dzz)
    r3 = r * r * r
    ri = 1.0 / r
    ri3 = 1.0 / r3

    g11 = ri + dxx * ri3
    g12 = dxy * ri3
    g13 = dxz * ri3

    g21 = g12
    g22 = ri + dyy * ri3
    g23 = dyz * ri3

    g31 = g13
    g32 = g23
    g33 = ri + dzz * ri3

    GEk = @SMatrix [
        g11 g12 g13
        g21 g22 g23
        g31 g32 g33
    ]
    return GEk
end

function inner_loop(xc, yc, tcx, tcy, sc, x1, y1, x2, y2, s1, s2, xg, wg, ng)
    # x1,y1,x2,y2 are coordinates of element j with nodes at j and j+1
    # xc, yc are the coordinates of collocation point
    GE = zeros(SMatrix{3,3,Float64})
    GEself = zeros(SMatrix{3,3,Float64})
    d = 0.001 # regularisation parameter
    @inbounds @simd for k = 1:ng
        xi = xg[k]
        xint = 0.5 * ((x2 + x1) + (x2 - x1) * xi)
        yint = 0.5 * ((y2 + y1) + (y2 - y1) * xi)
        sint = 0.5 * ((s2 + s1) + (s2 - s1) * xi)
        hl = 0.5 * sqrt((x2 - x1)^2 + (y2 - y1)^2)
        # Call Green's function for Stokes equation
        GEk = sgf_3d_fs(xint, yint, 0, xc, yc, 0) # two dimensional problem: z=0
        # REDUCE ME
        GE += GEk * hl * wg[k]
        ds = abs(sc - sint)
        dss = sqrt(ds^2 + d^2)
        Gs11 = ((1.0 + tcx * tcx) / dss) * hl * wg[k]
        Gs12 = ((0.0 + tcx * tcy) / dss) * hl * wg[k]
        Gs21 = ((0.0 + tcy * tcx) / dss) * hl * wg[k]
        Gs22 = ((1.0 + tcy * tcy) / dss) * hl * wg[k]
        # REDUCE ME
        GEself += @SMatrix [
            Gs11 Gs12 0
            Gs21 Gs22 0
            0 0 0
        ]
    end
    return GE, GEself
end

function outer_middle_loops(
    xm,
    ym,
    tmx,
    tmy,
    sm,
    vm,
    ls,
    s,
    c,
    x,
    y,
    xg,
    wg,
    ng,
    nx,
)
    # Initialize the matrix and right hand side
    RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
    rhs = zeros(2 * (nx - 1) + 1, 1)

    for j = 1:nx-1 # looping over columns first
        j1 = (j - 1) * 2 + 1
        j2 = (j - 1) * 2 + 2

        tcx = tmx[j]
        tcy = tmy[j]

        # Local operator terms( \Lambda [f_h] )
        # REDUCE ME
        RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
        RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
        RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
        RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π

        rhs[j1] = 0
        rhs[j2] = vm[j]
        RM[2*(nx-1)+1, j1] = ls[j]
        RM[j1, 2*(nx-1)+1] = 1

        for i = 1:nx-1
            i1 = (i - 1) * 2 + 1
            i2 = (i - 1) * 2 + 2
            xc = xm[i]
            yc = ym[i]
            tcx = tmx[i]
            tcy = tmy[i]
            sc = sm[i]
            # Single element k-Loop Integration using quadratures
            GE, GEself = inner_loop(
                xc,
                yc,
                tcx,
                tcy,
                sc,
                x[j],
                y[j],
                x[j+1],
                y[j+1],
                s[j],
                s[j+1],
                xg,
                wg,
                ng,
            )

            # Non-local operator terms( K [f_h] )
            # REDUCE ME
            RM[i1, j1] += GE[1, 1] / 8π
            RM[i1, j2] += GE[1, 2] / 8π
            RM[i2, j1] += GE[2, 1] / 8π
            RM[i2, j2] += GE[2, 2] / 8π

            RM[i1, i1] += -GEself[1, 1] / 8π
            RM[i1, i2] += -GEself[1, 2] / 8π
            RM[i2, i1] += -GEself[2, 1] / 8π
            RM[i2, i2] += -GEself[2, 2] / 8π
        end
    end
    return RM, rhs
end

function filamentefficiency(nx, ng, A, t)
    # Empty grad vector
    # Specify the number of nodes
    # nx = 101
    # Note that the number of elements = nx-1
    # How many points needed for integration using Gauss Quadrature?
    # ng = 20
    # Specify the odd-numbered modes
    # A = [1.0045792982170993, -0.009331896288504353, 0.011941839033552903, 0.004696343389728706, 0.015159353628757498, -0.0017903083000960515, -0.002894232919948337, -0.020512982818940012, 0.002637764690195105, -0.004037513173757581, -0.0018003802187872722, 3.678005969005036e-5, -0.0014290163838325764, 0.006142381043574873, -0.004210790695122118, 0.011885762061040916, -0.010245776149711871, -0.0023350247253041455, 0.9906425306905926, 0.0064477708652553225]
    # time
    # t = 0
    # Points and weights "using FastGaussQuadrature"
    xg, wg = gausslegendre(ng)
    # Generate a sin curve
    # x and y are nodes while x_m and y_m are mid points of an element
    x = LinRange(-π, π, nx)
    nm = 1:1:length(A)
    y = (sin.((x .- t) .* (2nm' .- 1))) * A #(Only odd numbered modes)
    # Mid-points (collocation points) and tangents at the mid points.
    xm = 0.5 * (x[2:nx] .+ x[1:nx-1])
    ym = 0.5 * (y[2:nx] .+ y[1:nx-1])
    ls = sqrt.((x[2:nx] .- x[1:nx-1]) .^ 2 .+ (y[2:nx] .- y[1:nx-1]) .^ 2) # length of each element
    tmx = (x[2:nx] .- x[1:nx-1]) ./ ls
    tmy = (y[2:nx] .- y[1:nx-1]) ./ ls
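    # sanity check: tangent vectors should have unit length (result is unused)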
    sqrt.(tmx .^ 2 + tmy .^ 2)
    # Arc length of node points
    s = zeros(nx, 1)
    for i = 1:nx-1
        s[i+1] = s[i] + ls[i]
    end
    # Arc length of mid points
    sm = 0.5 * (s[2:nx] .+ s[1:nx-1])
    # Filament velocity at mid points (forcing for the equations)
    vm = (-(2nm' .- 1) .* cos.((xm .- t) .* (2nm' .- 1))) * A
    ρ = 0.001
    c = 2 * log(ρ / s[nx]) + 1
    RM, rhs = outer_middle_loops(
        xm,
        ym,
        tmx,
        tmy,
        sm,
        vm,
        ls,
        s,
        c,
        x,
        y,
        xg,
        wg,
        ng,
        nx,
    )

    fh = RM \ rhs
    Effc = -(fh[2*(nx-1)+1])^2 / (sum(fh[2:2:2*(nx-1)] .* ls .* vm))
    return Effc
    # println("The x-velocity is ", fh[2*(nx-1)+1])
    # display(plot(x,y, aspect_ratio=:equal))
    #display(quiver!(xm,ym,quiver=(tmx,tmy), aspect_ratio=:equal))
end

Did you look at JuliaParallel/Elemental.jl on GitHub, the Julia interface to the Elemental linear algebra library?


(Hi Uwe) Do they show how to write a parallel code for a PDE? I could not find one, sorry.

Are you saying I will not benefit from parallelising the for-loops, or from storing the matrix RM on different processors when it gets large, say 40,000 × 40,000 or so?

I don’t know if this helps, but you may want to look at PencilArrays.jl for this. Note that it works on top of MPI via MPI.jl. For the linear algebra, I guess you could use Elemental.jl as others have suggested.


Not very much. When your system is large, the O(n^2) assembly work is dwarfed by the O(n^3) dense solve. Parallelize the for loops if you need to, e.g. because the memory has to be distributed, but if you profile you will likely see that most of the time is spent in \, so swapping in Elemental.jl, or something else that provides a distributed/parallel \, is the next step.
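If you do want to thread the assembly as well, one option is to thread the inner loop over collocation points: for a fixed element j, each i writes only to its own pair of rows of RM (the (i1, i2) × (j1, j2) block and the self-interaction block at (i1, i1)), so no two tasks touch the same entry. A rough, untested sketch, reusing the inner_loop from your post and only renaming the assembly function:

function outer_middle_loops_threaded(xm, ym, tmx, tmy, sm, vm, ls, s, c, x, y, xg, wg, ng, nx)
    RM = zeros(2 * (nx - 1) + 1, 2 * (nx - 1) + 1)
    rhs = zeros(2 * (nx - 1) + 1, 1)
    for j = 1:nx-1
        j1 = (j - 1) * 2 + 1
        j2 = (j - 1) * 2 + 2
        tcx = tmx[j]
        tcy = tmy[j]
        # Local operator terms, exactly as in the serial version
        RM[j1, j1] += -(1 * (2 - c) - tcx * tcx * (c + 2)) / 8π
        RM[j1, j2] += -(0 * (2 - c) - tcx * tcy * (c + 2)) / 8π
        RM[j2, j1] += -(0 * (2 - c) - tcy * tcx * (c + 2)) / 8π
        RM[j2, j2] += -(1 * (2 - c) - tcy * tcy * (c + 2)) / 8π
        rhs[j1] = 0
        rhs[j2] = vm[j]
        RM[2*(nx-1)+1, j1] = ls[j]
        RM[j1, 2*(nx-1)+1] = 1
        # Threading over i is race-free here: for this fixed j, each i writes
        # only rows i1 and i2 of RM, so tasks never collide.
        Threads.@threads for i = 1:nx-1
            i1 = (i - 1) * 2 + 1
            i2 = (i - 1) * 2 + 2
            GE, GEself = inner_loop(xm[i], ym[i], tmx[i], tmy[i], sm[i],
                                    x[j], y[j], x[j+1], y[j+1], s[j], s[j+1],
                                    xg, wg, ng)
            RM[i1, j1] += GE[1, 1] / 8π
            RM[i1, j2] += GE[1, 2] / 8π
            RM[i2, j1] += GE[2, 1] / 8π
            RM[i2, j2] += GE[2, 2] / 8π
            RM[i1, i1] += -GEself[1, 1] / 8π
            RM[i1, i2] += -GEself[1, 2] / 8π
            RM[i2, i1] += -GEself[2, 1] / 8π
            RM[i2, i2] += -GEself[2, 2] / 8π
        end
    end
    return RM, rhs
end

Before going down that road, though, time the assembly and the solve separately (e.g. @btime on outer_middle_loops and on RM \ rhs): at 40,000 × 40,000 the dense factorization will almost certainly dominate.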


(Hi Juan) Thanks for making me aware of https://github.com/jipolanco/PencilArrays.jl/. It looks promising and very close to what I have in mind. Can it handle unstructured grids like in finite or boundary elements?

Also, are there any finite element libraries in Julia that use any kind of parallelism?

Can it handle unstructured grids like in finite or boundary elements?

Unfortunately, I don’t think so. It was really made with structured grids in mind. It was initially conceived for spectral methods, but it should also be useful for other structured methods.

Also, are there any finite element libraries in Julia that use any kind of parallelism?

I’m not sure, but you can look at this list of Julia PDE packages to get an idea. In particular, there is a section dedicated to FEM packages.
