Is Intel’s ParallelAccelerator.jl still maintained?

question
package
parallel

#1

Just a side question, is ParallelAccelerator.jl still under maintenance?


#2

I just checked the GitHub profiles of the top two contributors to this package (ehsantn and DrTodd13), and found that most of their recent commits are related to HPAT and Numba.


#3

HPAT.jl seems to be deprecated

I just wonder why this project switched from Julia to Python…


#4

@Yifan_Liu I guess it is because Python is more widely used, rather than because of the technology, which makes it easier to attract users to the package. I have used Python for many years. Compared with Python, I am satisfied with Julia, especially its clean code. High performance is another reason, although in my daily work I do not notice it most of the time, because the computational load is usually not that high.

1. JIT time sometimes really hurts the user experience

Scenario 1: For daily work, I almost never compile a function first and then run it again. For example, suppose some task takes 0.5 seconds in Python, 1.5 seconds in Julia including compilation, and 0.1 seconds in Julia once compiled. In this situation I would never add a line of code and rerun just to gain 0.4 seconds. So for most daily work I get almost no speedup, although I could by adding some extra code and running it more than once.

Scenario 2: Importing a large package sometimes takes longer than in Python when the package needs to be compiled. The user thinks, “Oh, I cannot feel the high speed”. Loading is now extremely fast compared with Julia 0.3 and 0.4, and I am satisfied with the current loading speed.

Scenario 3: Users rely on Python packages such as NumPy that are written in C/C++, and these can be even faster than the Julia version. The user just feels that Python is very fast. Maybe: good language + excellent packages + more users = success. In my opinion, Julia is already a good language; more packages and users will need some time. I remember that in 2014 pandas was very slow, maybe even slower than the current DataFrames.jl. But once it had more users and many smart people contributing code, its performance improved dramatically.

2. Just from my own experience, Julia does very well in the following areas compared with Python.

(1) Julia’s JIT is much, much better than Python’s
In Python, when I add @jit(nopython=True) to enable the Numba JIT, it is very easy to get errors. Other Python JIT projects, such as Microsoft/Pyjion and dropbox/pyston, are almost abandoned now; PyPy is also not widely used.
(2) Julia’s support for GPU and TPU looks very promising (Keno, TPU; Paper) :grinning: Keno is a superman!
(3) Julia really does deliver a speedup, especially when the code contains a large number of iterations.
(4) Julia is based on LLVM, so many extensions are possible, such as Julia in the Browser.
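
To make Scenario 1 concrete, here is a minimal sketch of how one might see the compile-time cost on the first call. The function `sumsq` and the problem size are made up for illustration, and the actual timings depend on the machine and Julia version.

```julia
# Hypothetical workload: a plain loop, the kind of code where Julia's JIT pays off.
function sumsq(n)
    s = 0.0
    for i in 1:n
        s += i^2
    end
    return s
end

# The first call includes JIT compilation of sumsq(::Int);
# the second call reuses the compiled machine code.
t_first  = @elapsed sumsq(10^6)
t_second = @elapsed sumsq(10^6)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

For a one-off script the first number is the one the user actually experiences, which is the point of the scenario above.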


#5

Why would it be? Basically all of what ParallelAccelerator was doing is now part of Julia’s Base. It was great back in v0.4, v0.5, and a bit of v0.6, but the developers of the package have worked with the Julia compiler team to essentially eliminate the need for the macro and make it automatic. Its main features were fusing broadcasting operations and adding multithreading via its macro. At this point, Julia’s Base broadcasting does fusion automatically (in a customizable way), and multithreading will soon surpass ParallelAccelerator with the next-generation PARTR implementation. Things like @fastmath and automatic SIMD in Base have also improved, and the new IR has allowed more optimizations. So, what would ParallelAccelerator even do to Julia v1.0 code?
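
As a small illustration of the fusion point, the following sketch shows a dotted expression that Base compiles into a single loop without temporary arrays. The function name `axpb!` and the array size are assumptions made for this example only.

```julia
# Fused broadcast: the whole right-hand side is evaluated element by element
# and written straight into y, so no intermediate arrays are created.
function axpb!(y, a, x, b)
    @. y = a * x + b      # equivalent to y .= a .* x .+ b
    return y
end

x = rand(10^6)
y = similar(x)

axpb!(y, 2.0, x, 1.0)                     # warm-up call so compilation is not measured
bytes = @allocated axpb!(y, 2.0, x, 1.0)  # should be zero or close to it
println("allocated bytes: ", bytes)
```

Without fusion, `a .* x` would first allocate a full temporary array before the addition.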

Even some of the minor things that it did, like stencil calculations, have been superseded. Specifically, stencil calculations are exactly what’s used in convolutional neural networks and PDEs, so you have things like Flux.jl building multithreaded stencil tooling that is compatible with GPUArrays, and JuliaDiffEq building tools for easy construction of such stencils directly from PDE verbiage. So, what would be the point of ParallelAccelerator.jl’s CPU-only, non-stacking stencils given these newer developments?

That isn’t to say ParallelAccelerator.jl was a wasted development. It pioneered a lot of these approaches, quantified what was possible, and drove people to make it automatic. It was a great project that served its purpose. Long live the legacy of ParallelAccelerator.jl. I think the final nail would be to get benchmarks showing that standard Julia v1.0 code is close to ParallelAccelerator.jl’s optimized v0.5 results, showing how far we’ve come.


#6

Is there already a tutorial or example of PDE solvers with JuliaDiffEq?

I have implemented FEM-based transport (6D) and non-homogeneous diffusion (3-4D) solvers in Julia, and I would be very interested to see how I could use JuliaDiffEq for this kind of large simulation.


#7

We are focusing on finite differences in DiffEqOperators.jl; there is nothing for FEM (anymore). FDM discretizations are just simple stencils.
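
To show what "just simple stencils" means, here is a minimal hand-written sketch of the standard second-order central difference for u''(x) in 1D. It does not use DiffEqOperators.jl; the function name `second_derivative!` and the boundary treatment are assumptions for illustration.

```julia
# Second-order central difference for u''(x) on a uniform grid with spacing h.
# Only interior points are computed; the two boundary entries are set to zero
# as a placeholder, since boundary handling is outside this sketch.
function second_derivative!(du, u, h)
    n = length(u)
    @inbounds for i in 2:n-1
        du[i] = (u[i-1] - 2u[i] + u[i+1]) / h^2   # the [1, -2, 1] stencil
    end
    du[1] = zero(eltype(du))
    du[n] = zero(eltype(du))
    return du
end

x  = range(0, stop = 1, length = 101)
u  = sin.(π .* x)
du = second_derivative!(similar(u), u, step(x))

# For u = sin(πx) the exact second derivative is -π^2 * sin(πx).
err = maximum(abs.(du[2:end-1] .+ π^2 .* u[2:end-1]))
println("max interior error: ", err)
```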


#8

OK.
Is there any example for 3D PDEs based on stencils in JuliaDiffEq?


#9

No, not yet. That would use the lazy kron.
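
For readers wondering how kron enters here: a 3D finite-difference Laplacian can be written as a Kronecker sum of 1D operators. The sketch below uses the eager `kron` from the standard library, which materializes the full sparse matrix; it is not the lazy kron mentioned above, which as I understand it would apply the factors without forming the product. The helper names `lap1d` and `lap3d` are made up for this example.

```julia
using LinearAlgebra, SparseArrays

# 1D second-difference matrix on n interior points with spacing h
# (zero Dirichlet boundaries assumed).
lap1d(n, h) = spdiagm(-1 => fill(1.0, n - 1),
                       0 => fill(-2.0, n),
                       1 => fill(1.0, n - 1)) / h^2

# 3D Laplacian as a Kronecker sum: one term per spatial direction.
# With column-major storage, the rightmost factor acts on the first index.
function lap3d(n, h)
    D  = lap1d(n, h)
    Id = sparse(1.0I, n, n)
    return kron(kron(Id, Id), D) + kron(kron(Id, D), Id) + kron(kron(D, Id), Id)
end

n = 8
A = lap3d(n, 1 / (n + 1))
U = rand(n, n, n)
LU = reshape(A * vec(U), n, n, n)   # apply the operator to a 3D field
println(size(A), "  nnz = ", nnz(A))
```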


#10

Thanks


#11

I find that whole paragraph highly inaccurate (if my understanding of ParallelAccelerator.jl is correct). The core contribution of ParallelAccelerator.jl was to automatically detect code that could be parallelized and then rewrite it into a data-parallel version that uses threads. The point was that, as a user, you didn’t have to know anything about parallel programming. There is nothing like that in Base.

PARTR is a low-level thread scheduling solution (which is very cool), but at a completely different level of abstraction. As far as I understand it, it won’t help at all with automatically parallelizing your code. Something like ParallelAccelerator.jl could be built on top of it. I believe it makes more sense to compare PARTR with OpenMP, which was used in ParallelAccelerator to schedule work.


#12

Apparently it wasn’t clear, so let me make it explicit. Of course there is a difference, but making up the difference is now trivial. What ParallelAccelerator truly did was look at array expressions, realize when it could fuse them, and then in its compiler/transpiler phase build fused multithreaded expressions in C++ + OpenMP and replace the Julia code with a call to that library. That is the core of ParallelAccelerator, and it was magic in previous Julia versions.

However, it’s obvious how to solve that in pure Julia now. Create an array wrapper type MultithreadedArray with an ArrayStyle to hijack broadcast, and use Julia’s built-in multithreading in the definition of the broadcast copyto!. Standard Julia performs the fusion, and now you have the operations multithreaded. That shouldn’t take more than 100 LoC, and when PARTR comes out it will automatically layer the multithreading in a nice manner. Tie a bow on it by making a @dott macro that just wraps everything in the wrapper type. If you want to complete the transition, make the wrapper replace map calls with tmap (KISSThreading.jl) and pmap, and transform comprehensions into a tmap. That would make it equivalent to @acc, just without parallel reductions, and only because parallel reductions aren’t in Base (or KISSThreading) yet.
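
The wrapper idea can be sketched roughly as follows. This is only one possible reading of the description above, not code from ParallelAccelerator or Base: the type name follows the post, but every method body is an assumption, and the @dott macro and the map/tmap rewriting are left out.

```julia
using Base.Threads
import Base.Broadcast: Broadcasted, ArrayStyle

# Hypothetical wrapper: marks an array whose broadcasts should run multithreaded.
struct MultithreadedArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end

# Minimal AbstractArray interface, forwarding to the wrapped array.
Base.size(A::MultithreadedArray) = size(A.data)
Base.getindex(A::MultithreadedArray, i::Int...) = A.data[i...]
Base.setindex!(A::MultithreadedArray, v, i::Int...) = (A.data[i...] = v)
Base.IndexStyle(::Type{<:MultithreadedArray{T,N,A}}) where {T,N,A} = IndexStyle(A)

# Custom broadcast style: any dotted expression involving the wrapper gets this style.
Base.BroadcastStyle(::Type{<:MultithreadedArray}) = ArrayStyle{MultithreadedArray}()

# Output allocation for out-of-place broadcasts.
Base.similar(bc::Broadcasted{ArrayStyle{MultithreadedArray}}, ::Type{T}) where {T} =
    MultithreadedArray(similar(Array{T}, axes(bc)))

# The fused expression arrives here as one lazy Broadcasted object;
# evaluate it element by element, splitting the index range across threads.
function Base.copyto!(dest::MultithreadedArray,
                      bc::Broadcasted{ArrayStyle{MultithreadedArray}})
    axes(dest) == axes(bc) || throw(DimensionMismatch("destination and broadcast axes differ"))
    inds = CartesianIndices(axes(bc))
    @threads for k in 1:length(inds)
        I = inds[k]
        @inbounds dest.data[I] = bc[I]
    end
    return dest
end

x = MultithreadedArray(collect(1.0:1000.0))
y = MultithreadedArray(fill(2.0, 1000))
z = sin.(x) .+ 2 .* y        # fused by Base, evaluated by the threaded copyto! above
println(typeof(z), " computed on up to ", nthreads(), " thread(s)")
```

Start Julia with more than one thread (for example by setting JULIA_NUM_THREADS) to see any parallel effect; with a single thread the code still runs, just serially.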

So, would you prefer ~200 lines of pure canonical Julia code that implement a broadcast overload and a find/replace macro using it, or a macro that transpiles to C++, compiles the C++, and links the result in? I don’t see a need for transpilation here at all anymore, especially because it put a hard limit on the kinds of code @acc could be applied to. And there’s a good chance this macro will even be in Base: https://github.com/JuliaLang/julia/issues/19777 . This is what I mean when I say it was obsoleted by broadcast overloading and Base multithreading.


#13

There’s a difference between “will be obsoleted at some point because the infrastructure is there and it is now easy to do in a much nicer way” and “just use X”, though.

Regarding MultithreadedArray, shouldn’t that role be taken up by SharedArray?


#14

No, SharedArray is a concept with non-shared memory, having an array that is doing the transfers implicitly for you. MultithreadedArray is a different idea.


#15

Just to clarify, there is no need for a dedicated MultithreadedArray to get multithreading for array operations. Once PARTR is in, one can build upon that a runtime system that schedules work on different threads. We might still require some macro to indicate that certain loops can be threaded (and are thread-safe), but the actual mapping should be done by a global scheduler that controls the overall resources.
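
A macro of that kind already exists in Base as `Threads.@threads`. The sketch below shows the division of labour described above: the programmer only asserts that the iterations are independent, and the runtime decides which thread runs which part of the range. The function `scale_columns!` is a made-up example.

```julia
using Base.Threads

# Each column is scaled independently, so iterations may run in any order
# and on any thread. Writing @threads is the programmer's promise of that.
function scale_columns!(A, s)
    @threads for j in axes(A, 2)
        for i in axes(A, 1)
            @inbounds A[i, j] *= s[j]
        end
    end
    return A
end

A = ones(3, 4)
scale_columns!(A, [1.0, 2.0, 3.0, 4.0])
println(A)
```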

SharedArrays are an orthogonal concept and useful for inter-process data exchange.


#16

Sorry about the confusion: I usually think in terms of distributed=MPI, shared=OpenMP, so it seemed natural that SharedArrays did that.

+1 for automatic parallelization of dot calls (preferably without a macro), that’d be awesome.


#17

Oh wow, that’s even better than I thought. I thought that someone would have to explicitly add in tasks for it?


#18

Have a look at Kiran’s talk here:
https://www.youtube.com/watch?v=YdiZa0Y3F3c
How it will finally look is certainly not fully clear, but I would imagine a system where the multithreading is kind of semiautomatic. It can be automatic when using high-level operations like broadcast, but if you implement loops yourself it is close to unavoidable that you have to tell the compiler whether it is OK to execute a loop in arbitrary order. But one can just look at established systems like OpenMP and Cilk and borrow ideas from there. The key thing is to support nested parallelism and have some global resource management.
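
As a rough idea of what Cilk-style nested parallelism looks like in Julia, here is a sketch of a divide-and-conquer sum using `Threads.@spawn`. This assumes a Julia version that ships `Threads.@spawn` (1.3 or later, which is newer than this thread); the function name `psum` and the cutoff value are arbitrary choices for illustration.

```julia
using Base.Threads: @spawn

# Recursive parallel sum: each level spawns a task for the left half and
# handles the right half itself. Tasks may spawn further tasks (nested
# parallelism), and the scheduler maps them onto the available threads.
function psum(xs, lo = firstindex(xs), hi = lastindex(xs); cutoff = 10_000)
    if hi - lo < cutoff
        s = zero(eltype(xs))
        @inbounds for i in lo:hi
            s += xs[i]
        end
        return s
    end
    mid = (lo + hi) >>> 1
    left  = @spawn psum(xs, lo, mid; cutoff = cutoff)   # may run on another thread
    right = psum(xs, mid + 1, hi; cutoff = cutoff)
    return fetch(left) + right
end

total = psum(collect(1:100_000))
println(total)
```

This also happens to be a parallel reduction, the piece that the earlier post noted was missing at the time.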


#19

For multithreaded implementation of broadcasting, people may want to check out Strided.jl