What makes a language reach the "petaflop" mark?

When Julia was used in the Celeste project and reached the petaflop mark, that was, it seems, an important milestone for the language.

I never quite understood what that means in terms of the qualities of the language itself. I mean, if a problem is embarrassingly parallel (from reading some of the articles about that achievement, it seems this was more or less the case), and one has access to those petaflops of hardware, it seems that any language could do it.

Of course it would be a waste of resources to use that hardware for a slow serial implementation of anything, so to qualify to run on that hardware a language has to be fast at least for serial programs (and Julia clearly is).

But what other characteristics of the language are important on those very-high-end supercomputers? Is communication on that hardware fast enough that the performance of the message-passing interfaces really matters? (In other words, can parallel computations that rely on heavy communication between workers run effectively at that scale?)

Are the parallelization interfaces of Julia, or the packages that wrap more standard interfaces (MPI etc), up to the task?
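For concreteness, by those interfaces I mean things like MPI.jl, which wraps the system MPI library. A minimal sketch of the kind of communication pattern I have in mind (a toy global reduction, nothing tuned; run with e.g. mpiexec -n 4 julia script.jl):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# each rank computes a partial result, then all ranks combine them
partial = sum((rank + 1):nprocs:1_000_000)
total = MPI.Allreduce(partial, +, comm)

rank == 0 && println("sum over $nprocs ranks: $total")
MPI.Finalize()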

6 Likes

It seems so: “A Look at Communication-Intensive Performance in Julia” by Rizvi and Hale [[2109.14072] A Look at Communication-Intensive Performance in Julia]. On the other hand, as far as I understand it (I’m not an expert), it depends on the task and on the architecture. As far as I know, the performance of Julia on Fugaku [Fugaku (supercomputer) - Wikipedia] is not that great (I tried it first hand a few months ago, but there might have been some developments recently). There is also the question of the next big thing, Aurora [Aurora (supercomputer) - Wikipedia]. Julia might not be the most favorable language there either, at least not for now. It should be noted that both of those machines are, or will be, in or close to exaflops territory.

2 Likes

I think the main problem on A64FX is compilation latency, which isn’t limited to Julia (compiling anything on this CPU is excruciatingly slow), but a JIT language is hit particularly hard.

A couple of months ago I did a simple benchmark of a Julia implementation of AXPY:

function axpy!(a, x, y)
    @simd for i in eachindex(x, y)
        @inbounds y[i] = muladd(a, x[i], y[i])
    end
    return y
end

versus the vendor’s BLAS (using my little FujitsuBLAS.jl), and the comparison was quite favourable to Julia.
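For reference, the comparison was of this kind; a minimal sketch, not the exact script I used (BLAS.axpy! computes y .= a .* x .+ y, so the two are directly comparable; the axpy! defined above takes precedence over the one exported by LinearAlgebra):

using BenchmarkTools, LinearAlgebra

n = 2^24
a = 1.2
x = rand(Float64, n)
y = rand(Float64, n)

@btime axpy!($a, $x, $y)        # the Julia loop above
@btime BLAS.axpy!($a, $x, $y)   # the vendor BLAS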

But this function is memory-bound rather than compute-bound. There are, however, some LLVM-related issues with using 512-bit vectors (Creating a random array results in a segfault in LLVM on A64FX with 512-bit vectors · Issue #44263 · JuliaLang/julia · GitHub, `minimum` makes Julia crash on A64FX · Issue #44401 · JuliaLang/julia · GitHub).

Why do you think that’s the case?

3 Likes

Can we get Fujitsu a special BLAS written in Julia with StaticCompiler? That would be awesome!

2 Likes

I haven’t tried Octavian yet because LoopVectorization used to make Julia crash (probably another LLVM issue; it works better on Julia nightly: JULIA_LLVM_ARGS="--aarch64-sve-vector-bits-min=" LLVM SegFault creating ntuple of VecElement of passed vector length · Issue #43069 · JuliaLang/julia · GitHub). I think at some point I’ll try to run some benchmarks without 512-bit vectors; that mode is too broken to do anything with.

1 Like

Hi!

Yes, I remember our conversation on Slack from a few weeks ago. You were not so keen at the beginning [to put it mildly :-)]; however, I understood that there were some ongoing developments, and I mentioned that in my post with the hope that there were. Do you think that it is currently fully functional? I am genuinely interested, at least theoretically for now, and I’d like to underline the word genuinely.

As I understand it, Ponte Vecchio is crucial for this machine, and so, as I understand it, is Julia’s support for Level Zero. As I recall, I have asked questions on this topic during GPU meetups, during public presentations by people who are very knowledgeable on these topics, and in private correspondence. My understanding at the time was that Ponte Vecchio (Level Zero) was not fully supported. I should probably add that, again, there might have been some developments since then; my knowledge of this topic is a few weeks / months old.

Fully functional, not really. As I mentioned before, 512-bit vectors are problematic in several ways (it’s probably at least partially an LLVM issue), and that can be an important limitation, since we can’t take full advantage of the hardware. And, again, compilation latency is very long. I’d like to run some more benchmarks, with something less toy-like and more elaborate, but I haven’t had much time lately.
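For completeness: part of that latency can be paid up-front with a custom system image. A minimal sketch with PackageCompiler.jl; the package list and warmup.jl are placeholders for whatever the job actually needs, not something I have run on Fugaku:

using PackageCompiler

# bake the heavy dependencies into a custom sysimage once, so most of
# the JIT work is not repeated on every job
create_sysimage(["MPI"];
                sysimage_path = "sys_hpc.so",
                precompile_execution_file = "warmup.jl")

# afterwards: julia --sysimage=sys_hpc.so job.jl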

Yeah, I put it away for a moment as well. Apart from the technical issues, I know that there are some other constraints / requirements associated with doing computations on Fugaku. On the other hand, it is a very green machine that, as I currently understand, has some similarities / extensions to other interesting technologies … but hey … I have had a hobby project related to neural networks and quantum computing on my mind for some time. Would you be interested in briefly discussing those topics in slightly more detail?

To sum up my part:

@giordano Here is a link to a bit more info on this topic [Why Julia is very often related to a data science / economics language only? - #21 by j_u]. If you think this might be of interest to you, just let me know.

@lmiq > But what other characteristics of the language are important on those very-high-end supercomputers?
As I currently understand it, “those very-high-end supercomputers” are quite normal computers, and on most occasions are in reality pretty similar to the ones we use every day. As for Julia and its ecosystem, IMO, and not surprisingly, the easiest to use would be the ones built on x86 and Nvidia technologies. There is also a clearly visible trend worth mentioning, related to high-performance computations done in the cloud, called the “democratization of HPC”; some big cloud providers, but also JuliaHub, are prime examples. To try to answer your question more directly and with a smile: IMO, the main language characteristic would be that it just has to work there, on “this very-high-end supercomputer”, and there has to be a team of people interested in the topic and willing to implement the solution.

2 Likes

In terms of the CPU, memory, etc. of the individual compute nodes, that is a fair comparison. However, note that what makes a supercomputer, in general, is a fast low-latency interconnect (such as InfiniBand) and high-performance (distributed) storage, such as Lustre or GPFS. It’s not about the individual nodes, but about how efficiently they can work together on large problems.

6 Likes

I fully agree with you. In the case of those exascale platforms I mentioned, I believe there are technologies other than InfiniBand (in case of interest, please find some additional info here: [Supercomputer Fugaku Introduction - RIKEN - YouTube, The Cray Shasta Architecture - YouTube]). I just wanted to put the emphasis on the team of people who made the petaflop mark a reality with Julia: their huge work on the scientific problem, on instruction-level parallelism, and on other technicalities made it possible to take maximum advantage of those flops on Cori [National Energy Research Scientific Computing Center - Wikipedia].

1 Like

Meaning no serial portion (or a very small one?). The serial portion of your program always limits the parallel speedup, according to Amdahl’s law (and Gustafson’s law gets around it in a way, also for Celeste).

But you need a fast language also for the parallel part, because otherwise you’re throwing that much more hardware at the problem, which you do not want to do.

I believe even Python (and bash) is used on supercomputers, but I guess in a limited capacity, e.g. as a glue language, with the heavy lifting actually done by e.g. the C libraries it calls, so “using Python” is misleading, also for supercomputers. Since a good rule of thumb is that 10% of the code runs 90% of the time, only a small portion needs to be implemented in something other than Python. But the remaining 10% of the runtime can be a problem if it is serial: you’re then limited to a 10x parallel speedup, which isn’t great, because at that point you could just beat Python with Julia on a single-core machine.
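To spell out that arithmetic with Amdahl’s law: with a parallel fraction p of the runtime and N workers,

speedup(N) = 1 / ((1 − p) + p / N) → 1 / (1 − p) as N → ∞

so with 10% of the runtime serial (p = 0.9), the speedup is capped at 1 / 0.1 = 10x, no matter how many nodes you add.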

Having a good garbage collector (or some way around it). I recall reading something about Celeste and GC: the GC wasn’t optimal then, so improving it would help even more. I don’t think they used GPUs with Julia back then; if you did, the GC on them would matter too (GPUCompiler.jl supports GPU code generation, e.g. for CUDA).

Julia’s GC isn’t parallel, or at least wasn’t back then. There are some recent PRs regarding the Julia GC and I haven’t kept up, but I believe parallel collection is coming. Otherwise it’s stop-the-world when the GC kicks in. Note that this means the memory in your process; if you do distributed/MPI, then it’s only for one core and its memory space. Celeste was before the threading work in Julia [EDIT: experimental thread support came in Julia 0.5; I’m not sure if it was actually used in Celeste.jl, though I do see “enable pre-allocated thread-safe pool” in the code], if I recall correctly, and with threads it would apply to all the threads running in the address space of your process.

Interesting, I see there:

we followed Julia’s documented best practices for performance programming. This included making code typestable, eliminating use of any global variables, eliding bounds checks for known-length arrays (@inbounds), and carefully tuning core HPCG math kernels for performance bottlenecks, e.g. by instrumenting the garbage collector

How is the GC instrumented?
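For context, the only instrumentation I know of is what Base itself exposes; a minimal sketch of what I would try (Base.gc_num and Base.GC_Diff are the internals behind @timed, and GC.enable_logging needs Julia ≥ 1.8):

GC.enable_logging(true)                       # print a line for every collection

before = Base.gc_num()
workload = [rand(1_000) for _ in 1:10_000]    # stand-in for the real kernel
diff = Base.GC_Diff(Base.gc_num(), before)

println("GC time: ", diff.total_time / 1e9, " s")
println("allocated: ", diff.allocd, " bytes")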

I also see there a new language I hadn’t heard of, Regent, and some others mentioned for HPC, including Erlang.

2 Likes

I am very sorry. I am not the author of this paper; also, AFAIK, I have never had the pleasure of being in touch with the authors. I see that the topic of GC was discussed twice in the presentation at the last JuliaCon, “The State of Julia | JuliaCon 2021 | Stefan Karpinski, Viral Shah, Jeff Bezanson, & Keno Fischer” [https://www.youtube.com/watch?v=IlFVwabDh6Q]. It was mentioned that work is being carried out on “the GC state transitions which will allow the GC to run in parallel with other code” and that “a bunch of GC performance work” is on the list. I had a very helpful conversation with @vchuravy at the last JuliaHPC Monthly Call of 2021, so maybe he will be able to provide some additional info on this topic. I will also allow myself to ping @tkf with the hope that he may provide some insights and additional explanations, especially since this topic was started by our common Discourse friend. :laughing: :stuck_out_tongue_closed_eyes: :slight_smile: Of course, it’s an open question whether he decides to join … the exaflops territory team. :tada: :rocket: :juliaspinner:

1 Like

@Palli My assumption is that the lack of replies might indicate that the quoted paper is really good, also for Julia; however, this is nothing more than my assumption. This is not my area of expertise, and even though I have always liked computers, some of the topics covered there I understand only in general (and sometimes very general) terms, so I allowed myself to mention two people with whom I have had the pleasure of being somehow in touch here, and whose knowledge and opinions I value.

To sum up my part (again): my interest in this particular topic is related to a project that I am currently taking into consideration. I wrote some info about it here: [Julia's biggest success[es] so far? - #8 by j_u]. I have just reached some milestones associated with very preliminary coding and with the project’s involvement in external programs. I rarely put this kind of info out publicly (if at all); however, in my judgement, at this stage the project looks really good. I have enjoyed the time spent at the forum so far, and I have recently seen a few threads on similar topics related to new undertakings, so I decided to write about it (also here) with the hope that it might be of interest to somebody. I would be happy to discuss it and potentially to extend the team.

As a side note, it’s always interesting to read a thread started by my (virtual) friend @lmiq; however, I have to admit that it is somewhat surprising and rather unusual for him to abandon it without notice at such an early stage … :- )

Hey, I have another out-of-the-blue question, not quite about MPI this time. :- ) I just read that Arm Compilers and Performance Libraries for HPC Developers [are] Now Available for Free. I have to admit that I do not fully understand how the BLAS trampoline works, or how difficult it is to make technologies such as FujitsuBLAS operational with Julia; however, I am wondering: do you think the Arm Performance Libraries could be useful, especially on the Neoverse platform?

EDIT 1:
I did some additional reading about: i) libblastrampoline [GitHub - JuliaLinearAlgebra/libblastrampoline: Using PLT trampolines to provide a BLAS and LAPACK demuxing library.], and watched some videos, particularly ii) the one by you and Mr. Elliot Saba [https://www.youtube.com/watch?v=t6hptekOR7s] and iii) the one by @Elrod [https://www.youtube.com/watch?v=KQ8nvlURX4M]. I also forked your repo to get a better understanding; however, some areas are still mysterious to me (as for coding, I am not as experienced as you). What I am planning to do now is: a) consult with Arm on whether such a trampoline connection is permitted by their legal agreements, and b) propose a topic for the JuliaHPC Monthly Call - for your famous Julia HPC Rookie Corner (the 5-minute slot at the end of the JuliaHPC Monthly Call session).
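If I understood the videos correctly, the runtime side boils down to something like the sketch below; the library path and name are my guesses (ArmPL ships several variants: LP64/ILP64, with and without OpenMP), so please treat this as illustrative only:

using LinearAlgebra

# point libblastrampoline at Arm Performance Libraries instead of the
# default OpenBLAS; the path below is hypothetical
BLAS.lbt_forward("/opt/arm/armpl/lib/libarmpl_lp64.so"; clear = true)

BLAS.get_config()               # check which backend is active now
LinearAlgebra.peakflops(4096)   # rough sanity check via a large matmul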

EDIT 2:
Hey @giordano, I got it up and running (armplblas.jl). About 150 lines of raw code, including preliminary examples. Would you find some time to take a look and maybe provide some comments? I’m hoping to make a short presentation at the JuliaHPC meetup and later to register a package (it would be my first). I would really appreciate some advice if possible - I promise not to take much of your time.

EDIT 3:
@giordano, I understand there are some serious constraints preventing you from expressing your opinion about my code explicitly. I will try to work on improving the code and the included testing examples, with the hope of presenting and briefly discussing the results at the next HPC meeting. I just wanted to quickly thank you, as your FujitsuBLAS code was extremely useful for understanding the subject.

2 Likes

Maybe it’s could answer somehow your question:
“A language reaches the petaflop mark when it can be used to process one quadrillion floating point operations per second. This requires a special type of processor, known as a fusion power architecture (FPA) processor. FPAs are designed to handle large amounts of data quickly and efficiently. They are often used in supercomputers and other high-performance computing applications. In order to achieve petaflop speeds, a language must be able to take advantage of the parallel processing capabilities of an FPA processor. This means that the language must be able to break down a problem into smaller pieces that can be processed independently. Furthermore, the language must be able to communicate with the various components of an FPA processor in order to coordinate their efforts. Finally, the language must be able to run on a variety of different types of hardware, including GPUs and FPGAs. languages that meet all of these criteria are said to be “petaflop languages.””

2 Likes

I have been reading about and testing (very briefly) some FPGAs; however, I have to admit that this is the first time I am hearing about a “fusion power architecture”, so I will try to learn more about it. My point, especially in the first part of this thread, was mostly related to: i) the human / team factor as a requirement for any language (any project) to reach a milestone such as the one discussed here, “the petaflop mark”, and ii) mentioning / raising interest in / potentially discussing the next milestone on the horizon, which seems to be “the exaflop mark”. I am not sure who you were addressing directly; however, it’s always interesting to take part in the discussions that take place at this forum. :- )

2 Likes