Julia's applicable context is getting narrower over time?

Have you slapped an @avx macro on the inner parts?

Inner parts of libdeflate? I have only ported the decompression, and the hottest loop of that is decoding a variable-length code from bits. That is extremely sequential: you don't even know where the next codeword begins until you have decoded the previous one. Great as LoopVectorization.jl is, I'd be extremely impressed if it could do anything with that.
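To illustrate the problem (a toy sketch, not the libdeflate port; the three-symbol prefix code is made up): the start of codeword n+1 is only known after codeword n has been decoded, so the iterations form a serial dependency chain that SIMD can't break.

# Toy prefix code: 0 => :a, 10 => :b, 11 => :c (made-up table, for illustration only)
function toy_decode(bits::Vector{Bool})
    out = Symbol[]
    i = 1
    while i <= length(bits)
        if !bits[i]                           # codeword "0"
            push!(out, :a); i += 1
        elseif i < length(bits) && bits[i+1]  # codeword "11"
            push!(out, :c); i += 2
        else                                  # codeword "10"
            push!(out, :b); i += 2
        end
        # the next value of `i` depends on what was just decoded,
        # so iterations cannot run in parallel or be vectorized
    end
    return out
end

toy_decode(Bool[0, 1, 0, 1, 1, 0])  # => [:a, :b, :c, :a]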

On my machine (macOS) I get, for log:

5.065 ns (0 allocations: 0 bytes)
4.438 ns (0 allocations: 0 bytes)
4.188 ns (0 allocations: 0 bytes)

and for exp:

7.292 ns (0 allocations: 0 bytes)
4.805 ns (0 allocations: 0 bytes)
4.421 ns (0 allocations: 0 bytes)

Any idea why the Julia version appears to be slightly slower for me? I could understand why the system library functions are different/faster on macOS but LLVM?

History has shown that Julia's niche (and every other language's) only gets wider with time,
because it simply takes someone deciding "I am going to make a (e.g.) HTTP server, or an animation library, or a database client, or …" and then your niche is wider.
Basically nothing ever makes a niche narrower.
Even languages dying doesn't make their niche narrower, just more sparsely populated.
Which is a different thing, with a different set of concerns.

Python's niche is not going to substantially grow. It's basically as large as it can get: in every domain where it is technically feasible to use Python, there is already a library.
That niche is still becoming more populated (which, as I said, is a different thing).
But Python libraries, by which I mean C/C++ libraries, have been made to do everything.
Large C/C++ libraries are hard to extend, compared with Julia libraries.
For one, because C/C++ are harder languages; for two, because C/C++ doesn't compose anywhere near as well as Julia.
Last I checked, Python had over 5 API-identical but implementation-distinct implementations of Numpy (mostly for ML).
Because you can't substantially extend Numpy, or add a new feature that works with it, without forking and editing the C/C++, or writing a whole new C/C++ library.
While in Julia you really can.
Adding something like NamedDims.jl to Numpy was months of work for PyTorch, with thousands of lines of code changed throughout the codebase; the same in Julia took me 2 weeks.
Adding BlockBandedMatrices to Numpy would again require changing the library itself.
Most of the code in the Python equivalent of Measurements.jl (which defines a new scalar type) is not in defining the properties of the measurement, but in hooking into Numpy (they actually did manage to do it without reimplementing Numpy itself, though I don't think it will work with any of the other Numpy implementations like PyTorch).
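As a concrete (hedged) illustration of that composability — the specific operations below are just examples, not anything from the Python port:

# Measurements.jl defines one new scalar type; generic code written without
# any knowledge of it composes with it directly.
using Measurements

x = 5.2 ± 0.4                 # a Measurement, not a Float64
A = [1.0±0.1  2.0±0.2;
     3.0±0.3  4.0±0.4]

sin(x)    # scalar functions propagate the uncertainty
sum(A)    # generic reductions just work
A * A     # generic matrix multiply just works; no hooks into an array library needed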

Luckily for Python, it has a lot of users, so they can afford to pay the exponentially growing cost of adding new features.
But in Julia that cost isn't exponential.

22 Likes

There’s already a lot in this thread, but just to address the points directly:

Is this true?

No, neither the conclusion nor any of the assertions are true.

built-in functions of Python are written in C so they are faster than the Julia counterparts

I wonder exactly what the assertion here was; it's easy for arguments to get distorted. Julia code is competitive with C: sometimes a bit slower, increasingly often faster, sometimes much faster. But this is, in a sense, beside the point. More importantly, Python is interpreted and Julia is compiled. Most people who know a bit about programming know that compiled languages are generally faster than interpreted languages, even if the latter are written in C. You'd have to have a sharper conversation with these people to find out exactly what their misconceptions are.

For large packages, people can use C++ to get extreme performance with its black magic.

  1. Indeed some people have a sense that C++ (or sometimes Julia) is fast because of black magic. This feeling inhibits thinking clearly about languages and technologies.

  2. The size of the package is irrelevant. People combine compiled and interpreted languages in packages of all sizes.

  3. It’s much easier to get extreme performance from Julia than C++.

Julia is only preferable for mid-sized projects.

I’m not sure what’s behind this, but no, it’s not true.

Over time, this niche will get narrower because Python’s built-in functions and large packages both become more comprehensive

Both the Python and Julia ecosystems are being developed. But development in the Julia ecosystem is vastly more productive than in the Python ecosystem (hundreds of specific examples can be found on this forum and elsewhere), so the relative advantage of Julia will grow rather than shrink.

It’s good to have conversations with everyone about these topics. But, it shouldn’t be surprising that the great majority of opinions that you find in forums and conversations are ill-informed. In absolute numbers, there are many people who have extensive experience with both Python and Julia. If you really want to understand where these languages are going, I advise finding these people and talking with them.

16 Likes

A small test script comparing element-wise operations in Numpy and Julia (a rough sketch of the Julia side is shown after the list):

For array sizes up to 1,000 elements, Julia is always faster on my machine.
For 100,000 elements, the picture is mixed: Numpy is faster for addition and multiplication, whereas Julia is faster for exp and for multiplication + addition fused in one line.
For 1M elements, Julia is again faster for all tested operations.
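The original script isn't reproduced here, but the Julia side of such a comparison is roughly this kind of thing (the array sizes and operations are my assumptions, matching the list above):

using BenchmarkTools

for n in (1_000, 100_000, 1_000_000)
    x = rand(n); y = rand(n)
    @btime $x .+ $y         # element-wise addition
    @btime $x .* $y         # element-wise multiplication
    @btime exp.($x)         # element-wise exp
    @btime $x .* $y .+ $x   # multiplication + addition fused in one line
end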

3 Likes

So if you are testing the lower-level linear-algebra operations, then the particular BLAS library (OpenBLAS versus MKL) and its version may play a role.

3 Likes

I tested only element-wise operations, which to my knowledge do not use BLAS/MKL.
For matrix multiplication etc. you are correct; there Numpy is probably faster on Intel machines because it uses MKL by default (when installed via Conda).

3 Likes

Hmm, I think LLVM is just calling the system libraries, but with a faster calling convention, using just a jmp instead of a call:

# julia> @code_native syntax=:intel debuginfo=:none llvmexp(1.2)
        .text
        movabs  rax, offset exp
        jmp     rax
        nop     dword ptr [rax]

# julia> @code_native syntax=:intel debuginfo=:none cexp(1.2)
        .text
        push    rax
        movabs  rax, offset exp
        call    rax
        pop     rax
        ret
        nop

My earlier tests were on a 7900X (Intel Skylake-X CPU) running Clear Linux.
So it seems Apple has faster system log and exp than Clear Linux.

I just ran them on a 7980XE CPU running Arch Linux (same generation of CPU, just a different model with more cores):

julia> @btime log($(Ref(1.2))[])
  5.424 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime clog($(Ref(1.2))[])
  8.146 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime llvmlog($(Ref(1.2))[])
  7.690 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime exp($(Ref(1.2))[])
  5.154 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime cexp($(Ref(1.2))[])
  22.318 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime llvmexp($(Ref(1.2))[])
  22.315 ns (0 allocations: 0 bytes)
3.3201169227365472

Julia timings are consistent across all three computers and OSes, but Arch’s exp is over 4x slower than Apple’s?

I think Clear Linux beats Arch here because on Clear Linux glibc detects hardware features at startup so it can run hardware-specific optimized versions. I’m guessing – from those benchmarks – that Arch does not do this.

Also guessing that Apple does something similar, but apparently better optimized.

It’s also a little funny how slow this ccall is.

On Linux, folks can try

using Libdl
const LIBMVEC = find_library(["libmvec.so"], ["/usr/lib64/", "/usr/lib", "/lib/x86_64-linux-gnu"])
run(pipeline(`nm -D $LIBMVEC`, `grep exp`)) # check function names

On Apple, I’d look through the AppleAccelerate libraries to see what you can find.

I get:

julia> run(pipeline(`nm -D $LIBMVEC`, `grep exp`))
                 U exp@GLIBC_2.29
                 U expf@GLIBC_2.27
0000000000002b80 i _ZGVbN2v_exp@@GLIBC_2.22
0000000000002c80 i _ZGVbN4v_expf@@GLIBC_2.22
0000000000002bb0 T _ZGVcN4v_exp@@GLIBC_2.22
0000000000002cb0 T _ZGVcN8v_expf@@GLIBC_2.22
0000000000002c00 i _ZGVdN4v_exp@@GLIBC_2.22
0000000000002d00 i _ZGVdN8v_expf@@GLIBC_2.22
0000000000002d30 i _ZGVeN16v_expf@@GLIBC_2.22
0000000000002c30 i _ZGVeN8v_exp@@GLIBC_2.22
Base.ProcessChain(Base.Process[Process(`nm -D /usr/lib64/libmvec.so`, ProcessExited(0)), Process(`grep exp`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

_ZGVeN8v_exp is for 8x Float64 and requires AVX512, which this computer has.

const SIMDVec{W,T} = NTuple{W,Core.VecElement{T}}
vexp(x::SIMDVec{8,Float64}) = @ccall LIBMVEC._ZGVeN8v_exp(x::SIMDVec{8,Float64})::SIMDVec{8,Float64}
t = ntuple(_ -> randn(), Val(8))
vet = map(Core.VecElement, t);
exp.(t)
map(x -> x.value, vexp(vet))
@btime exp.($(Ref(t))[])
@btime vexp($(Ref(vet))[])

I get

julia> exp.(t)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)

julia> map(x -> x.value, vexp(vet))
(7.063262403134324, 0.41762910145245496, 1.4764646883598358, 0.48878815987898866, 0.15684534932764457, 7.226086027587759, 2.348530174106621, 5.409615070945328)

julia> @btime exp.($(Ref(t))[])
  44.465 ns (0 allocations: 0 bytes)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)

julia> @btime vexp($(Ref(vet))[])
  5.029 ns (0 allocations: 0 bytes)
(VecElement{Float64}(7.063262403134324), VecElement{Float64}(0.41762910145245496), VecElement{Float64}(1.4764646883598358), VecElement{Float64}(0.48878815987898866), VecElement{Float64}(0.15684534932764457), VecElement{Float64}(7.226086027587759), VecElement{Float64}(2.348530174106621), VecElement{Float64}(5.409615070945328))

So 5.029 ns for GLIBC to calculate 8 exps when I specifically call the AVX-512 version, but (on Arch Linux) by default it’ll call some slow generic version that takes over 4 times longer to calculate just a single exponential. Ha.

GLIBC can have implementations all it wants, but it doesn’t do any good if they don’t get used. =/
Julia using its own libraries provides some consistency, in particular helping performance for some folks (like those on Arch), and also making sure everyone's implementation is held to roughly the same accuracy standard.
Julia's exp is more accurate than the SIMD version from GLIBC, for example, but otherwise follows a similar implementation approach (which I described, and which @Oscar_Smith implemented and made more accurate based on that description, without ever looking at the GPL source).

EDIT:
On a different computer running Ubuntu (but with a different CPU, an i7-1165G7):

julia> @btime log($(Ref(1.2))[])
  3.863 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime clog($(Ref(1.2))[])
  6.578 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime llvmlog($(Ref(1.2))[])
  5.971 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime exp($(Ref(1.2))[])
  3.856 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime cexp($(Ref(1.2))[])
  18.564 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime llvmexp($(Ref(1.2))[])
  18.492 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> versioninfo()
Julia Version 1.7.0-DEV.526
Commit 6468dcb04e* (2021-02-13 02:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

Thankfully it sounds like all Linux distros (not just Clear) should start benefiting from similar optimizations fairly soon.

7 Likes

On my macOS (Intel) machine, everything in Julia 1.6-rc1 is faster than numpy 1.19.5 in Python 3.8.5.

1 Like

It’s impressive that it’s possible to beat optimized C libraries in certain cases with native Julia code, but focusing too much on such cases will often lead people astray. Benchmarks against optimized C code help establish Julia’s performance capabilities, but only represent a starting point for interesting work.

Fundamentally, the reason Julia exists is not to beat the performance of existing C libraries on existing problems. If all the high-performance code you will ever need has already been written, then you don’t have as much need for a new language. Julia is attractive for people who need to write new high-performance code to solve new problems, which don’t fit neatly into the boxes provided by existing numpy library functions.

It’s not that Julia has some secret sauce that allows it to beat C — it is just that its compiled performance is comparable to that of C (boiling down to the same LLVM backend), so depending on the programmer’s cleverness and time it will sometimes beat C libraries and sometimes not.

It is, however, vastly easier to write high-performance Julia code than comparable C code in most cases, because Julia is a higher level language with performant high-level abstractions. This can also make it easier to be clever, with tricks like metaprogramming that are tedious in C. And the code you write in Julia can at the same time be more flexible (type-generic and composable), allowing greater code re-use.
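For instance (a minimal illustration of the "type-generic" point, not taken from the post above), a single generic Julia method can be written once and reused across numeric types:

# Horner evaluation of a polynomial; coefficients given from highest to lowest power.
# Works for any number type supporting zero, * and + (Float64, BigFloat, dual numbers, ...).
function horner(x, coeffs)
    acc = zero(x)
    for c in coeffs
        acc = acc * x + c
    end
    return acc
end

horner(2.0, (5, 3, 1))        # Float64: 5x^2 + 3x + 1 at x = 2  ->  27.0
horner(big"2.0", (5, 3, 1))   # BigFloat, same code, no changes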

This is important for those of us who want to work on new problems and to go beyond endlessly recycling the solutions of the past.

71 Likes

Another observation is that Go has been very successful for writing high performance multithreaded server code. And guess what the only mainstream dynamic language with the same threading model is? Julia. Heck the other dynamic languages are mostly incapable of any actual threading let alone anything as fancy as Go and Julia. Now, of course, Julia’s implementation is nowhere near as mature or well-tuned for servers as Go’s is, but there are many reasons to believe that Julia will be increasingly popular for that kind of application among programmers who want easy, high performance concurrency but want to keep using a dynamic language.
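For a flavor of what that model looks like in Julia (a minimal sketch; the chunking strategy here is just for illustration), lightweight tasks are spawned onto a thread pool and awaited, much like goroutines:

# Run Julia with several threads, e.g. `julia -t 4`, for this to actually parallelize.
function sum_parallel(x)
    nchunks = Threads.nthreads()
    chunks = Iterators.partition(x, cld(length(x), nchunks))
    tasks = map(chunks) do chunk
        Threads.@spawn sum(chunk)   # lightweight task, scheduled on the thread pool
    end
    return sum(fetch, tasks)        # wait for all tasks and combine their results
end

sum_parallel(rand(1_000_000))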

21 Likes

Does that mean once an algorithm matures people will still ultimately rewrite it in C/C++ to get the best performance?

Rarely do you have a completely "figured out" subroutine that involves only primitive data types and has no part that users will ever want to customize.

Even tasks as "clearly defined" as what OpenBLAS does are not completely fixed, because one can have a special (sparse, Hermitian) matrix, and/or the matrix elements can be number-like but not IEEE floats. Compare that to a (hypothetical) BLAS written with Tullio, which would have much more flexibility.
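A hedged sketch of what that extra flexibility looks like (the function name and element type here are just illustrative): a Tullio-based kernel is written once and immediately works for element types an IEEE-float-only BLAS can't handle.

using Tullio

# One generic "GEMM-like" kernel; Tullio generates the loops (and can use
# LoopVectorization/threads for hardware float types).
mymul(A, B) = @tullio C[i, j] := A[i, k] * B[k, j]

A = rand(1:9, 4, 4) .// rand(1:9, 4, 4)   # exact Rational elements, not IEEE floats
B = rand(1:9, 4, 4) .// rand(1:9, 4, 4)
mymul(A, B) == A * B                      # matches the generic matmul, exactly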

So, technically yes, but that is nothing compared to the infinitely many things that still need to be customized, or haven't even been conceived of yet.

1 Like

I don't see why. One of the strengths of Julia is the macro system, which lets you get "under the hood" with minimal effort.
It would certainly be possible to do, in Julia, the high-performance optimizations that you might do in C.
Doesn't LoopVectorization.jl demonstrate that this is not only possible but is already being done?

2 Likes

To clarify that statement a little: it's not that Julia can't reach the performance of highly tuned C libraries, it's just that doing so takes a long time (although it probably takes even more time to optimize the C). The first draft of Julia code (if written by someone who knows how to use Julia) is generally competitive with non-tuned C, and Julia has easier performance-tuning tools than C (things like @avx, @simd, etc.). Then getting the last drops of performance out of either takes a lot of effort.
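As a small illustration of that kind of low-effort tuning (a sketch; exact speedups vary by hardware), one annotation on a plain loop is often all it takes:

using BenchmarkTools

# Plain summation loop; @inbounds drops bounds checks and @simd lets the
# compiler vectorize the reduction (similar in spirit to @avx from LoopVectorization.jl).
function mysum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(10_000)
@btime mysum($x)
@btime sum($x)   # compare against the Base implementation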

5 Likes

No. Steven pointed out that, because Julia uses the same LLVM compiler that compiles C code, it isn't generically faster than C; but the converse is also true: C isn't generically faster than Julia. In almost all situations I've encountered or cared enough to invest in, Julia = C for performance, and Julia is vastly more pleasant to code in. Yes, there are cases where you have to know how to write the code; parsing, as in CSV and libdeflate, is a tricky case for Julia to handle because it's intrinsically type-unstable. But the history of CSV parsing shows that with sufficient investment you can go from behind-the-competition to best-of-breed once Julia's features are fully exploited. And it's not like those other CSV libraries in other languages hadn't had a lot of time invested in them.

35 Likes

This would probably come down to a trade-off between time/cost and performance gain. If Julia could only get to half the performance of C/C++, then yes, it looks like you are going to do something in C/C++. If Julia gives you 98 to 99 percent of the performance, is it really worth it? Now, where that percentage cutoff is… that probably depends on the situation.

1 Like