Julia's applicable context is getting narrower over time?

There’s already a lot in this thread, but just to address the points directly:

Is this true?

No, neither the conclusion nor any of the assertions are true.

built-in functions of Python are written in C so they are faster than the Julia counterparts

I wonder exactly what the assertion here was; it’s easy for arguments to get distorted. Julia code is competitive with C: sometimes a bit slower, increasingly often faster, sometimes much faster. But this is in a sense beside the point. More importantly, Python is interpreted and Julia is compiled. Most people who know a bit about programming know that compiled languages are generally faster than interpreted languages, even if the latter’s built-ins are written in C. You’d have to have a sharper conversation with these people to find out exactly what their misconceptions are.

For large packages, people can use C++ to get extreme performance with its black magic.

  1. Indeed some people have a sense that C++ (or sometimes Julia) is fast because of black magic. This feeling inhibits thinking clearly about languages and technologies.

  2. The size of the package is irrelevant. People combine compiled and interpreted languages in packages of all sizes.

  3. It’s much easier to get extreme performance from Julia than C++.

Julia is only preferable for mid-sized projects.

I’m not sure what’s behind this, but no, it’s not true.

Over time, this niche will get narrower because Python’s built-in functions and large packages both become more comprehensive

Both the Python and Julia ecosystems are being developed. But the Julia ecosystem is vastly more productive than the Python ecosystem (hundreds of specific examples can be found on this forum and elsewhere), so the relative advantage of Julia will grow rather than shrink.

It’s good to have conversations with everyone about these topics. But, it shouldn’t be surprising that the great majority of opinions that you find in forums and conversations are ill-informed. In absolute numbers, there are many people who have extensive experience with both Python and Julia. If you really want to understand where these languages are going, I advise finding these people and talking with them.

16 Likes


A small test script comparing element-wise operations in Numpy and Julia:
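A minimal sketch of the Julia side (BenchmarkTools assumed; the original script may have differed in its details):

using BenchmarkTools

for n in (1_000, 100_000, 1_000_000)
    a, b = rand(n), rand(n)
    println("n = $n")
    @btime $a .+ $b          # element-wise addition
    @btime $a .* $b          # element-wise multiplication
    @btime exp.($a)          # element-wise exp
    @btime $a .* $b .+ $a    # multiplication + addition in one line
end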

For array sizes up to 1,000 elements, Julia is always faster on my machine.
For 100,000 elements, the picture is mixed: Numpy is faster for additions and multiplications, whereas Julia is faster for exp and multiplication + addition in one line.
For 1,000,000 elements, Julia is again faster for all tested operations.

3 Likes

So if you are testing the lower-level linear-algebra operations, then the particular BLAS library (OpenBLAS versus MKL) and its version may play a role.

3 Likes

I tested only element-wise operations, which to my knowledge do not use BLAS/MKL.
For matrix multiplication, etc., you are correct: there Numpy is probably faster on Intel machines, because it uses MKL by default (when installed via Conda).
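As a quick way to check which BLAS backend Julia itself is linked against (the Julia 1.6-era API; later versions use LinearAlgebra.BLAS.get_config() instead):

using LinearAlgebra
LinearAlgebra.BLAS.vendor()   # typically :openblas64; :mkl when built against MKL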

3 Likes


Hmm, I think LLVM is just calling the system libraries, but with a cheaper call sequence, using a tail-call jmp instead of a full call:

# julia> @code_native syntax=:intel debuginfo=:none llvmexp(1.2)
        .text
        movabs  rax, offset exp         # load the address of libm's exp
        jmp     rax                     # tail call: jump straight into exp
        nop     dword ptr [rax]         # alignment padding

# julia> @code_native syntax=:intel debuginfo=:none cexp(1.2)
        .text
        push    rax                     # realign the stack
        movabs  rax, offset exp         # load the address of libm's exp
        call    rax                     # full call: pushes a return address
        pop     rax                     # undo the stack adjustment
        ret
        nop                             # alignment padding
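For reference, the definitions being benchmarked here aren’t shown in the thread; they were presumably along these lines (a sketch, calling libm via ccall and the LLVM intrinsics directly):

using BenchmarkTools   # provides @btime, used below

cexp(x::Float64) = ccall(:exp, Float64, (Float64,), x)     # plain ccall into libm
clog(x::Float64) = ccall(:log, Float64, (Float64,), x)
llvmexp(x::Float64) = ccall("llvm.exp.f64", llvmcall, Float64, (Float64,), x)   # LLVM intrinsic
llvmlog(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)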

My earlier tests were on a 7900X (Intel Skylake-X CPU) running Clear Linux.
So it seems Apple has faster system log and exp than Clear Linux.

I just ran them on a 7980XE CPU running Arch Linux (same generation of CPU, just a different model with more cores):

julia> @btime log($(Ref(1.2))[])
  5.424 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime clog($(Ref(1.2))[])
  8.146 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime llvmlog($(Ref(1.2))[])
  7.690 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime exp($(Ref(1.2))[])
  5.154 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime cexp($(Ref(1.2))[])
  22.318 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime llvmexp($(Ref(1.2))[])
  22.315 ns (0 allocations: 0 bytes)
3.3201169227365472

Julia timings are consistent across all three computers and OSes, but Arch’s exp is over 4x slower than Apple’s?

I think Clear Linux beats Arch here because on Clear Linux glibc detects hardware features at startup so it can run hardware-specific optimized versions. I’m guessing – from those benchmarks – that Arch does not do this.

Also guessing that Apple does something similar, but apparently better optimized.

It’s also a little funny how slow this ccall is.

On Linux, folks can try

using Libdl
const LIBMVEC = find_library(["libmvec.so"], ["/usr/lib64/", "/usr/lib", "/lib/x86_64-linux-gnu"])
run(pipeline(`nm -D $LIBMVEC`, `grep exp`)) # check function names

On Apple, I’d look through the AppleAccelerate libraries to see what you can find.

I get:

julia> run(pipeline(`nm -D $LIBMVEC`, `grep exp`))
                 U exp@GLIBC_2.29
                 U expf@GLIBC_2.27
0000000000002b80 i _ZGVbN2v_exp@@GLIBC_2.22
0000000000002c80 i _ZGVbN4v_expf@@GLIBC_2.22
0000000000002bb0 T _ZGVcN4v_exp@@GLIBC_2.22
0000000000002cb0 T _ZGVcN8v_expf@@GLIBC_2.22
0000000000002c00 i _ZGVdN4v_exp@@GLIBC_2.22
0000000000002d00 i _ZGVdN8v_expf@@GLIBC_2.22
0000000000002d30 i _ZGVeN16v_expf@@GLIBC_2.22
0000000000002c30 i _ZGVeN8v_exp@@GLIBC_2.22
Base.ProcessChain(Base.Process[Process(`nm -D /usr/lib64/libmvec.so`, ProcessExited(0)), Process(`grep exp`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())

_ZGVeN8v_exp is for 8x Float64 and requires AVX512, which this computer has (in glibc’s vector-ABI name mangling, the e marks the AVX-512 variant and N8 an unmasked 8-lane vector).

const SIMDVec{W,T} = NTuple{W,Core.VecElement{T}}   # maps to an LLVM vector type
# call glibc's 8-wide AVX-512 vector exp directly:
vexp(x::SIMDVec{8,Float64}) = @ccall LIBMVEC._ZGVeN8v_exp(x::SIMDVec{8,Float64})::SIMDVec{8,Float64}
t = ntuple(_ -> randn(), Val(8))   # 8 random Float64s
vet = map(Core.VecElement, t);     # wrap them for the SIMD calling convention
exp.(t)                            # scalar exp, applied 8 times
map(x -> x.value, vexp(vet))       # one vectorized exp call, unwrapped
@btime exp.($(Ref(t))[])
@btime vexp($(Ref(vet))[])

I get

julia> exp.(t)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)

julia> map(x -> x.value, vexp(vet))
(7.063262403134324, 0.41762910145245496, 1.4764646883598358, 0.48878815987898866, 0.15684534932764457, 7.226086027587759, 2.348530174106621, 5.409615070945328)

julia> @btime exp.($(Ref(t))[])
  44.465 ns (0 allocations: 0 bytes)
(7.063262403134325, 0.4176291014524549, 1.4764646883598358, 0.48878815987898877, 0.15684534932764455, 7.226086027587759, 2.3485301741066205, 5.409615070945328)

julia> @btime vexp($(Ref(vet))[])
  5.029 ns (0 allocations: 0 bytes)
(VecElement{Float64}(7.063262403134324), VecElement{Float64}(0.41762910145245496), VecElement{Float64}(1.4764646883598358), VecElement{Float64}(0.48878815987898866), VecElement{Float64}(0.15684534932764457), VecElement{Float64}(7.226086027587759), VecElement{Float64}(2.348530174106621), VecElement{Float64}(5.409615070945328))

So 5.029 ns for GLIBC to calculate 8 exps when I specifically call the AVX-512 version, but (on Arch Linux) by default it’ll call some slow generic version that takes over 4 times longer to calculate just a single exponential. Ha.

GLIBC can have fast implementations all it wants, but it doesn’t do any good if they don’t get used. =/
Julia using its own libraries provides some consistency, in particular helping performance for some folks (like those on Arch), and also making sure everyone’s implementations are held to roughly the same accuracy standard.
Julia’s exp is more accurate than the SIMD version from GLIBC, for example, but otherwise follows a similar implementation approach (which I described, and @Oscar_Smith implemented and then made more accurate based on the description, without ever looking at the GPL source).

EDIT:
On a different computer running Ubuntu (but with a different CPU, an i7-1165G7):

julia> @btime log($(Ref(1.2))[])
  3.863 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime clog($(Ref(1.2))[])
  6.578 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime llvmlog($(Ref(1.2))[])
  5.971 ns (0 allocations: 0 bytes)
0.1823215567939546

julia> @btime exp($(Ref(1.2))[])
  3.856 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime cexp($(Ref(1.2))[])
  18.564 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> @btime llvmexp($(Ref(1.2))[])
  18.492 ns (0 allocations: 0 bytes)
3.3201169227365472

julia> versioninfo()
Julia Version 1.7.0-DEV.526
Commit 6468dcb04e* (2021-02-13 02:44 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

Thankfully it sounds like all Linux distros (not just Clear) should start benefiting from similar optimizations fairly soon.

7 Likes

On my macOS (Intel) machine, everything in Julia 1.6-rc1 is faster than numpy 1.19.5 in Python 3.8.5.

1 Like

It’s impressive that it’s possible to beat optimized C libraries in certain cases with native Julia code, but focusing too much on such cases will often lead people astray. Benchmarks against optimized C code help establish Julia’s performance capabilities, but only represent a starting point for interesting work.

Fundamentally, the reason Julia exists is not to beat the performance of existing C libraries on existing problems. If all the high-performance code you will ever need has already been written, then you don’t have as much need for a new language. Julia is attractive for people who need to write new high-performance code to solve new problems, which don’t fit neatly into the boxes provided by existing numpy library functions.

It’s not that Julia has some secret sauce that allows it to beat C — it is just that its compiled performance is comparable to that of C (boiling down to the same LLVM backend), so depending on the programmer’s cleverness and time it will sometimes beat C libraries and sometimes not.

It is, however, vastly easier to write high-performance Julia code than comparable C code in most cases, because Julia is a higher level language with performant high-level abstractions. This can also make it easier to be clever, with tricks like metaprogramming that are tedious in C. And the code you write in Julia can at the same time be more flexible (type-generic and composable), allowing greater code re-use.
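As a tiny illustration of that type-generic flexibility (a made-up sketch, not from any particular library):

# one generic definition, reused across number types with no changes:
sumsq(xs) = sum(x -> x^2, xs)

sumsq([1.0, 2.0, 3.0])        # Float64
sumsq([big"1.0", big"2.0"])   # BigFloat: arbitrary precision, same code
sumsq([1//2, 3//4])           # exact Rational arithmetic, same code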

This is important for those of us who want to work on new problems and to go beyond endlessly recycling the solutions of the past.

71 Likes

Another observation is that Go has been very successful for writing high performance multithreaded server code. And guess what the only mainstream dynamic language with the same threading model is? Julia. Heck the other dynamic languages are mostly incapable of any actual threading let alone anything as fancy as Go and Julia. Now, of course, Julia’s implementation is nowhere near as mature or well-tuned for servers as Go’s is, but there are many reasons to believe that Julia will be increasingly popular for that kind of application among programmers who want easy, high performance concurrency but want to keep using a dynamic language.
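To make the comparison concrete, here is a minimal sketch of Go-style concurrency in Julia using nothing but Base (Threads.@spawn plays the role of go, and Channel the role of a Go channel; all names are illustrative):

using Base.Threads: @spawn

function serve(requests::Channel{Int}, results::Channel{Int})
    for r in requests          # iterating a Channel blocks until it is closed
        put!(results, r^2)     # "handle" the request
    end
end

requests = Channel{Int}(32)
results = Channel{Int}(32)
workers = [@spawn serve(requests, results) for _ in 1:4]   # a small worker pool

@spawn begin                   # producer task
    foreach(i -> put!(requests, i), 1:10)
    close(requests)
end

foreach(_ -> println(take!(results)), 1:10)   # collect the (unordered) replies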

20 Likes

Does that mean once an algorithm matures people will still ultimately rewrite it in C/C++ to get the best performance?

Rarely do you have a completely “figured out” subroutine that involves only primitive data types and no part the user would ever want to customize.

Even tasks as “clearly defined” as what OpenBLAS does are not completely fixed, given that one can have a special (sparse, Hermitian) matrix, and/or the matrix elements can be number-like but not IEEE floats. Compare that to a (hypothetical) BLAS written with Tullio, which would have much more flexibility (see the sketch below).

So, technically yes, but it’s nothing compared to the infinitely many things that still need customizing, or that haven’t even been conceived of yet.
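To make the Tullio comparison concrete, a minimal sketch of such an eltype-generic matmul (assuming the Tullio package; mymul is a made-up name):

using Tullio

mymul(A, B) = @tullio C[i, j] := A[i, k] * B[k, j]   # einsum-style kernel

A = 1 .// rand(1:9, 3, 3)   # a Rational{Int} matrix
B = 1 .// rand(1:9, 3, 3)
mymul(A, B)                 # exact Rational result, which BLAS cannot provide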

1 Like

I don’t see why. One of the strengths of Julia is the macro system, allowing you to get “under the hood” with minimal effort.
It would certainly be possible to do the high-performance optimizations that you might do in C, in Julia.
Doesn’t LoopVectorization demonstrate that this is not only possible but is already being done?
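For example, a minimal sketch of that kind of optimization (assuming the LoopVectorization package; mydot is a made-up name):

using LoopVectorization

function mydot(a, b)
    s = zero(eltype(a))
    @avx for i in eachindex(a, b)   # SIMD-vectorizes and unrolls the loop
        s += a[i] * b[i]
    end
    return s
end

mydot(rand(10^4), rand(10^4))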

2 Likes

To clarify that statement a little: it’s not that Julia can’t reach the performance of highly tuned C libraries, it’s just that doing so takes a long time (although it probably takes even more time to optimize the C). The first draft of Julia code (if written by someone who knows how to use Julia) is generally competitive with non-tuned C, and Julia offers easier performance tuning than C (things like @avx, @simd, etc.). Then getting the last drops of performance out of either takes a lot of effort.

5 Likes

No. While Steven pointed out that because Julia uses the same LLVM compiler that compiles C code, and therefore isn’t generically faster than C, the converse is also true: C isn’t generically faster than Julia. In almost all situations I’ve encountered or cared enough to invest in, Julia = C for performance and is vastly more pleasant to code in. Yes, there are cases where you have to know how to write the code; parsing, like CSV and libdeflate, is a tricky case for Julia to handle because it’s intrinsically type-unstable. But the history of CSV parsing shows that with sufficient investment you can go from behind-the-competition to best-of-breed once Julia’s features are fully exploited. And it’s not like those other CSV libraries in other languages hadn’t had a lot of time invested in them.
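As an aside, the standard trick for taming that kind of type instability is a function barrier; a hypothetical sketch (parsecolumn stands in for real parsing code):

# the column's element type is only known at runtime, so hand the result to a
# separate function; Julia compiles a fast specialized method for each type.
parsecolumn(io) = rand(Bool) ? [1, 2, 3] : [1.0, 2.0, 3.0]   # stand-in parser

process(io) = kernel(parsecolumn(io))   # the barrier: dispatch specializes here
kernel(col) = sum(col)                  # fast and type-stable once specialized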

35 Likes

This would probably come down to a tradeoff of development time/cost versus performance gain. If Julia could only get to half the performance of C/C++, then yes, it looks like you are going to do something in C/C++. If Julia gives you 98 to 99 percent of the performance, is it really worth it? Now, where that percentage cutoff is… that probably depends on the situation.

1 Like

You must have a very peculiar understanding of how programming languages work. FWIW, there is no magic (of any color) in C++ or Julia. It’s just a compiler.

Idiomatic Julia code can generate programs that are as fast as optimized C or C++ (modulo some LLVM quirk). The great thing about Julia is that I can do this with relatively little effort, and still have maintainable code.

No, there is no need for this if you are coding in Julia.

9 Likes

If you want to provide performance-optimized code/algorithms as a library for the benefit of other programming environments, it is still more natural to use C/C++/Fortran.

You wouldn’t develop in Julia when you want to make a high-performance library available to other languages.

In this sense, the applicable context is narrower than for C/C++. This has been so from the beginning; therefore it’s not getting narrower but stays constant. The claim doesn’t hold in general, though: overall, the applicable context broadens, IMO.

For me this would be the most important improvement for Julia: compiling+linking native executables and shared/static libraries from Julia code.

9 Likes

I’m not sure if that’s true. If it takes 10x longer to write the high-performance library in C than it would take you in Julia, I imagine the extra development cost might sometimes be hard to justify. There are also a number of examples out in the wild where people have used Julia to speed up stuff in R or Python, simply because it’s nicer to write Julia than C++, or because they wanted to use some other library available in Julia.

4 Likes

There have been many questions in the forums from people trying to embed Julia in C/C++. I’ve just assumed that it was for an application, but they could be building libraries. If the bulk of the code can be easily written in Julia, then it could make sense to just provide a small C/C++ API layer.
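On the Julia side, such an API layer boils down to marking entry points with Base.@ccallable, which gives them a C calling convention (a minimal sketch; fastsum is a made-up example):

# a thin C header could then declare this as: double fastsum(double*, int);
Base.@ccallable function fastsum(p::Ptr{Cdouble}, n::Cint)::Cdouble
    v = unsafe_wrap(Array, p, Int(n))   # view the C buffer as a Julia Vector
    return sum(v)
end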

1 Like