The Julia Challenge

Which one? Maybe the xtensor solution is slower?

As fast as the fastest solution :wink:

What did he use before?

xtensor::sin - which is supposed to be faster, much like Julia’s sin is supposed to be faster since it supports SIMD + inlining better - but it trips you up in this particular situation.

The xtensor solution has the benchmark and the solution in one file.

That’s not true, it has only the benchmark code, since the implementation happens inside xtensor - and the implementation is exactly the target of the Julia challenge. So the real solution would be xtensor itself, which is an order of magnitude more complicated :stuck_out_tongue: The xtensor developers also seem to agree with it being quite the monster and might refactor xtensor based on the new C++17 implementation born out of the challenge, which turned out to be much more concise!

So a fair comparison would be to compare it to Julia’s broadcasting without the implementation, which would look something like this:

using BenchmarkTools  # needed for @btime

# Broadcasting arrays of different shapes together with a scalar:
a = rand(1000, 1000);
b = rand(1000);
c = 1.0
out = similar(a);
@btime $out .= $a .+ $b .- sin.($c)

# Broadcasting a user-defined function over a custom element type:
using GeometryTypes
const Point3 = Point{3, Float32}
super_custom_func(a, b) = sqrt(sum(a .* b))
a = rand(Point3, 10^6)
b = rand(Point3, 10^6)
out = fill(0f0, 10^6)
@btime $out .= super_custom_func.($a, $b)

# A fused broadcast feeding directly into a reduction:
@btime sum($a .+ $b .- sin.($c))

Admittedly that’s also not 100% comparable, but I hope you see my point :wink:

Maybe POC/WIP: leverage LLVM function attributes for better CSE by jrevels · Pull Request #29566 · JuliaLang/julia · GitHub fixes this.
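For context, a minimal sketch of the pitfall being discussed (the variable names are just illustrative): inside a fused broadcast, the scalar sin(c) may be recomputed once per element unless the compiler can hoist it, so hoisting it by hand is a workaround:

using BenchmarkTools

a = rand(1000, 1000); b = rand(1000); c = 1.0
out = similar(a)

# sin.(c) becomes part of the fused kernel and may run once per element:
@btime $out .= $a .+ $b .- sin.($c)

# Hoisting the scalar computation out of the broadcast by hand:
s = sin(c)
@btime $out .= $a .+ $b .- $s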

2 Likes

@Liso, @sdanisch’s point, in case it’s not clear, is that those two files don’t do the same thing at all. Your comparison is between C++ code using a solution to the challenge and Julia code implementing a solution to the challenge. Of course usage code—in any language—is much simpler and more readable.

2 Likes

Well :slight_smile: In real life, the libraries/ecosystem are part of the language. In a real-life challenge, users could not avoid libraries.

Which supports very well the idea that C++ is significantly improving!

Yes, I see. :slight_smile:

Thank you for a really interesting challenge! :slight_smile:

I hope we will see your analysis, and analyses from other people too, in the near future… :stuck_out_tongue_winking_eye:

I fully agree.

But if you use xtensor, then the usage code is very similar! And that is the most important thing here from my (and, I think, any user’s) perspective.

The C++ language’s expressiveness is so strong that it supports a library whose usage code is very similar to Julia’s! (Quite fantastic for an old language, isn’t it?)

If we think of xtensor as an initiative to cover the broadcast-array-function problem, it is not so important (from a user’s perspective) whether it is baked directly into the language or whether the language is strong enough that libraries like xtensor are possible (and exist) in it.

Maybe @sdanisch (or anybody) could show some part of the “broadcast problem” that xtensor does not cover but Julia covers simply.

Because somebody not experienced in this area (like me) could think that xtensor is really universal and comparable to what Julia can do today for the “broadcast problem”.

This one probably needs a few more words. Today’s Julia needs to do this compilation again and again, while C++ can simply create reusable executable machine code.
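A minimal sketch of what that recompilation looks like in practice (the function is illustrative): within a single session the compiled code is reused, but it is compiled anew for each fresh session:

f(x) = sum(x .^ 2)
@time f(rand(1000))  # first call compiles f for Vector{Float64}
@time f(rand(1000))  # second call reuses the already-compiled code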

The broadcast feature in Julia solves one very specific problem. “Vectorized” systems like Python/NumPy, R and Matlab have lots of vectorized functions which operate on whole arrays element-wise. This is technically necessary in these systems because otherwise they cannot get performance: you need to queue up enough work in the slow high-level language to be done all at once in the fast low-level language or you can never hope to amortize the slowness of the high-level language.

Neither Julia nor C++ has this problem since you can just write for loops or call functional programming operations like map and they’re fast with any kind of scalar operation. However, people still want to write vectorized style code because it’s convenient and expressive. But where do you draw the line about what functions are vectorized or not? There’s no good place to draw that line—it’s inherently arbitrary. In older versions of Julia it was some arbitrary subset of “traditional” vectorized functions, but it really had no rhyme or reason beyond historical happenstance. In Julia 1.0 we don’t have any vectorized operations built-in: instead, we have a completely general mechanism for vectorizing any scalar function at all, whether built-in or user defined. It also has a convenient syntax and allows “fusing” so that we don’t have to allocate all the temporaries that are needed in old-style vectorized languages.
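To make that concrete, here is a minimal sketch (the function name is illustrative) of vectorizing a user-defined scalar function with dot syntax; the chained dot calls fuse into a single loop with no temporary arrays:

clamp01(x) = clamp(x, 0.0, 1.0)   # an ordinary user-defined scalar function

x = rand(1000)
y = similar(x)

# One fused loop; `.=` writes into `y` in place, allocating nothing:
y .= clamp01.(2 .* x .- 1)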

6 Likes

In any case, the way this thread is turning out is why I am cautious about “challenges” like this one: in the end, the discussion focuses too much on “winning” it, which is not always so clear-cut.

C++ is a very robust language, with a lot of industry support and developer time behind it. But at the same time its language development process is much more bureaucratically formalized than Julia’s, so it is a small miracle that it is as friendly as it is. It is still amazing what people end up doing with it.

In contrast, Julia’s broadcasting framework and the features that make it possible were developed much more organically, with gradual refinement, relatively rapidly, and without fundamentally redesigning the language. E.g., fusion went from a proposal to an implementation in the course of a few months (if I remember correctly, in the summer of 2016), and was rewritten after that. It is not only the end result but this process that is a great strength of the language.

6 Likes

Yes. The danger of challenges is that they tend to be about winning.

But I think we have some kind of win-win situation here.

The challenge says:

A first prototype took me half an hour to write, and a further 1.5 hours to optimize all cases to match Julia’s SIMD optimized broadcast implementation in terms of performance. This should be the baseline for truly winning this challenge.

We got several lessons here:

  • doing something in a fast language doesn’t auto-magically make the code fast
  • benchmarking in the era of highly optimized compilers is treacherous
  • C++ is still admirable
  • it is easier to start a challenge than to maintain it

What could we (or Simon) do better to make the baseline faster?

I suppose that Simon did the @code_* work, so what else could be useful to avoid a big (12x slower code) penalty?

I think that varying the broadcast functions, benchmarking them standalone, and comparing the performance impact could do the right work here (that is what I am trying to remember as the main lesson).
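For example, a minimal sketch (reusing the arrays from the earlier post) that swaps the function inside the broadcast and measures each variant on its own:

using BenchmarkTools

a = rand(1000, 1000); b = rand(1000); c = 1.0
out = similar(a)

@btime $out .= $a .+ $b               # no scalar function at all
@btime $out .= $a .+ $b .- sin.($c)   # with sin
@btime $out .= $a .+ $b .- exp.($c)   # with a different function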

I am curious whether somebody has other proposals/ideas/comments/lessons!

QuantStack has a lot of great stuff that makes C++ more friendly to new users. xframe is also very promising. Not only is C++ getting more readable, Fortran is also much easier nowadays (array initialization with square brackets, coarrays, OOP, etc.). I guess when C++ has a good package management system under Windows (there are now conan and vcpkg), its user base will grow faster. On the other hand, R and Python are getting faster.

I am not sure if the two-language issue will still be an issue in the future, when hardware gets cheaper and many calculations are done in the cloud.

I am not sure if I am making a proper analogy. BSD distros used to emphasize solving issues under one consistent framework, and they built a whole system, while Linux distros only share a kernel and glue together all kinds of utilities from different tools. In terms of consistency or purity, BSD distros seem better, while Linux distros provide better support for new hardware and get more users. My point is that it makes little sense to emphasize that a package is pure Julia. As long as Julia syntax is as easy as Matlab’s and the speed is OK, people will like it.

1 Like

Hardware already did get cheaper and faster. That never changes anything. Calculation in the cloud costs money. Faster implementation means lower cost. This circle never ends. You’re always pushing the envelope. (I have to quote the old, possibly apocryphal “640K ought to be enough for anybody.” Sorry about that.)

That emphasizes the end-user perspective and ignores the developer perspective. The point of a pure-Julia implementation is to encourage contributions, and to show: ‘hey, you can write featureful and performant software just knowing this language.’

3 Likes

Sorry, but I have a different opinion here. Dropbox, for example, had a problem with Python in their cloud, because when you are doing big things, the costs differ by a big factor too. If you need to buy 12x more machines or pay a 12x bigger electricity bill (or if your mobile device needs to be recharged 12x more often), it is a real problem.

But you also reminded me of the second-language problem, which I forgot to mention in my previous post. A good, proper algorithm in one (good, fast) language was not enough in this challenge. I am not sure (and I need to think about it more), but it seems that it is good to have an expert on LLVM optimizations too (which looks like a second language).

Does this look like the one language we love?

julia> tst(a,b,c) = a .+ b .- sin(c);

julia> @code_llvm tst([1,1,1],[2,2,2],3)

; Function tst
; Location: REPL[26]:1
define nonnull %jl_value_t addrspace(10)* @julia_tst_35341(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), i64) {
top:
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 4
  %3 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %3, i8 0, i32 32, i32 0, i1 false)
  %4 = alloca [1 x { i64 }], align 8
  %5 = alloca [1 x { i64 }], align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -10920
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
; Function sin; {
; Location: special/trig.jl:53
; Function float; {
; Location: float.jl:269
; Function Type; {
; Location: float.jl:254
; Function Type; {
; Location: float.jl:60
  %6 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %7 = bitcast %jl_value_t addrspace(10)** %6 to i64*
  store i64 4, i64* %7
  %8 = getelementptr %jl_value_t**, %jl_value_t*** %ptls, i32 0
  %9 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %10 = bitcast %jl_value_t addrspace(10)** %9 to %jl_value_t***
  %11 = load %jl_value_t**, %jl_value_t*** %8
  store %jl_value_t** %11, %jl_value_t*** %10
  %12 = bitcast %jl_value_t*** %8 to %jl_value_t addrspace(10)***
  store %jl_value_t addrspace(10)** %gcframe, %jl_value_t addrspace(10)*** %12
  %13 = sitofp i64 %2 to double
;}}}
  %14 = call double @julia_sin_35320(double %13)
;}
; Function materialize; {
; Location: broadcast.jl:748
; Function instantiate; {
; Location: broadcast.jl:266
; Function combine_axes; {
; Location: broadcast.jl:422
; Function broadcast_axes; {
; Location: broadcast.jl:206
; Function axes; {
; Location: broadcast.jl:216
; Function _axes; {
; Location: broadcast.jl:218
; Function combine_axes; {
; Location: broadcast.jl:422
; Function broadcast_axes; {
; Location: broadcast.jl:206
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: array.jl:155
  %15 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %16 = bitcast %jl_value_t addrspace(11)* %15 to %jl_value_t addrspace(10)* addrspace(11)*
  %17 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %16, i64 3
  %18 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %17 to i64 addrspace(11)*
  %19 = load i64, i64 addrspace(11)* %18, align 8
;}
; Function map; {
; Location: tuple.jl:162
; Function Type; {
; Location: range.jl:298
; Function Type; {
; Location: range.jl:289
; Function max; {
; Location: promotion.jl:436
  %20 = icmp sgt i64 %19, 0
  %21 = select i1 %20, i64 %19, i64 0
;}}}}}}
; Function combine_axes; {
; Location: broadcast.jl:423
; Function broadcast_axes; {
; Location: broadcast.jl:206
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: array.jl:155
  %22 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %23 = bitcast %jl_value_t addrspace(11)* %22 to %jl_value_t addrspace(10)* addrspace(11)*
  %24 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %23, i64 3
  %25 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %24 to i64 addrspace(11)*
  %26 = load i64, i64 addrspace(11)* %25, align 8
;}
; Function map; {
; Location: tuple.jl:162
; Function Type; {
; Location: range.jl:298
; Function Type; {
; Location: range.jl:289
; Function max; {
; Location: promotion.jl:436
  %27 = icmp sgt i64 %26, 0
  %28 = select i1 %27, i64 %26, i64 0
;}}}}}}}
; Function broadcast_shape; {
; Location: broadcast.jl:427
; Function _bcs; {
; Location: broadcast.jl:433
; Function _bcs1; {
; Location: broadcast.jl:439
; Function _bcsm; {
; Location: broadcast.jl:441
; Function ==; {
; Location: range.jl:698
; Function ==; {
; Location: promotion.jl:425
  %29 = icmp eq i64 %28, %21
;}}
; Function ==; {
; Location: promotion.jl:425
  %30 = icmp eq i64 %21, 1
;}
  %value_phi.v = or i1 %30, %29
;}
  br i1 %value_phi.v, label %L32, label %L20

L20:                                              ; preds = %top
; Function _bcsm; {
; Location: broadcast.jl:441
; Function ==; {
; Location: promotion.jl:425
  %31 = icmp eq i64 %28, 1
;}}
  br i1 %31, label %L32, label %L29

L29:                                              ; preds = %L20
; Function Type; {
; Location: array.jl:12
  %32 = bitcast %jl_value_t*** %ptls to i8*
  %33 = call noalias nonnull %jl_value_t addrspace(10)* @jl_gc_pool_alloc(i8* %32, i32 1424, i32 16) #2
  %34 = bitcast %jl_value_t addrspace(10)* %33 to %jl_value_t addrspace(10)* addrspace(10)*
  %35 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(10)* %34, i64 -1
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140477210827104 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %35
  %36 = bitcast %jl_value_t addrspace(10)* %33 to %jl_value_t addrspace(10)* addrspace(10)*
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140477219303520 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %36, align 8
;}
  %37 = addrspacecast %jl_value_t addrspace(10)* %33 to %jl_value_t addrspace(12)*
  %38 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
  store %jl_value_t addrspace(10)* %33, %jl_value_t addrspace(10)** %38
  call void @jl_throw(%jl_value_t addrspace(12)* %37)
  unreachable

L32:                                              ; preds = %L20, %top
  %.sroa.033.0 = phi i64 [ %28, %top ], [ %21, %L20 ]
;}}}}}}}
; Function broadcast_shape; {
; Location: broadcast.jl:427
; Function _bcs; {
; Location: broadcast.jl:431
  %.sroa.030.0..sroa_idx31 = getelementptr inbounds [1 x { i64 }], [1 x { i64 }]* %4, i64 0, i64 0, i32 0
  store i64 %.sroa.033.0, i64* %.sroa.030.0..sroa_idx31, align 8
;}}}}
; Function copy; {
; Location: broadcast.jl:768
; Function similar; {
; Location: broadcast.jl:193
; Function similar; {
; Location: abstractarray.jl:617
; Function similar; {
; Location: abstractarray.jl:618
; Function Type; {
; Location: boot.jl:411
; Function Type; {
; Location: boot.jl:403
; Function Type; {
; Location: boot.jl:394
  %39 = call %jl_value_t addrspace(10)* inttoptr (i64 140477405321664 to %jl_value_t addrspace(10)* (%jl_value_t addrspace(10)*, i64)*)(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140477210670736 to %jl_value_t*) to %jl_value_t addrspace(10)*), i64 %.sroa.033.0)
;}}}}}}
; Function copyto!; {
; Location: broadcast.jl:792
; Function copyto!; {
; Location: broadcast.jl:828
; Function axes; {
; Location: abstractarray.jl:75
; Function size; {
; Location: array.jl:155
  %40 = addrspacecast %jl_value_t addrspace(10)* %39 to %jl_value_t addrspace(11)*
  %41 = bitcast %jl_value_t addrspace(11)* %40 to %jl_value_t addrspace(10)* addrspace(11)*
  %42 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %41, i64 3
  %43 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %42 to i64 addrspace(11)*
  %44 = load i64, i64 addrspace(11)* %43, align 8
;}
; Function map; {
; Location: tuple.jl:162
; Function Type; {
; Location: range.jl:298
; Function Type; {
; Location: range.jl:289
; Function max; {
; Location: promotion.jl:436
  %45 = icmp sgt i64 %44, 0
  %46 = select i1 %45, i64 %44, i64 0
;}}}}}
; Function ==; {
; Location: tuple.jl:280
; Function _eq; {
; Location: tuple.jl:284
; Function ==; {
; Location: range.jl:698
; Function ==; {
; Location: promotion.jl:425
  %47 = icmp eq i64 %46, %.sroa.033.0
;}}}}
  br i1 %47, label %L61, label %L194

[...radically shortened...]

It might be more interesting in reverse: see how other languages can express a given problem in the most elegant and efficient manner, and then try it in Julia?

3 Likes

You are mistaken. Looking at @code_llvm is the last thing one should do when optimizing code. Checking the things listed in the Performance tips section of the manual, looking at @code_warntype, and profiling are much more useful and usually sufficient.
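For instance, a minimal sketch (the function names are just illustrative) of the kind of problem @code_warntype catches without any need to read LLVM IR:

unstable(x) = x > 0 ? x : 0.0      # may return Int or Float64: type-unstable
stable(x)   = x > 0 ? x : zero(x)  # always returns typeof(x)

@code_warntype unstable(1)  # Body::Union{Float64, Int64} is flagged
@code_warntype stable(1)    # Body::Int64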

Dunno. Do you love LLVM? Curious, but there is no accounting for taste.

2 Likes