I'm trying to write something with the best possible performance (for a library). I keep wishing I were writing C! Is that normal? Should I just write C?

for (i=0; i < string; ++i) {
    if (string[i] == '\\') {
        bytes_written = some_func(string+i, buffer+j);
    keep doing stuff...

It was a case where I want to be able to increment i in the body of the loop. Normally, I’d use Iterators.Stateful, but I don’t know how efficient it will be, and I trust integer arithmetic more. Not that it’s impossible to do in Julia; it’s just less natural, at least for me, right now.

for i=1:length(string)
    j += 1
    if string[i] == '\\'
        bytes_written = some_func(string+i+1, buffer+j)
        continue   # move to the next iteration where i is incremented
    keep doing stuff...
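For what it’s worth, the usual Julia answer to “I want to increment i inside the body” is a plain `while` loop over the bytes, with no Iterators.Stateful and no pointer arithmetic. A rough sketch (`unescape_sketch` is a made-up name, and the real `some_func` logic is elided):

```julia
# Sketch: manual index control with a while loop. The body that
# handles an escape is a stand-in for the `some_func` above.
function unescape_sketch(s::AbstractString)
    bytes = codeunits(s)          # read-only byte view, no copy
    out = IOBuffer()
    i = 1
    while i <= length(bytes)
        if bytes[i] == UInt8('\\') && i < length(bytes)
            write(out, bytes[i + 1])   # stand-in for the real escape handling
            i += 2                     # consume the backslash and the next byte
        else
            write(out, bytes[i])
            i += 1
        end
    end
    String(take!(out))
end

unescape_sketch("a\\nb")   # the backslash is consumed; returns "anb"
```

The point is just that `while` gives you full control over `i`, exactly like the C `for` loop, without trusting any iterator machinery.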


This might be just what the Dr. ordered! I will look at Strs.jl. I’m not too sure what methods are needed for AbstractString and AbstractChar. Can you point to some relevant documentation about this topic?


waaaaaat? I don’t think pointer arithmetic works like that in Julia.

Oh. I see. string was supposed to be a pointer to an array, but there is a bug in my loop declaration (story of my life…). It should have been something like i < length(string). I guess I can do something with @view. That’s the Julia version of pointer arithmetic, right?


Ah, right, I just copied this line and shifted the index by 1 to account for 1-based indexing.

> I guess I can do something with @view. That’s the Julia version of pointer arithmetic, right?

It depends on what some_func actually does. For example, if some_func compares string's char at position i + 1 to another char in buffer's position j, then it’s just indexing:

some_func(string[i + 1], buffer[j])

If in C string + i and buffer + j are pointers in the same array (such as start and end of iterator), you can rewrite it using a view:

some_func(@view original_array[i + 1 : j])

or using 2 indices:

some_func(original_array, i + 1, j)    # iterate over original_array from index i + 1 to index j
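To confirm the no-copy claim: a view is just a window of indices into the parent array, so writes go through to it (toy example):

```julia
a = collect(1:10)
v = @view a[3:6]    # SubArray: shares memory with `a`, no copy is made
v[1] = 100          # writes through to the parent array
a[3]                # now 100: same underlying memory
```

This is why a view is the closest safe analogue of C pointer arithmetic: it carries the parent array plus an index range, rather than a raw address.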


Compare the hash functions :wink:

Edit: Post got deleted.


Yes… I realized that too just after posting that message and thought I’d save everyone the confusion and potential flaming :slight_smile: The data structures I’ve actually struggled with are in DataStructures.jl btw, not Dict, I just thought I’d use that for a better example. I guess I should post some sample code on this forum first before giving up on Julia :slight_smile:


Which is an interesting “solution” to the two-language problem, right? :stuck_out_tongue:


The point is not to write assembly (or llvm, or indeed, lowered/type inferred julia code) but to understand the kinds of things your julia code will be turned into so you can understand the cost of various constructs. It’s mainly useful for detailed optimizations of innermost loops as a useful complement to benchmarking. Personally I’m not fluent in assembly at all, but I can read it just well enough to see whether desired optimizations are happening.


This should not be necessary. I don’t know assembly or LLVM primitives either, but by looking at the output of @code_llvm you can get a pretty good idea of what is going on, what the compiler figured out, and what it didn’t. But even that is not necessary in most cases, go with @code_warntype 99% of the time.

I would also argue that @code_native is less useful than @code_llvm for most cases, since it is rather a lot of information with even less structure, and easy to get lost in. It is also a common mistake to assume that assembly code length somehow maps to execution speed; this does not necessarily hold when optimizing the same computation.

Also, about the original post: I think that “best possible performance” is not a good goal except in very special situations. Diminishing returns kick in rather early in Julia (because it is so fast already) — typically, in order of descending importance, after you have

  1. chosen the right algorithm and data structures,
  2. made sure things are concretely typed,
  3. made sure the compiler can figure out types (@code_warntype),
  4. taken care of major allocations in inner loops,
  5. profiled, optimized, and annotated some parts (e.g. @inbounds), but strictly in Julia.
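A tiny illustration of points 2 and 3 (`Loose` and `Tight` are hypothetical names): an abstractly typed field defeats inference, which `@code_warntype` will flag, while a type parameter keeps everything concrete:

```julia
struct Loose
    x::Number            # abstract field type: values are boxed, access is slow
end

struct Tight{T<:Number}
    x::T                 # concrete once T is known: fast, fully inferred
end

f(c) = c.x + 1.0

# In the REPL:
#   @code_warntype f(Loose(1.0))   # body littered with ::Any
#   @code_warntype f(Tight(1.0))   # everything inferred as Float64
```

Both structs compute the same thing; only the `Tight` version lets the compiler specialize `f` down to a couple of machine instructions.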

These days, especially with master, this usually gets you within 10–20% of what you could get with micro-optimizations which are much, much more costly to maintain in the long run. Be wary of the latter: a lot of libraries are littered with hacks that looked like a good idea at the time, but then 3 years later are still around and may even be suboptimal until someone bothers to benchmark and refactor (I am guilty of doing this too, but the punishment is self-contained when I need to touch said code).

It sometimes happens that you run into issues where the compiler could figure out something but currently doesn’t. In these cases, you can either wait for the fix and not worry about it in the meantime, or separate the functionality to a kernel function that contains the workaround, and make a note about the issue so that you can remove it once it is fixed.

The strategy I would recommend is writing clean Julia code, and trusting that the compiler is getting more and more clever, and confine micro-optimizations to 1% of the cases where it is really needed.


Assembly instructions are what the processor will actually execute in the end, no matter the programming language. In other words, the point of every programming language is to turn input code into assembly instructions. Being able to inspect this generated code is very useful, and if you want to write performant code you will need to learn it, again, irrespective of the programming language used.


While being able to understand assembly is a very useful skill, I would argue that the majority of Julia users trying to write reasonably performant code (within, say, 50% of C/Fortran) can get away with @code_warntype 90% of the time and possibly @code_llvm for the remaining 10%. Fortunately, the latter is still pretty readable and much easier to follow than native asm.


With the exception of algorithms that are designed around a specific processor instruction (e.g. AES, fancier SIMD, bit manipulation). This is currently super annoying in Julia (compared to C). The second exception is C-style unions, which are just not supported in Julia (it kinda works by pointer conversion, but it’s ugly and often emits very slow, bad code), and the third is mutable fixed-size stack-allocated temporary arrays (this kinda works via unsafe_store! to pointer_from_objref, but is even uglier than in C).

Some day I want to port/adapt adaptive radix trees to pure julia, but this needs all three of these features.
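One mitigating note: the type-punning use of a C union is actually covered safely by `reinterpret`; it is the true multi-field unions and the stack arrays above that remain painful:

```julia
bits = reinterpret(UInt32, 1.0f0)   # bit pattern of a Float32, no numeric conversion
# bits == 0x3f800000
back = reinterpret(Float32, bits)   # round-trips exactly back to 1.0f0
```

This compiles to a no-op bitcast, the same as the `union { float f; uint32_t u; }` idiom in C, but without the undefined-behavior questions.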


Another thing you can do is use parametric types to pass the missing information to the compiler, and use @pure if necessary.

Agreed, I think @code_warntype might be more useful in general, because you can see whether the inferred types are all stable and where inference runs into trouble, which is more important.


While I don’t disagree about the importance of these features, I think that these discussions make it difficult for new users to focus on the relatively low-hanging fruits of optimizing Julia code with the simple to use facilities that Julia offers, without worrying about CPU instructions or similar.

There is a healthy tendency to squeeze out performance from code on this forum, but sometimes it very quickly gets to eyeballing output of @code_native and discussing SIMD, so I think that some users get the impression that getting reasonable performance out of Julia is a daunting enterprise that involves assembly, SIMD, low-level tricks and interfacing with LLVM. But this is not true: while one can always optimize things, common usage is mostly about avoiding pitfalls, and then getting reasonable performance.


You should probably benchmark LazyJSON and JSON2. Both are much faster than the JSON package. Even if you don’t use them, the code might provide inspiration.

TextParse may be useful, too.


It’s hard to give specific help without specific problems that you’re trying to tackle but one can certainly write high performance string code in Julia using the native String type, although it is a bit different than C and does take a little getting used to. The ncodeunits and codeunit functions give you read access to raw bytes at C performance. The nextind and prevind functions allow you to navigate between Unicode characters (using code unit indices). Unless you need to modify the string data in place, I suspect you don’t need to convert anything to byte vectors and that doing so may make things less efficient and more complicated. If you have specific questions, I’m happy to give advice.
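A small sketch of that pattern: raw byte access via `codeunit`, plus character-aware navigation via `nextind` (the helper names are made up for illustration):

```julia
# Count backslashes by scanning raw bytes: C-speed, no Char decoding.
count_backslashes(s::AbstractString) =
    count(i -> codeunit(s, i) == UInt8('\\'), 1:ncodeunits(s))

# Walk characters using code-unit indices; nextind skips over the
# extra bytes of multi-byte (UTF-8) characters like '∀'.
function chars_via_indices(s::AbstractString)
    out = Char[]
    i = firstindex(s)
    while i <= ncodeunits(s)
        push!(out, s[i])
        i = nextind(s, i)
    end
    out
end

count_backslashes("a\\b∀c")    # 1
chars_via_indices("a\\b∀c")    # ['a', '\\', 'b', '∀', 'c']
</imports>
```

The byte loop never allocates and indexes directly into the string’s buffer; the character loop stays correct on arbitrary UTF-8 input.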


TextParse.jl has some very highly optimized routines for parsing. In general, at this point, we seem to be very competitive with some of the fastest implementations that exist in C (see here for the latest benchmarks).

Two things you might take a look at: https://github.com/JuliaComputing/TextParse.jl/blob/master/src/utf8optimizations.jl has the really performance critical routines highly optimized using codeunit calls. For the best performance, that seems the way to go. Another neat thing is https://github.com/JuliaComputing/TextParse.jl/blob/master/src/VectorBackedStrings.jl. We use that to provide an AbstractString over a memory mapped file, which is great for parsing files that are very large.
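On the earlier question of what AbstractString needs: roughly `ncodeunits`, `codeunit`, `isvalid`, and `iterate`. A hypothetical minimal wrapper over a byte vector, in the spirit of VectorBackedStrings (ASCII-only for brevity; a real implementation must respect UTF-8 character boundaries):

```julia
struct ByteVecString <: AbstractString
    data::Vector{UInt8}     # e.g. a slice of a memory-mapped file
end

Base.ncodeunits(s::ByteVecString) = length(s.data)
Base.codeunit(s::ByteVecString) = UInt8
Base.codeunit(s::ByteVecString, i::Integer) = s.data[i]
# ASCII assumption: every in-range byte starts a character.
Base.isvalid(s::ByteVecString, i::Integer) = 1 <= i <= ncodeunits(s)
Base.iterate(s::ByteVecString, i::Int = 1) =
    i > ncodeunits(s) ? nothing : (Char(s.data[i]), i + 1)

s = ByteVecString(Vector{UInt8}("abc"))
length(s)      # 3, via the generic AbstractString fallback
collect(s)     # ['a', 'b', 'c']
```

With just these methods, the generic fallbacks give you `length`, comparison, printing, and so on for free.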

So, from spending a fair bit of time with that code base over the last half year, here are some reasons why I think it makes a lot of sense to write this in pure julia and not C:

  • while the code might be slightly more verbose than a C version, it is generally “safer”: essentially no pointer arithmetic, etc., and the code is very clear.
  • I think the biggest benefit to me is how super productive one can be when working on this type of code, with the help of Revise.jl and ProfileView.jl. Here is my typical workflow: I have a REPL open with Revise loaded, then load TextParse and BenchmarkTools.jl. I now load some large file under the profiler, so @profile csvread("foo.csv"), then I look at the profiling results with ProfileView.jl and immediately see where it would pay off to optimize code. I pick a certain function to optimize, based on that, and execute it under BenchmarkTools at the REPL: @btime tryparsenext(....). Alright, now I have a baseline. I work a bit on the code, hit the up arrow in the REPL, and run the benchmark again. Thanks to Revise.jl, the new code is what runs, and I can immediately tell whether my new code is better or not. And then I repeat this cycle. I find that unbelievably productive. No long breaks while I wait for some compiler to finish something, and a super rapid feedback cycle with some really powerful tools.
  • I think this was mentioned above already, but the ability to seamlessly decide at which level you actually need really performant code is also nice. It is just super easy to say “oh, I guess this function also needs to be faster”. You can just immediately start working on that function, without worrying about how you might move it out into a C library and how things would fit together at that point…

There are some things that I think we could find better patterns for. For example, I’m pretty sure we could write some of this code in a more generic way, so that it has the performance we see right now, but not just for UTF8 encoded strings, but also for other string types. One of my goals is to be able to plug a non-UTF8 string type under the hood of TextParse.jl, so that one can also read files that are encoded with some other encoding. I haven’t found the perfect balance for that issue, but I’m pretty sure it can be done and I just need to think a bit more about it :slight_smile:


@Sukera @stevengj @StefanKarpinski

All those who asked for sample code, here’s some:

So I got this working, and it’s more than 3x faster for my use case than any of the JSON alternatives mentioned in this thread (probably because it does much less; my use case, right now, is dealing with 800 million RDF N-Triples, which have JSON string literals embedded in them with A LOT of Unicode escapes, so my efforts have been mostly focused on making the conversion of Unicode escapes faster; that is, I’m not interested in parsing huge JSON files at this moment, but in parsing a billion+ tiny JSON strings.)

I would say I’m happy with this performance result, but the code was kind of hard to write, and it’s not very idiomatic Julia. It does have one ccall in it, because it made the Unicode escape parsing more than twice as fast, but it’s just a call to libc, not anything exotic, and certainly not any compile-at-home code. It contains many buffers and more pointers than it probably should, but the results are good. I mean, really, the standard library is full of calls to libc.

Also, I commented out the JSON.jl benchmark. You can put it back if you want. Importing JSON.jl actually consistently makes the benchmark for my own code return an ~20% slower result. Not sure why…


Minor code review: You produce garbage on invalid things like raw"""\u123x""" (strtoul fails to signal the error).

Depending on your actual strings, you might get better mileage by using SIMD (pcmpeqb / vpcmpeqb / _mm_cmpeq_epi8 / _mm256_cmpeq_epi8) to search for backslashes (copy one 16/32 byte chunk, compare all its bytes to 0x5c at once in a single cycle, fastpath the case of no backslash, otherwise extract position of first backslash and handle it).

That currently needs some processor-specific inline LLVM, though (Kristoffer is working on a package that aims to reach C-intrinsics level of convenience, but it’s not there yet, and it’s not available in SIMD.jl either). If you need more speed, then I would recommend writing that single function (the JSON string unescape, not the entire parser logic) in C and ccall-ing it, until the situation has improved. Even better, look for an existing fast SIMD unescape function with a BSD license (effectively assembly, even if it is formally C code).
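Until then, a portable middle ground is SWAR (SIMD within a register): load 8 bytes as a UInt64, XOR with 0x5c repeated in every byte, and apply the classic zero-byte test to fast-path chunks that contain no backslash. A hedged sketch, assuming a little-endian machine that tolerates unaligned loads (x86/ARM):

```julia
# SWAR sketch: return the index of the first backslash (0x5c) in a
# byte vector, or 0 if none, checking 8 bytes per loop iteration.
function find_backslash(bytes::Vector{UInt8})
    n = length(bytes)
    ONES = 0x0101010101010101
    HIGH = 0x8080808080808080
    MASK = 0x5c5c5c5c5c5c5c5c      # '\\' repeated in every byte
    i = 1
    while i + 7 <= n
        chunk = GC.@preserve bytes unsafe_load(Ptr{UInt64}(pointer(bytes, i)))
        x = chunk ⊻ MASK            # backslash bytes become 0x00
        hit = (x - ONES) & ~x & HIGH  # high bit set for each zero byte (exact test)
        # lowest set bit locates the first backslash (little-endian byte order)
        hit != 0 && return i + (trailing_zeros(hit) >> 3)
        i += 8
    end
    while i <= n                    # bytewise tail
        bytes[i] == 0x5c && return i
        i += 1
    end
    return 0
end

find_backslash(Vector{UInt8}("ab\\cdefgh"))   # 3
```

It won’t match real pcmpeqb throughput, but it fast-paths backslash-free chunks 8 bytes at a time with no inline LLVM and no C dependency.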