Sure, go ahead. I haven’t tested against the other implementations though, so I can’t say whether they’re all compatible. How fast are the others on your machine?
I had not tried the Python implementation. I did that just now and got it to run in about 1 second, so we are still slower…
Interesting, I wonder which part of the implementation is slow
Sure, please go ahead. I haven’t had time to test the various iterations in this thread, but I will gladly take a look at the PR if you post a link.
Work has overwhelmed me over the past two weeks, I will try to get back to this by the end of this week.
Finally got around to creating a PR for your fork @simsurace: Julia Performance Improvements from Discourse Thread · Pull Request #2 · simsurace/arrow-experiments
Hi!
I’ve been investigating Julia performance lately, in particular in relation to Arrow, and bumped into this thread. I tried optimizing the example code and ran some simple benchmarks and PProf profiles. My findings:
- I tested all versions against the Python reader, which doesn’t suffer from first-run lag, as well as against plain curl to remove the deserialization variable.
- The Python server runs in ~4.2 seconds against the Python client and in ~1.7 seconds against curl.
- On my end the original implementation runs in ~13.3 seconds against both Python and curl (the Python client claims slightly less time than it actually takes to run).
- The Julia implementation (both in the PR and in the original) generates all the random numbers, and hence the vectors, ad hoc per request, whereas Python builds a single table and then repeatedly sends it. So I experimented with making the Julia implementation follow the Python implementation more closely:
- Materializing the entire dataset ahead of time and serializing per request cut down the second request time (SRT) to 6 seconds.
- Serializing the same chunk repeatedly (as in the Python implementation) does not improve Python client performance, but improves curl SRT to ~3.8 seconds.
- As I understand it, Arrow.write creates a new writer object for every batch, which adds overhead; avoiding that by reusing a single writer cuts curl SRT to about 3.6 seconds. However, the Python client then complains about broken data, although the Julia client accepts it.
- Whether the chunked header is set or not, Python accepts the data but claims it is a single record batch. Furthermore, adding the appropriate line breaks etc. (as in the Python example) makes the Python client reject the input as invalid (Julia does not complain). Regardless, this adds some overhead, such that curl SRT is ~4.8 seconds. Furthermore, the byte-count line as written in the PR contributed an additional ~20% (in profiling) on my end, so I replaced it with
write(stream, string(nbytes, base=16)*"\r\n")
- I also tested creating a single Arrow buffer and sending it over and over (see the sketch just below). This is a bit of cheating on the benchmark, but I thought it interesting to have the data for more context on performance; SRT with curl is ~1.15 seconds and with the Python client is 4 seconds.
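To make that last variant concrete, here is a rough sketch of the “serialize once, resend the same buffer” idea. This is my own illustrative code, not the code from the PR: the dummy table, the helper names, and the hand-rolled chunk framing are all assumptions.

```julia
using Arrow

# Build a dummy table once and serialize it to Arrow IPC stream format a single time,
# so each request only copies bytes instead of re-encoding the table.
const TABLE = (a = rand(4096), b = rand(Int, 4096))
const ARROW_BYTES = let buf = IOBuffer()
    Arrow.write(buf, TABLE)   # writing to an IO produces the stream (not file) format
    take!(buf)
end

# Write one HTTP chunked-transfer chunk: hex byte count, payload, CRLF.
# This uses the cheaper byte-count line mentioned above.
function write_chunk(io::IO, payload::AbstractVector{UInt8})
    write(io, string(length(payload), base = 16), "\r\n")
    write(io, payload)
    write(io, "\r\n")
end

# Send the same pre-serialized buffer nparts times, then the terminating chunk.
function send_parts(io::IO, nparts::Integer)
    for _ in 1:nparts
        write_chunk(io, ARROW_BYTES)
    end
    write(io, "0\r\n\r\n")
end
```

If the server framework already does chunked transfer encoding itself, the manual framing here would of course be redundant; I wrote it out only because the byte-count line is what showed up in profiling.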
General observations:
- The Julia Arrow implementation in a sense has a parser and a serializer for the Arrow data, but the data get mapped into native Julia structures. This means that even if one constructs an Arrow.Table, each Arrow.write call still needs to do many checks and conversions to recreate the Arrow representation (see the sketch after these observations), whereas the pyarrow implementation keeps the data in the native C++ data structures behind its API and so doesn’t need to do much work when serializing it back to messages. So this is comparing different kinds of benchmarks: in Python the pyarrow API exposes various (C++) vectorized operations while being restricted to the DSL of the API, while in Julia we get native data structures.
- In all the tests I’ve done so far, the Python client claims to receive a different number of bytes from Julia than from Python.
- Moreover, in all the methods that use a stream to write responses (similar to the approach in the PR) and do not generate errors, Python sees only a single 4096-row record batch when interacting with Julia, including when resending the same buffer. I’m not sure yet whether it’s a bug in the Julia or Python implementation, or incorrect usage on my part (ours, since I’m using a very similar flow to the one in the PR).
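To illustrate the first observation with code: even when the data already sit in an Arrow.Table, writing them back out goes through the full encoding path again. A minimal sketch (my own code; the file name and column name are just placeholders):

```julia
using Arrow

arrow_bytes = read("data.arrow")   # some Arrow IPC data, assumed to exist
tbl = Arrow.Table(arrow_bytes)     # columns are exposed as Julia-native-feeling vectors
col = tbl.a                        # placeholder column name

# Re-serializing does not just copy the original buffers back out; as described above,
# Arrow.write re-checks and re-encodes the columns into new messages.
buf = IOBuffer()
Arrow.write(buf, tbl)
```

pyarrow, by contrast, can hand the buffers it already holds straight back to the wire, which is part of why the two benchmarks are not measuring the same thing.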
While investigating, I was able to fix PProf.jl so that it produces profiling data (I was affected by this issue); I’ll share the profiles a bit later if anyone wants to take a look as well.
@slumpy, these are very astute general observations. As you realized, our Julia server is not wire-compatible with the Python implementation. When I redid it, I made the server return multiple single-record-batch Arrow tables, each in its own part of a multipart response. That is different from what the Python code does, which returns a record-batch stream per part of the multipart response.
I did this because I did not see a way in the Julia Arrow package to get just the bytes for a record batch; instead we can only write the whole table. I guess that comes from how the API was designed and the fact that it is different from how Arrow is implemented in other languages, for example by not allowing zero-copy sharing between language runtimes.
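To make that concrete, here is a rough sketch (my own code and naming, not the actual server) of the two ways Arrow.write can currently be used to emit record batches:

```julia
using Arrow, Tables

# One IPC stream, many record batches: Tables.partitioner makes Arrow.write emit a
# single schema message followed by one record batch per partition.
write_single_stream(io::IO, batches) = Arrow.write(io, Tables.partitioner(batches))

# One complete, self-contained stream per batch (schema + batch + end-of-stream marker
# each time), which is roughly what the current server does per multipart part.
function write_stream_per_part(io::IO, batches)
    for batch in batches
        Arrow.write(io, batch)
    end
end
```

Neither path exposes the raw bytes of an individual record batch by itself; as far as I can tell, that is the missing piece for matching the Python framing exactly.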
Thanks for the feedback!
Frankly, I found it somewhat concerning that we are not binary-compatible with other implementations, considering this is supposed to be a “standard” format. What do you perceive to be the main challenge in getting binary compatibility?
I think we would need to modify the Arrow.jl API to allow more control when reading from a batched stream.
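For reference, on the read side Arrow.jl already has Arrow.Stream for iterating record batches; a minimal sketch of what is possible today (my own illustrative code):

```julia
using Arrow, Tables

# Iterate an Arrow IPC stream lazily; each iteration yields one record batch as a
# Tables.jl-compatible object.
function count_batches(io::IO)
    nbatches, nrows = 0, 0
    for batch in Arrow.Stream(io)
        cols = Tables.columntable(batch)   # NamedTuple of column vectors for this batch
        nbatches += 1
        nrows += isempty(cols) ? 0 : length(first(cols))
    end
    return nbatches, nrows
end
```

The finer-grained control discussed above, such as getting at the bytes of a single record batch or handling batches that arrive interleaved with multipart boundaries, would need API additions beyond this.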