Yep. Data pipelines are not a strength of Julia yet. They could be, but some important tools are still in early stages. Reading Parquet files has been a bottleneck for many, and Hive-partitioned Parquet datasets are ubiquitous.
That doesn’t seem right. You should always be able to match MATLAB (excluding startup overhead, perhaps) or beat it. Have you read the performance section of the manual? Is it just about speed, or is there something you miss library-wise? (You can call MATLAB from Julia, and more besides, e.g. Python.)
There have been threads here where MATLAB code runs faster than Julia code, particularly because of the tuning of the linear algebra machinery behind it. MATLAB uses MKL by default, to start with, so the very first experience is often to see Julia run slower. Most of the time the Julia code matches the MATLAB code after one adjustment or another, but an experienced MATLAB user needs some other motivation to keep trying Julia beyond that.
That’s not to mention that many of MATLAB’s base functions are multithreaded, which improves the first impression. In principle one can beat MATLAB, but it often requires significantly more mental load.
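To make the two adjustments above concrete, here is a minimal sketch (the `using MKL` line assumes the MKL.jl package has been added and is shown as a comment so the snippet runs with the standard library alone):

```julia
using LinearAlgebra

# MATLAB ships MKL and multithreads base functions out of the box; Julia
# ships OpenBLAS. After `] add MKL`, loading it swaps the BLAS backend
# for the session:
#     using MKL

# Let BLAS use every core, as MATLAB does by default:
BLAS.set_num_threads(Sys.CPU_THREADS)

A = rand(500, 500)
B = A * A'    # multithreaded dgemm via the active BLAS backend
```

On Julia ≥ 1.7, `BLAS.get_config()` shows which backend is actually loaded.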
Speed advantage over MATLAB is a red herring. It can be had, and often it does materialize, but in my opinion it matters far more that Julia is a superior programming language for large-system development. (Large system: obviously, more than just a few dozen lines in a script. Think several hundred or a few thousand lines; a substantial package, for instance.)
The advantage is actually acute even for very small code. If you just need a couple of short functions to try something out: congratulations, in MATLAB you need to create two files with names that don’t conflict with anything lying around in your working directory or on your path. Perhaps create a new folder to hold them.
In Julia (or Python, for that matter), you just define `foo` and `bar` as throwaways in the REPL, and off you go.
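For instance, a throwaway pair like this (the names are placeholders) needs no files at all:

```julia
# Two throwaway definitions, typed straight into the REPL; no files needed.
foo(x) = 2 .* x .+ 1                  # one-liner form

function bar(v)                       # multi-line form, still file-free
    s = foo(v)
    return sum(s) / length(s)
end

bar([1.0, 2.0, 3.0])   # 5.0
```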
MATLAB is excruciatingly non-ergonomic; it’s a daily annoyance, and I’m always baffled when people praise its usability.
I don’t want to turn this thread into Julia vs MATLAB, so if you are interested I am happy to provide more details in private messages.
I will only try to tell a condensed story here.
I tried translating three codes: (1) a simple Bayesian vector autoregression (VAR), (2) an algorithm based on the Kalman filter, and (3) more Bayesian VAR code with a Kalman filter (a combination of (1) and (2)).
The Kalman filter code (2) was comparable to MATLAB. I asked people around me who teach Julia courses for help; they looked at it and suggested the performance wasn’t language-specific but a BLAS thing, and that being comparable there is normal.
The other codes are slower, and I guess it is because I don’t know the language and cannot code well in it. I am NOT a programmer, and I have so much experience only with MATLAB that I get annoyed when things don’t work as I expect, but that is NOT Julia’s problem. And MATLAB is a pretty high-level language, so it doesn’t bring you that close to real programming. For example, I didn’t even know much about types before I started with Julia.
Overall, the fixed costs of coming to a new language are really high, depending on individual circumstances, and I tried multiple times to pay them only to be frustrated again and again. For example, Bayesian econometrics relies heavily on loops doing the same thing 20,000 times. And Julia has its own scoping rules for a `for` loop that were a huge hurdle for me to overcome. And when I put the loop in a function to make it behave as I expected and hit an error, I couldn’t even use a debugger for help, because back when I started (Julia 0.5) there was no debugger…
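The scoping hurdle described above can be sketched like this (the exact behavior of the global-scope version differs between Julia versions and between the REPL and a script):

```julia
# At global scope in a script, this loop historically failed: the loop body
# gets its own scope, so `s += i` reads an undefined local (newer versions
# instead warn about the ambiguous "soft scope"):
#
#     s = 0
#     for i in 1:10
#         s += i
#     end

# Wrapping it in a function makes everything one local scope, which is also
# what you want for performance:
function total(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

total(10)   # 55
```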
TL;DR (and to make this post more relevant to the topic): learning something new comes at a time cost that not everyone is able to pay, and when something else already works for them, switching becomes almost impossible. So if one wants the kind of adoption this topic strives for, I think keeping people like me in mind (we might be in the audience) could be beneficial.
P.S. I still haven’t given up on the language. I still lurk around here, try to follow the developments, try new things from time to time, and hope to learn it well enough to find that “speed” and “ease”, but so far I feel I am well away from that.
Translating from English to Julia is effortless. That is valuable, IMHO. I am old and simply don’t have the patience for the kind of insane contortions (i.e. vectorisation) one needs to go through with the numerical-computation APIs (into C/C++) of other popular languages.
If you have a small example of a function that seems slow that you can share, people around here love to optimize.
Yes, I have been meaning to do that. I wrote one function that embodies pretty much every paper that I come across, and I have been meaning to share it for, what, half a year? I just haven’t come around to it, because I also want enough free time to incorporate people’s suggestions (which would hopefully lead me down a rabbit hole of new knowledge).
Oh no, I wrote a massive answer to that post but I deleted it accidentally…
Key message was: for me, MATLAB is currently the most ergonomic language, and I use Python, Julia, C, C#, and MATLAB. It’s mainly due to the awesome auto-completion and documentation, and everything works the same way, without needing to learn a DSL for every new package. There are disadvantages for sure, but many of them have already been solved (which means people have to stay up to date, as in every language) or are being worked on.
Your specific example has not been an issue since 2016, when local functions were introduced… And of course you can use lambdas, too.
I would love to see the same maturity in python and/or Julia, but for my daily work, there isn’t.
The toolboxes from MathWorks were always very nicely polished, but the third-party package management was either non-existent or terrible. It worked nicely when assembling the legos that MathWorks provided, if the correct pieces were available, but trying to integrate and maintain code from the File Exchange quickly became unsustainable. MEX files also introduced a new headache.
There’s a thousand things to say about this, but mainly:
No, this still doesn’t work. You have local functions in scripts when you run them as a whole from a saved file. You cannot define them from the command line, nor are they available when you run a script cell-by-cell. That covers the majority of my script usage.
Matlab functions need to live in a named file, that’s my complaint.
Unlike in Julia, lambdas in Matlab are limited to a single expression, which means they don’t solve my problem for multi-expression functions.
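For contrast, a Julia anonymous function can carry a full body; a small illustrative example:

```julia
# A MATLAB anonymous function is limited to a single expression; a Julia
# anonymous function can hold several:
f = x -> begin
    y = x^2      # intermediate step, not possible in a MATLAB lambda
    y + 1
end

f(3)   # 10
```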
You can. Ctrl+Enter in a cell can call local functions. That’s definitely true for live scripts, and IIRC for normal script files too (since the last big editor update).
Right, the file needs to be saved. But it does not have to be a function file; a script is sufficient.
I guess our workflows differ. I develop in a file, because of the better syntax highlighting and auto-completion (one of the killer features, as I mentioned), and because usually I’ll have to put the code into a script or function anyway. Parts of code are selected and run via F9, or cells via Ctrl+Enter. Quite similar to a notebook.
I get your point, function definition is not as flexible as in Julia. But there’s also no need (for me). And I do exploratory coding every day.
Right. There are workarounds for multiline lambdas, but that’s indeed not a strength of MATLAB. Local functions are usually sufficient. If I need the function to be callable from other scripts or the command line, I select the local function and click “refactor to new file”.
In a different thread maybe? I’d be really interested.
That’s nice, actually. Thanks. I definitely tried this, not that long ago, and it didn’t work. I also read the release notes for every new release, so I don’t know how I missed this.
Still, I don’t want to create a named file just to try out a small throwaway function.
Oh, I have many complaints. Maybe I can dig up one of the old threads later.
No profanity please! I understand that’s how people often speak conversationally in small groups, but this is a widely read list. Can I request you to edit your response to avoid the F word?
Then my initial post “hit” the wrong guy. I know too many people who state “this tool is much worse than this and that” without keeping themselves up to date. Which is kind of natural, I guess.
Happy I could help a bit
Yeah I understand that.
Of course, apologies. It was not meant to be aggressive or anything. Edited it.
Could you expand on that? What kind of pipelines, and what sort of data?
Hi, happy to expound.
For example, take a simple task of producing tabular reports summarizing orders data over the year. My weekly analysis product needs to be reusable, so my transforms are pipelines that land intermediate datasets in S3.
My preference is to stay in Julia as much as possible, from extract to finished output, which is several tabular DataFrames readable by the business team.
The company is big, data is big. A year of orders is approx. 30 TB of compressed Parquet files, partitioned by region, order_date, and product_line. My area of interest is one region and certain product lines, so my first task is filtering the working dataset down before applying the complicated transformations that serve my data consumers’ interests.
In Spark, this step is simple (let’s assume my team has already established good Spark configurations and I can spin up enough EMR instances to handle this step in reasonable time):
// provide the dataset root; Spark pushes the partition filters down before reading into memory
val df = spark.read.format("parquet").load("s3://our_datalake/orders_dataset/")
  .select($"order_date", $"region", $"product_line", $"col_1", $"col_2", $"col_etc")
  .filter($"region" === 1)
  .filter($"product_line" === 22 || $"product_line" === 40 || $"product_line" === 121)
  .filter($"order_date" >= to_date(lit("2022-01-01")))
  .filter( /* additional product filters */)
// now write our slimmed-down set to be accessible for detailed transforms
This first step is my main hurdle in Julia. The current Parquet-compatible Julia packages don’t support easy read/write operations with partitions. I first have to write Julia code that walks the file structure, discovers all the ‘*.parquet’ files, and eliminates objects whose prefixes don’t satisfy my predicate filters. So step one is building the list of files to read before passing it to my Parquet reader in Julia.
Similar effort is required on the write end of this filtering operation.
Recall that partitioned datasets on disk look like:
s3://our_datalake/orders_dataset/region=1/order_date=2022-01-01/product_line=22/part-00000.parquet
s3://our_datalake/orders_dataset/region=1/order_date=2022-01-01/product_line=40/part-00000.parquet
# and etc...
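A minimal sketch of that manual pruning step, using hypothetical predicate values matching the Spark example above (only the path filtering is shown; the actual Parquet reading is left out):

```julia
# Walk a hive-partitioned tree, parse the `key=value` directory segments,
# and keep only the *.parquet files whose partition values satisfy the
# predicates. ISO dates compare correctly as plain strings.

function partition_values(path::AbstractString)
    vals = Dict{String,String}()
    for seg in splitpath(path)
        if occursin('=', seg)
            k, v = split(seg, '='; limit = 2)
            vals[k] = v
        end
    end
    return vals
end

function prune(root::AbstractString; region, product_lines, min_date)
    keep = String[]
    for (dir, _, files) in walkdir(root), f in files
        endswith(f, ".parquet") || continue
        p = joinpath(dir, f)
        v = partition_values(p)
        get(v, "region", "") == region              || continue
        get(v, "product_line", "") in product_lines || continue
        get(v, "order_date", "") >= min_date        || continue
        push!(keep, p)
    end
    return keep
end

# files = prune("orders_dataset"; region = "1",
#               product_lines = ["22", "40", "121"], min_date = "2022-01-01")
```

This only covers a local mirror of the tree; against S3 itself the same prefix logic would run over object listings instead of `walkdir`.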
(Yes, I would like to publish a Julia package that abstracts this away, as in the Spark API; I just haven’t had the time.)
The first filtering/write operation gets transitioned to a daily job that reads the previous day’s orders and writes the filtered set to my working dataset.
Now I can start using Julia, but to be clear, I want to get rid of the preceding Spark step and work in Julia end-to-end.
At this point my dataset on S3 is ready to be read into memory as Julia DataFrames, and I can happily make all the transformations that fulfill the request. I still have to manage the file list on read, but it’s easier now that I’m reading every file instead of filtering.
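That last assembly step can be sketched like this, assuming DataFrames.jl is installed (the per-file tables are dummies here; in practice each would come from a Parquet reader such as Parquet2.jl):

```julia
using DataFrames

# One table per file; dummies stand in for the Parquet reader's output.
tables = [DataFrame(order_id = [1, 2], col_1 = [10.0, 20.0]),
          DataFrame(order_id = [3],    col_1 = [30.0])]

# Concatenate the per-file tables into one working DataFrame:
df = reduce(vcat, tables)
nrow(df)   # 3
```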