If Tuplex can do it, so can Julia!

I got curious about Tuplex as it was talked about at work, so I started to read some of its papers, and I found

Today’s data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily.

If you want to go down this route, wouldn’t Julia be the answer? So it seems the reason for Tuplex to exist is the existing Python user base. Otherwise, investing in Julia would’ve been the thing to do.

What are some components that are needed to make a Spark-like framework in Julia?

This has been discussed a bit on Slack, but I just want to note down a few things. To have a parallelized data framework, we need:

  • A cluster. A Kubernetes cluster will do the trick; I think Spark can already run on Kubernetes
  • A fault-tolerance abstraction. I am not sure if this is needed. Spark started off with RDDs; Tuplex is about tuples and exceptions. I need to read the Tuplex paper in full to understand what’s the deal there. This is the one I am most unsure about
  • A distributed dataframe that can do things like group-by and joins (see the sketch after this list)
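To make the last point concrete, here is a minimal sketch of what a distributed group-by could look like on top of the Distributed standard library and DataFrames.jl. It only illustrates the map-reduce structure (Dagger.jl’s DTable is an existing, more complete take on this idea), and the partitions here are synthetic stand-ins rather than data read from real sources:

```julia
# Minimal sketch: distributed group-by as a map-reduce over per-worker chunks.
# Assumes DataFrames.jl is installed; the chunks are synthetic stand-ins for
# partitions that would normally live on different machines.
using Distributed
addprocs(4)
@everywhere using DataFrames

# Synthetic partitions of a larger table.
chunks = [DataFrame(key = rand(1:3, 1_000), val = rand(1_000)) for _ in 1:4]

# Map step: each worker aggregates its own partition.
partials = pmap(chunks) do df
    combine(groupby(df, :key), :val => sum => :s, :val => length => :n)
end

# Reduce step: merge the partial aggregates on the driver.
result = combine(groupby(reduce(vcat, partials), :key),
                 :s => sum => :s, :n => sum => :n)
result.mean = result.s ./ result.n
```

A distributed join is harder: it needs a shuffle step that repartitions both tables by the join key, which is where a lot of the engineering in Spark-like systems goes.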

If this Julia product becomes successful, then there is practical stuff like how to read from all sorts of data sources etc. For now, an MVP with a minimal set of features should be sufficient.

I think the question is not how to implement it, but who actually needs it. Some time ago I asked here what people need from a distributed computation framework and got silence in response. Spark evolved into just a more flexible SQL database. Distributed machine learning is mostly concerned with multi-GPU training and has its own frameworks. UDFs are rare, and usually it’s easier to just implement them in the native language of the framework (e.g. in Java/Scala for Spark).

I think a feature-rich (dashboard, automatic job stealing) and reliable (handling Julia’s latency when loading system images and packages on worker machines with different architectures, running a separate daemon process on the host) Dask alternative would be sufficient for a lot of research applications where the core workload is a mapreduce.
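The mapreduce core itself is already easy to express with the Distributed standard library; the hard part is all the reliability and observability around it. A tiny sketch, where `simulate` is a hypothetical stand-in for an expensive, independent per-task computation:

```julia
using Distributed
addprocs(8)

# Hypothetical per-task workload; each call is independent of the others.
@everywhere function simulate(i)
    sum(abs2, randn(10^6)) / i
end

# Map across the workers, reduce with (+) on the driver.
total = @distributed (+) for i in 1:1_000
    simulate(i)
end
```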

A lot of the unnecessarily complex Python frameworks/orchestrators seem to me to exist 50% because it’s impossible to write half-decent, memory-efficient user functions in Python. That’s not a problem for Julia.


That’s a good question. Part of the market-fit research. I think “big data” was so hyped that people flocked to Hadoop. Then Spark just went, hey, you can do this in memory, and got 10x performance out of it. And people kept flocking to big data.

To me UDFs are huge. Also, there’s the problem of larger-than-RAM data. You need to solve that somehow.

If that’s true, then we totally need a Julia alternative.

SSD disk? If a dataset is smaller than 1 TB, it’s usually easier to process it on a single machine with a large disk. If it’s larger than 1 TB, it might be easier to use distributed computing. I used to work on projects that handled petabytes of data on a daily basis; distributed computing is still a must in such a context. But there aren’t really many companies and projects that need it, and very few of them intersect with the interests of the Julia community.
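For the single-machine, larger-than-RAM case, chunked reading already gets you quite far. A sketch using CSV.jl’s chunk iterator, streaming partial sums instead of materializing the whole table (the file `big.csv` and its `val` column are hypothetical):

```julia
# Out-of-core aggregation on a single machine: stream the file in chunks
# instead of loading it all into RAM.
using CSV, DataFrames

function streamed_mean(path)
    total, n = 0.0, 0
    for chunk in CSV.Chunks(path; ntasks = 16)
        df = DataFrame(chunk)   # only one chunk is in memory at a time
        total += sum(df.val)
        n += nrow(df)
    end
    return total / n
end

println("mean val = ", streamed_mean("big.csv"))
```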

I think these few projects at the intersection of Julia and really big data are actually in acute need of a convenient tool. But again, without “market-fit research” it’s hard to understand what such a tool should look like.