If Tuplex can do it, so can Julia!

I got curious about Tuplex as it was talked about at work, so I started to read some of its papers, and I found

Today’s data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily.

If you want to go down this route, wouldn’t Julia be the answer? So it seems the reason for Tuplex to exist is the existing Python user base. Otherwise, investing in Julia would’ve been the thing to do.

What are some components that are needed to make a Spark-like framework in Julia?

This has been discussed a bit on Slack, but I just want to note down a few things. To have a parallelized data framework, we need:

  • A cluster. A Kubernetes cluster will do the trick; I think Spark can already run on Kubernetes
  • A fault-tolerance abstraction. I am not sure if this is needed. Spark started off with RDDs; Tuplex is about tuples and exceptions. I need to read the Tuplex paper in full to understand what’s the deal there. This is the one I am most unsure about
  • A distributed dataframe that can do things like group-by and joins (see the sketch after this list)
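To make the last point concrete, here is a minimal sketch of what a distributed group-by could look like on top of the Distributed standard library and DataFrames.jl. It only illustrates the map-reduce structure (Dagger.jl’s DTable is an existing, more complete take on this idea), and the partitions here are synthetic stand-ins rather than data read from real sources:

```julia
# Minimal sketch: distributed group-by as a map-reduce over per-worker chunks.
# Assumes DataFrames.jl is installed; the chunks are synthetic stand-ins for
# partitions that would normally live on different machines.
using Distributed
addprocs(4)
@everywhere using DataFrames

# Synthetic partitions of a larger table.
chunks = [DataFrame(key = rand(1:3, 1_000), val = rand(1_000)) for _ in 1:4]

# Map step: each worker aggregates its own partition.
partials = pmap(chunks) do df
    combine(groupby(df, :key), :val => sum => :s, :val => length => :n)
end

# Reduce step: merge the partial aggregates on the driver.
result = combine(groupby(reduce(vcat, partials), :key),
                 :s => sum => :s, :n => sum => :n)
result.mean = result.s ./ result.n
```

A distributed join is harder: it needs a shuffle step that repartitions both tables by the join key, which is where a lot of the engineering in Spark-like systems goes.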

If this Julia product becomes successful, then there is practical stuff like how to read from all sorts of data sources etc. For now, an MVP with a minimal set of features should be sufficient.

I think the question is not how to implement it, but who actually needs it. Some time ago I asked here what people need from a distributed computation framework and got silence in response. Spark evolved into just a more flexible SQL database. Distributed machine learning is mostly concerned with multi-GPU training and has its own frameworks. UDFs are rare, and usually it’s easier to just implement them in the native language of the framework (e.g. in Java/Scala for Spark).

I think a feature-rich (dashboard, automatic job stealing) and reliable (handling Julia’s latency when loading system images and packages on worker machines with different architectures, running a separate daemon process on the host) Dask alternative would be sufficient for a lot of research applications where the core workload is a mapreduce.
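The mapreduce core itself is already easy to express with the Distributed standard library; the hard part is all the reliability and observability around it. A tiny sketch, where `simulate` is a hypothetical stand-in for an expensive, independent per-task computation:

```julia
using Distributed
addprocs(8)

# Hypothetical per-task workload; each call is independent of the others.
@everywhere function simulate(i)
    sum(abs2, randn(10^6)) / i
end

# Map across the workers, reduce with (+) on the driver.
total = @distributed (+) for i in 1:1_000
    simulate(i)
end
```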

A lot of the unnecessarily complex Python frameworks/orchestrators seem to me to exist 50% because it’s impossible to write half-decent, memory-efficient user functions in Python. That’s not a problem for Julia.


That’s a good question. Part of the market-fit research. I think “big data” was so hyped that people flocked to Hadoop. Then Spark just went, hey, you can do this in memory, and got 10x performance out of it. And people kept flocking to big data.

To me UDFs are huge. Also, there’s the problem of larger-than-RAM data. You need to solve that somehow.

If that’s true, then we totally need a Julia alternative.

SSD disk? If a dataset is smaller than 1 TB, it’s usually easier to process it on a single machine with a large disk. If it’s larger than 1 TB, it might be easier to use distributed computing. I used to work on projects that handled petabytes of data on a daily basis; distributed computing is still a must in such a context. But there aren’t really many companies and projects that need it, and very few of them intersect with the interests of the Julia community.
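For the single-machine, larger-than-RAM case, chunked reading already gets you quite far. A sketch using CSV.jl’s chunk iterator, streaming partial sums instead of materializing the whole table (the file `big.csv` and its `val` column are hypothetical):

```julia
# Out-of-core aggregation on a single machine: stream the file in chunks
# instead of loading it all into RAM.
using CSV, DataFrames

function streamed_mean(path)
    total, n = 0.0, 0
    for chunk in CSV.Chunks(path; ntasks = 16)
        df = DataFrame(chunk)   # only one chunk is in memory at a time
        total += sum(df.val)
        n += nrow(df)
    end
    return total / n
end

println("mean val = ", streamed_mean("big.csv"))
```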

I think these few projects at the intersection of Julia and really big data are actually in acute need of a convenient tool. But again, without “market-fit research” it’s hard to understand what such a tool should look like.