I got curious about Tuplex after it came up at work, so I started reading some of its papers, and I found this framing:
> Today’s data science pipelines often rely on user-defined functions (UDFs) written in Python. But interpreted Python code is slow, and Python UDFs cannot be compiled to machine code easily.
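To make the claim concrete, here is a toy sketch (my own illustration, not from the paper) of the kind of per-row Python UDF such pipelines run. Each call goes through the interpreter, which is what makes UDF-heavy pipelines slow and hard for an engine to optimize:

```python
rows = [(1, "a"), (2, "b"), (3, "c")]

def udf(row):
    # Arbitrary per-row logic the query engine cannot see into;
    # every call here pays full interpreter overhead.
    key, val = row
    return (key * 2, val.upper())

result = [udf(r) for r in rows]
# result == [(2, "A"), (4, "B"), (6, "C")]
```

Tuplex's pitch, as I understand it, is to compile exactly this style of lambda/UDF down to native code instead of interpreting it row by row.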
If you want to go down this route, wouldn’t Julia be the answer? So it seems the reason for Tuplex to exist is its user base: people already write UDFs in Python. Otherwise, investing in Julia would’ve been the thing to do.
This has been discussed a bit in Slack, but I just want to note down a few things. To have a parallelized data framework, we need:
- A cluster. A Kubernetes cluster will do the trick; I think Spark can run on Kubernetes.
- A fault-tolerance abstraction. I am not sure if this is needed, but Spark started off with RDDs, and Tuplex is about tuples and exceptions. I need to read the Tuplex paper in full to understand what the deal is there. This is the one I am unsure about.
- A distributed dataframe that can do things like group-bys and joins.
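The group-by piece is essentially a shuffle plus per-partition aggregation. A toy, single-process sketch of that idea (all names are illustrative, not from any real framework):

```python
from collections import defaultdict

def partition(rows, n_parts):
    # "Shuffle": hash-partition rows by key so that all rows
    # with the same key land in the same partition.
    parts = [[] for _ in range(n_parts)]
    for key, value in rows:
        parts[hash(key) % n_parts].append((key, value))
    return parts

def aggregate(part):
    # Per-partition sum by key; in a real framework this would
    # run independently on each worker.
    acc = defaultdict(int)
    for key, value in part:
        acc[key] += value
    return dict(acc)

def merge(partials):
    # Because a key never spans two partitions, merging the
    # partial results is a plain union.
    out = {}
    for p in partials:
        out.update(p)
    return out

rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
grouped = merge(aggregate(p) for p in partition(rows, 2))
# grouped == {"a": 4, "b": 7, "c": 4}
```

A distributed join works on the same skeleton: hash-partition both sides by the join key, then join each pair of co-located partitions locally.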
If this Julia product becomes successful, then there is practical stuff to sort out, like how to read from all sorts of data sources. For now, an MVP with a minimal set of features should be sufficient.