Why is Julia so great?

Can you use the Julia DuckDB package?
It seems to have all the necessary capabilities with its parquet integration.

Thanks for the feedback. Are you happy with your workflow in Julia, performance-wise?
Also, why choose Julia for this task?

I was thrilled when I saw the Julia API and that it could read parquet. However, the examples in the parquet data section show the same limitation: it can import many files from a directory, or it can import a list of files you provide.

The work still remains for me to:

  1. Identify all the paths to read in the hive directory structure
  2. Create column names in the dataframe/duckdb-table for each of the partitions from the directory names (i.e. files under order_date=2019-07-22 get an order_date column added to the table, with the string value 2019-07-22 for all records in that parquet file).

This is all supported in the Spark workflow, including types inferred from the directory name strings.
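
To make the type inference part concrete, here is a minimal sketch of how a partition value string could be turned into a typed value. The function name `infer_partition_value` and the order of the attempts (integer, then float, then ISO date, then string) are my own assumptions for illustration; they are not Spark's or DuckDB's actual inference rules.

```julia
using Dates

# Hypothetical rule: try the narrowest type first, fall back to the raw string.
function infer_partition_value(s::AbstractString)
    i = tryparse(Int, s)
    i !== nothing && return i
    f = tryparse(Float64, s)
    f !== nothing && return f
    d = tryparse(Date, s, dateformat"yyyy-mm-dd")
    d !== nothing && return d
    return String(s)
end

infer_partition_value("2019-07-22")  # a Date
infer_partition_value("north")       # stays a String
```

A real implementation would also have to make sure that all values of one partition column end up with the same type, which this per-value sketch does not attempt.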

I suppose the next time I need to do this I’ll start working on a Julia package that gathers up the file paths and keeps track of the columns and values that would get added to the resulting data frame.
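
A rough sketch of what the core of such a package might look like, using only Base functions. The name `hive_partition_files` is hypothetical, and the sketch assumes a local directory tree (`walkdir` does not work on S3) and keeps all partition values as strings.

```julia
# Hypothetical helper: collect every .parquet file under `root` together with
# the partition columns encoded in its `key=value` directory names.
function hive_partition_files(root::AbstractString)
    found = Tuple{String,Dict{String,String}}[]
    for (dir, _, files) in walkdir(root)
        # Partition values come from the path segments between root and dir.
        parts = Dict{String,String}()
        for segment in splitpath(relpath(dir, root))
            kv = split(segment, '='; limit=2)
            length(kv) == 2 && (parts[String(kv[1])] = String(kv[2]))
        end
        for file in files
            endswith(file, ".parquet") && push!(found, (joinpath(dir, file), copy(parts)))
        end
    end
    return found
end

# Small demonstration on a throwaway directory tree.
mktempdir() do root
    dir = joinpath(root, "region=1", "order_date=2019-07-22")
    mkpath(dir)
    touch(joinpath(dir, "part-0.parquet"))
    for (path, cols) in hive_partition_files(root)
        println(basename(path), " => ", cols)
    end
end
```

Each returned pair holds a file path and the column values that would have to be added to every record read from that file.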

Performance-wise, I like that Julia is fast, but it isn’t my main requirement. Data pipelines get scheduled and run in the background; of course, when workloads scale up, that performance pays off in compute costs.

For me, I love working in Julia because of the language ergonomics. It’s just a pleasure to work with, very well thought out. It makes me feel smarter; somehow I can read the Base library and understand it. Package management is great, I should add, and there aren’t 30 ways to do it like in Python. sbt in Scala is a nightmare somehow.

So yeah, Julia is my #1 choice for all programming tasks and this data engineering side really needs some work.

It also supports hive partitioning. The documentation is somewhat scrambled, but see here.

So in your case you could maybe do something like this?

SELECT
    d.order_date,
    d.region,
    d.product_line,
    d.col_1,
    d.col_2,
    d.col_etc
FROM read_parquet('s3://our_datalake/orders_dataset/region=*/order_date=*/product_line=*/*.parquet', HIVE_PARTITIONING = 1) as d
WHERE
    d.region = 1 AND
    d.product_line IN (22, 40, 121) AND
    d.order_date >= '2022-01-01'

As for the paths, it’s true that it is not quite the same as your Spark example. It seems like DuckDB does not support walking arbitrary directory structures. You have to know the directory structure, or at least the depth, beforehand. For example you could put .../*/*/*/*.parquet instead of region=*/order_date=*/product_line=*/*.parquet.

Awesome! I should have looked through the rest of the docs, I was focused on my Julia problems.

This solved 90% of my use cases, pending some concrete tests in my workflow. I can’t wait to try this!!!

Can anybody see potential problems with adding DuckDB.jl as a dependency to all my data pipeline jobs in Julia? I would probably create a utility function to wrap the SQL commands for read/write into a Julia function with arguments, but other than that it seems pretty good.
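
Such a wrapper could start out as a plain string builder, something like the sketch below. The function name `hive_parquet_query` and its arguments are hypothetical, and the sketch only assembles the query text; it does not call DuckDB itself.

```julia
# Hypothetical wrapper: build the read_parquet query text from Julia arguments.
# `condition` is pasted into the query verbatim, so it must come from trusted code.
function hive_parquet_query(root::AbstractString, partitions::Vector{String};
                            columns::Vector{String}=["*"],
                            condition::AbstractString="")
    # One `name=*` glob segment per partition level, then the files themselves.
    segments = vcat(String(root), [string(p, "=*") for p in partitions], "*.parquet")
    sql = "SELECT " * join(columns, ", ") *
          " FROM read_parquet('" * join(segments, "/") * "', hive_partitioning = true)"
    return isempty(condition) ? sql : sql * " WHERE " * condition
end

sql = hive_parquet_query("s3://our_datalake/orders_dataset",
                         ["region", "order_date", "product_line"];
                         columns=["order_date", "col_1"],
                         condition="region = 1")
println(sql)
# Assumption, not exercised here: with DuckDB.jl loaded and a connection `con`,
# the string would be passed to DBInterface.execute(con, sql).
```

Keeping the query construction separate from the execution makes the wrapper easy to test without a database connection.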

Edit:
I’ll probably need to take care with inferred column types: in d.order_date >= '2022-01-01', order_date may not be read as a date type, and the comparison could then fail as a string comparison. This requires attention in Spark as well, though, so it is a wash.

The DuckDB.jl package is not as heavily developed as the Python/R packages. Lack of attention is the only thing holding it back.

I agree with @tbeason about the Julia package, but in general I find it very usable. In fact I use it all the time for my data engineering workflows.

DuckDB itself is still version < 1, so that might be an issue depending on your requirements.

@merlin I agree with the sentiment that data pipelines in Julia are in general much less mature than, let’s say, what’s available in Python, for example seamlessly accessing parquet files from Azure Blob Storage / S3.

Having said that, DuckDB.jl solves a lot of the issues with parquet files, e.g. reading/querying parquet files directly from S3. For some reason I don’t see a lot of adoption of DuckDB in the Julia ecosystem. That is probably the reason its feature parity vis-a-vis Python has started to lag.

In defense of Matlab, there is usually only one way to write code. Yes, in most cases it is ugly, but only one option. However, in Julia you are overwhelmed with plenty of possibilities.

  1. In Julia you think of types and containers, and of different styles for their manipulation. Should I use structures / tuples / named tuples / dicts? In Matlab all numbers are double matrices, and all structs are dicts.

  2. In Julia, should I write for loops / map / broadcast (dot) / array comprehension?

  3. How should I organize code? Files are independent of modules, so I have to think about both. Modules can be nested, and I have to think about the module API and namespace pollution with import / using and so on. In Matlab there is a simpler code search mechanism for function-files.

  4. What package should I choose for plotting (Plots, Makie, etc.), for tables (native types with Tables, StructArrays, DataFrames, etc.), and so on.

  5. Also, it is a very interesting observation by @JanKap that for every new package in Julia you have to learn its DSL. In Matlab things are somehow easier: you just write functions that rely on basic data types and don’t have to learn 3rd-party structures, objects and so on.
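
For readers less familiar with Julia, the four styles from item 2 look like this on a toy example (squaring a vector; the variable names are made up for illustration). All four give the same result, which is exactly the choice being described.

```julia
xs = [1.0, 2.0, 3.0]

# 1. Explicit for loop into a preallocated array
squares_loop = similar(xs)
for i in eachindex(xs)
    squares_loop[i] = xs[i]^2
end

# 2. map with an anonymous function
squares_map = map(x -> x^2, xs)

# 3. Broadcasting with the dot syntax
squares_dot = xs .^ 2

# 4. Array comprehension
squares_comp = [x^2 for x in xs]

@assert squares_loop == squares_map == squares_dot == squares_comp
```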

Another tiny advantage: I find it way easier to use the Matlab command line, with its mouse-friendly cursor and line selection.

In general, what I am missing in Julia is some kind of enforced guideline that reduces code brittleness and frees my mind.

It is like someone who has only one finger, does everything with that finger, and says: ahhhha, it is so easy to do everything using only one finger because it is a no-brainer!

Actually, I feel quite the opposite: I have too many fingers and spend time thinking about which one I should use.

Then I am wondering how those words were typed. Do you use only one finger? What I am trying to convey is that flexibility enables possibilities. The more you practice, the better the chance that you will form your own style of coding in Julia and no longer find it confusing. Remember how you learned to type with your ten fingers? At first, it is really challenging and frustrating. But once you master it, you type much faster than with only one finger.

I looked at DuckDB to replace SQLite in some use cases, but their Julia package seems (1) not well-integrated, e.g. regarding Julia tables support, (2) heavy in terms of dependencies (48 deps, many of them nontrivial), and (3) as I understand it, doesn’t use basic Julia testing practices.
I didn’t even manage to run their tests when trying to improve (1).

Maybe they just don’t consider julia integration that important, idk.

It doesn’t seem like that to me. You have options in Matlab, too.

In Matlab you can use structs, classes, tables, dictionaries or arrays. You need to decide whether to make your code OOP or imperative. There are also now listeners in Matlab, which means a lot of code is reactive, with hooks and callback functions, etc. It’s a very different style from ‘old-fashioned matlab’.

(That’s aside from the fact that Matlab has both single, logical, int8, uint8, int16, uint16, etc. etc.)

Matlab: Loops, vectorized/bsxfun or arrayfun/cellfun/structfun. Matlab is jit-compiled, so loops are not quite as bad anymore.

In Matlab you can have a jumbled heap of files in folders and mess around with path, or you can create +packages, and deal with namespaces. Of course, there are also class hierarchies, which you must organize.

In Matlab the choice is straightforward. You sort of insinuate that every datatype needs its own plotting library in Julia, but multiple dispatch and plot recipes mean that is not a fair description.

I don’t understand this. Some packages are DSL-like, but the vast majority are not. You just use standard Julia syntax with regular functions. You need to learn each package’s API, but that’s exactly the same in Matlab.

You could also phrase that differently. Say you have an image (a matrix) where every pixel (entry of that matrix) is a matrix itself (this was my application). In Matlab you have to use 4D arrays, and depending on the situation you reshape/permute like crazy.
So there Matlab forces you to “squeeze” everything you have into matrices.

Julia allows you to use a data structure that fits, namely Matrix{Matrix{Float64}}. Which for me – having the same concept in code as in my mind (without permuting/reshaping for speed) – frees my mind.
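
As a small illustration of the difference (a toy sketch with made-up sizes, not the actual application): a 2×3 image of 2×2 pixel matrices, first as a matrix of matrices and then squeezed into a 4D array.

```julia
using LinearAlgebra

# A 2×3 "image" whose pixels are themselves 2×2 matrices.
image = [fill(Float64(i + j), 2, 2) for i in 1:2, j in 1:3]
@assert image isa Matrix{Matrix{Float64}}

# Per-pixel operations read like the concept: take the trace of every pixel.
traces = tr.(image)

# The 4D layout holds the same numbers, but a pixel is now a slice.
image4d = Array{Float64}(undef, 2, 2, 2, 3)
for i in 1:2, j in 1:3
    image4d[:, :, i, j] = image[i, j]
end
@assert image4d[:, :, 1, 2] == image[1, 2]
```

Which layout is faster depends on the access pattern; the point here is only that the first one matches the mental model directly.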

My personal experience with anything that is not a matrix (an actual data structure / object) in Matlab is that the code style is usually a bit messy. And then you most probably have to learn other people’s structures.

But where you are correct is that if you are very much used to one way of doing things, switching to a different one takes time.
That does not compare to “typing with one finger” (which is a bit mean), but more like:
For small changes – changing the Kezboard-Lazout (me switching from German to English La(y/z)out)
For larger ones – talking in another language. It takes time to get used to and only using / practicing it helps.

Fair enough. Julia is a young language with many features, with new features being added and sometimes discovered as a pleasant side effect of the language philosophy (I’m looking at you, Holy Traits). So there hasn’t been a lot of “standardization” yet. I’m still figuring things out; for example, I just had to learn that ‘Base.@invokelatest’ is required if I want to build a function on the fly and use it in the middle of another function.
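
For anyone who hasn’t run into this yet, a minimal sketch of the situation (the function name `build_and_call` is made up; the macro form assumes Julia 1.7 or later, and `Base.invokelatest(f, x)` is the function form):

```julia
# Build a function at runtime and call it from within the same function.
function build_and_call(x)
    f = eval(:(y -> 2y))  # defined in a newer "world" than the running caller
    # Calling f(x) directly here would throw a MethodError (world age problem).
    return Base.@invokelatest f(x)
end

result = build_and_call(21)
```

The call goes through dynamic dispatch, so it is slower than a normal call and is best kept out of hot loops.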

That being said, “Hands-On Design Patterns and Best Practices with Julia” was a really helpful book when I started building bigger software solutions. In addition, because so many Julia packages are written in Julia and are open source, it’s really easy to get a handle on what good Julia code is supposed to look like. Although it’s important to know that there are some style differences with high level and low level code (high level should be more flexible, and low level should be more performant).
