Struggling with Julia and large datasets

Generally good advice. Hmmm, perhaps it’s time to seriously start on my DiskFrames.jl project after I finish updating JDF.jl. It will be a Julia implementation of R’s {disk.frame} for working with large datasets.

JuliaDB.jl had lots of promise, but it kind of fell away, I think.

1 Like

If cross-Julia-version compatibility is not an issue at the moment, then JDF.jl works pretty well for me. I wrote it so I can save/load tabular data fast.
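For context, a minimal sketch of that save/load round trip with JDF.jl (the file name is just an illustration):

using JDF, DataFrames

df = DataFrame(a = 1:1_000_000, b = rand(1_000_000))
JDF.save("data.jdf", df)                # serialise the table to disk
df2 = DataFrame(JDF.load("data.jdf"))   # read it back into a DataFrame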

Hi,

If you haven’t done so already (I couldn’t find it), can I suggest that you do a package announcement for this? It’s pretty huge IMO, having just checked it. Including Arrow compatibility in your release is really impressive and forward-thinking.

Have you chained the SQL querying with normal Julia after the fact, i.e. query into a DataFrame, then act on it with normal Julia? How does that work with collect - is the code distributed “as normal” with Spark?

I may have a backend rewrite to do…

Regards,

Thanks! I did look into Spark, and it involves more setup and coding than I wanted to do for this prototype. What I liked about Drill is that you just point it at the filesystem and run queries; no real setup needed. As we move out of prototype and into something that needs real performance, we’ll have to re-evaluate the decisions made so far.

1 Like

Hi djholiver,

SparkSQL.jl is designed to make Julia the easiest and best way to use Apache Spark for tabular data. Your use case works, and there is a new tutorial on how to do that here:

Here’s a quick example:

using SparkSQL, DataFrames   # assumes `sprk` is an already-initialised Spark session

JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)         # move the Julia DataFrame to Spark
createOrReplaceTempView(onSpark, "julia_data")    # register it as a temp view
# `spark_data` is a table assumed to be already registered on the Spark side
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                        # bring the query results back to Julia
describe(results)

In the above example we created a Julia DataFrame, used it in a query on Apache Spark, brought the query results back to a Julia DataFrame and then used the describe function in Julia.

The SparkSQL.jl official documentation:

Hi sampope,

SparkSQL.jl is the easiest way to use Apache Spark. There is no need to learn Scala, Python, or Java. All you need to know is Julia and SQL. The SparkSQL.jl package supports queries on the file system too:

# assumes `sprk` is an already-initialised Spark session; 'yyyyMMdd' is the Spark pattern for year-month-day ('mm' would mean minutes)
file = sql(sprk, "SELECT _c0 AS TICKER, to_date(_c1, 'yyyyMMdd') AS TRADING_DATE, CAST(_c5 AS double) AS CLOSE FROM CSV.`/tmp/sampleData.csv`;");
createOrReplaceTempView(file, "stocks");
write = sql(sprk, "CREATE TABLE IF NOT EXISTS stocks USING PARQUET LOCATION '/tmp/stocks.PARQUET' AS SELECT * FROM stocks;")


The above code converts a CSV file to Parquet format in just three lines. The full tutorial can be found here:

As your requirements grow, SparkSQL.jl can help. The SparkSQL.jl package is designed to support many advanced features including Delta Lake. Delta Lake architecture is a best practice for multi-petabyte and trillion+ row datasets.
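As a rough sketch (assuming the Spark session was started with the Delta Lake package on its classpath; the table name and path are hypothetical), a Delta table can be created through the same sql interface:

delta = sql(sprk, "CREATE TABLE IF NOT EXISTS stocks_delta USING DELTA LOCATION '/tmp/stocks.delta' AS SELECT * FROM stocks;")
history = sql(sprk, "DESCRIBE HISTORY stocks_delta;")   # Delta Lake table history
toJuliaDF(history)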

To learn more visit the official project page of SparkSQL.jl:

2 Likes

If you haven’t discovered it yet, take a look at FLoops.jl and the @floop macro. It might be an easy way to get some parallel processing into your code.
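For example, a minimal sketch of a parallel reduction with @floop (just to show the shape; the loop body is arbitrary):

using FLoops

@floop for x in 1:1_000_000
    y = sin(x)
    @reduce(s += y)    # reduction combined safely across parallel tasks
end
s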

Here is a video of Prof. Alan Edelman doing a demo of FLoops:
Parallel processing on your own machine

One of the things I admire about Prof. Edelman is his enthusiasm for everything he does.

3 Likes

It’s three years later, but this may help others with similar problems.

First, I was a co-founder of Apache Drill and I love that it provided value for what you needed. That said, I would strongly suggest DuckDB for this kind of problem. The level of complexity for the setup is very low and it should be able to make use of the cores available on a single machine in a very efficient way. Drill pays a penalty for transferring data (as would Spark) because it is designed to work across a relatively large number of machines. Duck makes the best use of a single machine and often can provide performance equal to what Drill, Spark or Presto achieve on a dozen machines.

In addition, it is important to store your data in a columnar form that provides good compression. Parquet is the right choice there, but you should investigate sorting your data to maximize row to row coherence. Typically, this means sorting by data source, then by time. This can lead to massive compression (I have seen 300:1 on weather data) and that, in turn, leads to stunning performance for queries.
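As a hedged illustration of that sorting step with DuckDB from Julia (the file and column names here are made up):

using DuckDB

con = DBInterface.connect(DuckDB.DB, ":memory:")
DBInterface.execute(con, """
    COPY (
        SELECT * FROM read_csv_auto('raw_weather.csv')
        ORDER BY station_id, observed_at          -- data source first, then time
    ) TO 'weather_sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
""")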

Duck also integrates directly into Julia very nicely. If your final results are small enough to fit into memory, then Duck query results can become DataFrames with very little overhead.
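A minimal sketch of that round trip (the Parquet path and columns are hypothetical):

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB, ":memory:")
df = DataFrame(DBInterface.execute(con, """
    SELECT station_id, avg(temperature) AS avg_temp
    FROM read_parquet('weather_sorted.parquet')
    GROUP BY station_id
    ORDER BY station_id;
"""))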

10 Likes