Struggling with Julia and large datasets

Generally good advice. Hmm, perhaps it’s time to seriously start on my DiskFrames.jl project after I finish updating JDF.jl. It will be a Julia implementation of R’s {disk.frame} for working with large datasets.

JuliaDB.jl had lots of promise, but it kind of fell away, I think.

1 Like

If cross-Julia-version compatibility is not an issue for you at the moment, then JDF.jl works pretty well for me. I wrote it so I can save/load tabular data fast.
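
Roughly, usage looks like this (a minimal sketch, assuming the JDF.save/JDF.load API; see the JDF.jl README for the exact calls):

using DataFrames, JDF

df = DataFrame(a = 1:1_000_000, b = rand(1_000_000))   # example table

JDF.save("df.jdf", df)                # write the table to a JDF folder on disk
df2 = DataFrame(JDF.load("df.jdf"))   # load it back into a DataFrame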

Hi,

If you haven’t done so already (I couldn’t find it), can I suggest that you do a package announcement for this - it’s pretty huge IMO, having just checked it. Including Arrow compatibility in your release is really impressive and forward-thinking.

Have you chained the SQL querying with normal Julia after the fact, i.e. query into a DataFrame and then act on it with normal Julia? How does that work with collect - is the code distributed “as normal” with Spark?

I may have a backend rewrite to do…

Regards,

Thanks! I did look into Spark, and it involves more setup and coding than I wanted to do for this prototype. What I liked about Drill is that you just point it at the filesystem and run queries; no real setup needed. As we move out of prototype and into something that needs real performance, we’ll have to re-evaluate the decisions made so far.

1 Like

Hi djholiver,

SparkSQL.jl is designed to make the Julia language the easiest and best way to use Apache Spark for tabular data. Your use case works, and there is a new tutorial on how to do that here:

Here’s a quick example:

# `sprk` is an existing SparkSQL.jl session
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)        # copy the Julia DataFrame to a Spark Dataset
createOrReplaceTempView(onSpark, "julia_data")   # register it as a temp view for SQL
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)                       # bring the query results back into Julia
describe(results)

In the above example we created a Julia DataFrame, used it in a query on Apache Spark, brought the query results back to a Julia DataFrame and then used the describe function in Julia.
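
Since results is an ordinary Julia DataFrame at that point, you can keep working on it with plain DataFrames.jl code. A hedged sketch, assuming the spark_data table has a TICKER column as in the query above:

using DataFrames

crm = filter(:TICKER => ==("CRM"), results)   # ordinary DataFrames.jl filtering on the query results
nrow(crm)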

The official SparkSQL.jl documentation:

Hi sampope,

SparkSQL.jl is the easiest way to use Apache Spark. There is no need to learn Scala, Python, or Java. All you need to know is Julia and SQL. The SparkSQL.jl package supports queries on the file system too:

# query the CSV file directly from the filesystem; `sprk` is an existing SparkSQL.jl session
file = sql(sprk, "SELECT _c0 AS TICKER, to_date(_c1, 'yyyyMMdd') AS TRADING_DATE, CAST(_c5 AS double) AS CLOSE FROM CSV.`/tmp/sampleData.csv`;");
createOrReplaceTempView(file, "stocks");
# write the result out as a Parquet-backed table
write = sql(sprk, "CREATE TABLE IF NOT EXISTS stocks USING PARQUET LOCATION '/tmp/stocks.PARQUET' AS SELECT * FROM stocks;")


The above code converts a CSV file to Parquet format with just three SQL statements. The full tutorial can be found here:
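
As an aside (a hedged sketch, assuming the same file-query syntax also works for Parquet), the new Parquet data can be read back into Julia the same way:

# query the Parquet files directly, then bring the results into Julia
stocksParquet = sql(sprk, "SELECT * FROM PARQUET.`/tmp/stocks.PARQUET`;");
stocksDF = toJuliaDF(stocksParquet);
describe(stocksDF)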

As your requirements grow, SparkSQL.jl can help. The SparkSQL.jl package is designed to support many advanced features including Delta Lake. Delta Lake architecture is a best practice for multi-petabyte and trillion+ row datasets.

To learn more visit the official project page of SparkSQL.jl:

2 Likes

If you haven’t discovered it yet, take a look at FLoops.jl and the @floop macro. It’s perhaps an easy way to get some parallel processing into your code.
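
For instance, here is a minimal sketch of a parallel reduction (assuming FLoops.jl is installed; the function name is just for illustration):

using FLoops

# sum the elements of a collection in parallel with FLoops
function parallel_sum(xs)
    @floop for x in xs
        @reduce(s += x)   # combine per-task partial sums into s
    end
    return s
end

parallel_sum(1:1_000_000)   # should return 500000500000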

Here is a video of Prof. Alan Edelman doing a demo of FLoops:
Parallel processing on your own machine

One of the things I admire about Prof. Edelman is his enthusiasm for everything he does.

3 Likes