This post announces the release of SparkSQL.jl version 1.3.0. SparkSQL.jl is a package that lets developers use the Julia programming language with the Apache Spark data processing engine.
Apache Spark is one of the world’s most widely used open-source big data processing engines. Its distributed processing model lets it handle very large datasets. Apache Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.
Released in 2012, Julia is a modern programming language well suited to data science and machine learning workloads. Julia is designed for high performance and offers multiple dispatch, automatic differentiation, and a rich ecosystem of packages.
Use Case
SparkSQL.jl submits Structured Query Language (SQL), Data Manipulation Language (DML) and Data Definition Language (DDL) statements to Apache Spark. It provides functions to move data from Spark into Julia DataFrames and from Julia DataFrames into Spark.
SparkSQL.jl delivers advanced features such as dynamic horizontal autoscaling, which scales compute nodes to match workload requirements. The package supports structured and semi-structured data in data lakes and lakehouses (Delta Lake, Iceberg), on premises and in the cloud. To maximize Java Virtual Machine (JVM) performance, SparkSQL.jl brings support for the latest JDK 17 to Spark 3.2.0.
New features in this release:
- Julia version 1.7 support.
- DataFrames 1.3.0 support.
Install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Update from earlier releases of SparkSQL.jl via the Julia REPL:
] update SparkSQL
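For scripts or automated environments where the REPL's package mode is not available, the same two steps can be sketched with Julia's standard Pkg API. This is a minimal illustration rather than part of the official instructions, and it assumes network access to the package registry:

```julia
using Pkg

# First-time install; the functional equivalent of `] add SparkSQL`.
Pkg.add("SparkSQL")

# Upgrade an existing install; the functional equivalent of `] update SparkSQL`.
Pkg.update("SparkSQL")
```

Both calls resolve against the currently active environment, so activate the intended project first if you do not want to modify the default one.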
Example usage:
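using SparkSQL, DataFrames
initJVM()  # start the JVM before any Spark call
# Session setup is assumed here; the master URL and app name are placeholders.
sprk = SparkSession("spark://example.com:7077", "Julia SparkSQL Example App")
# Copy a Julia DataFrame to Spark and expose it to SQL as a temporary view.
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
# Assumes a table or view named spark_data with a TICKER column already exists in Spark.
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
results = toJuliaDF(query)  # bring the query result back as a Julia DataFrame
describe(results)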
To learn more, visit the official project page:
https://github.com/propelledanalytics/SparkSQL.jl
Official Tutorials and Project Blog: