This post announces the release of SparkSQL.jl version 1.2.0.
SparkSQL.jl is a Julia package that lets Julia programs work with Apache Spark using just SQL.
Apache Spark is one of the most widely used open-source big data processing engines. Its distributed architecture lets it process very large datasets. Spark runs on many platforms and hardware architectures, including those used by large enterprises and governments.
Released in 2012, Julia is a modern, high-performance programming language well suited to data science and machine learning workloads. It features multiple dispatch and a rich ecosystem of packages, including libraries for automatic differentiation.
SparkSQL.jl provides the functionality needed to use Apache Spark and Julia together on tabular data. With SparkSQL.jl, Julia can take the place of Python for data science and machine learning work on Spark.
New features in this release:
- Kubernetes support.
- Apache Spark 3.2.0 support.
- JDK 17 support for Spark 3.2.0 when running SparkSQL.jl on Kubernetes.
Install SparkSQL.jl via the Julia REPL:
] add SparkSQL
Update from earlier releases of SparkSQL.jl via the Julia REPL:
] update SparkSQL
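For scripts or automated environments where the interactive package prompt is not available, the same steps can be sketched with Julia's standard Pkg API. Pinning to version 1.2.0 is an illustrative choice matching this announcement, not a requirement.

```julia
using Pkg

# First-time install of the latest registered release.
Pkg.add("SparkSQL")

# Upgrade an existing installation to the newest compatible release.
Pkg.update("SparkSQL")

# Optional: request the exact release announced in this post.
Pkg.add(name = "SparkSQL", version = "1.2.0")
```

Only one of the three calls is needed in practice, depending on whether the package is already installed and whether an exact version is wanted.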
Example usage:
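using SparkSQL, DataFrames
# Assumes `sprk` is an active SparkSession and that spark_data (with a TICKER column) already exists on Spark.
JuliaDataFrame = DataFrame(tickers = ["CRM", "IBM"])
# Copy the Julia DataFrame to Spark and expose it to SQL as the view julia_data.
onSpark = toSparkDS(sprk, JuliaDataFrame)
createOrReplaceTempView(onSpark, "julia_data")
# Filter the Spark-side data by the tickers supplied from Julia.
query = sql(sprk, "SELECT * FROM spark_data WHERE TICKER IN (SELECT * FROM julia_data)")
# Bring the result back as a Julia DataFrame and summarize it.
results = toJuliaDF(query)
describe(results)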
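The example above relies on a Spark session named sprk and a Spark-side view named spark_data, neither of which it creates. A minimal sketch of that setup might look like the following. It assumes the initJVM and SparkSession calls described in the project README; the master URL, application name, CSV path, and PRICE column are hypothetical placeholders. Check the project page for the environment variables and settings your installation requires.

```julia
using SparkSQL, DataFrames

# Start the JVM before making any Spark call
# (assumption: initJVM as described in the project README).
SparkSQL.initJVM()

# Hypothetical master URL and application name; substitute your own cluster.
sprk = SparkSession("spark://example.com:7077", "Julia SparkSQL Example App")

# Hypothetical CSV file without a header row: Spark names the columns _c0, _c1, ...
# so they are aliased here to the TICKER column used by the example query.
sparkData = sql(sprk, "SELECT _c0 AS TICKER, _c1 AS PRICE FROM CSV.`/pathToFile/fileName.csv`")

# Register the dataset under the view name the example query expects.
createOrReplaceTempView(sparkData, "spark_data")
```

With sprk and spark_data in place, the usage example can run as written.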
Official Project Page:
https://github.com/propelledanalytics/SparkSQL.jl
To learn more, visit the project blog and tutorials page: