[ANN] Spark.jl, reborn

I’m pleased to announce a new release of Spark.jl, a Julia interface to Apache Spark.

Apache Spark is a ubiquitous distributed data processing framework used by thousands of organizations for large-scale data engineering, data science and machine learning. Its recent versions feature a powerful SQL / DataFrame engine as well as structured streaming with millisecond-level latency. Version 0.6 of Spark.jl brings these capabilities to Julia.

GitHub | Documentation

A few things you can do with Spark.jl:

Read/Write JSON, CSV, Parquet, ORC and other formats

using Spark

spark = SparkSession.builder.master("local").appName("Main").getOrCreate()

# read JSON & write Parquet
df = spark.read.json(joinpath(dirname(pathof(Spark)), "../test/data/people.json"))
df.write.parquet("people.parquet")

# read Parquet and write CSV
df = spark.read.parquet("people.parquet")
df.write.option("header", true).csv("people.csv")
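ORC and the remaining formats follow the same reader/writer pattern. A minimal sketch, assuming the orc methods mirror Spark's DataFrameReader / DataFrameWriter API the way parquet and csv do above:

# write ORC and read it back
# (assumption: orc() follows the same pattern as parquet()/csv() above)
df.write.orc("people.orc")
df = spark.read.orc("people.orc")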

Split-Apply-Combine

data = [
    ["red", "banana", 1, 10], ["blue", "banana", 2, 20], ["red", "carrot", 3, 30],
    ["blue", "grape", 4, 40], ["red", "carrot", 5, 50], ["black", "carrot", 6, 60],
    ["red", "banana", 7, 70], ["red", "grape", 8, 80]
]
sch = ["color string", "fruit string", "v1 long", "v2 long"]
df = spark.createDataFrame(data, sch)
# +-----+------+---+---+
# |color| fruit| v1| v2|
# +-----+------+---+---+
# |  red|banana|  1| 10|
# | blue|banana|  2| 20|
# |  red|carrot|  3| 30|
# | blue| grape|  4| 40|
# |  red|carrot|  5| 50|
# |black|carrot|  6| 60|
# |  red|banana|  7| 70|
# |  red| grape|  8| 80|
# +-----+------+---+---+

gdf = df.groupby("fruit")

gdf.mean("v1")
# +------+------------------+
# | fruit|           avg(v1)|
# +------+------------------+
# | grape|               6.0|
# |banana|3.3333333333333335|
# |carrot| 4.666666666666667|
# +------+------------------+

gdf.agg(min(df.v1), max(df.v2))

# +------+-------+-------+
# | fruit|min(v1)|max(v2)|
# +------+-------+-------+
# | grape|      4|     80|
# |banana|      1|     70|
# |carrot|      3|     60|
# +------+-------+-------+
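Since the API follows Spark's own DataFrame interface, the same aggregation can also be expressed in SQL. A minimal sketch, assuming createOrReplaceTempView and spark.sql are exposed the way they are in PySpark; it should produce the same table as above:

# register the DataFrame as a temporary view and query it with SQL
# (assumption: these methods mirror the PySpark API)
df.createOrReplaceTempView("fruits")
spark.sql("SELECT fruit, min(v1), max(v2) FROM fruits GROUP BY fruit").show()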

Real-time stream processing

Spark.jl also supports streaming data sources such as Kafka or plain network streams. To try the latter, start Netcat in a terminal and send a few lines of text:

nc -lk 9999
Julia was designed from the beginning for high performance
Julia is dynamically typed, feels like a scripting language

In Julia, read this stream and keep a running word count:

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.
    readStream.
    format("socket").
    option("host", "localhost").
    option("port", 9999).
    load()

# Split the lines into words
words = lines.select(
    lines.value.split(" ").explode().alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

query = wordCounts.
    writeStream.
    outputMode("complete").
    format("console").
    start()

query.awaitTermination()
# -------------------------------------------
# Batch: 0
# -------------------------------------------
# +----+-----+
# |word|count|
# +----+-----+
# +----+-----+
# 
# -------------------------------------------
# Batch: 1
# -------------------------------------------
# +-----------+-----+
# |       word|count|
# +-----------+-----+
# |        was|    1|
# |        for|    1|
# |  beginning|    1|
# |      Julia|    2|
# |         is|    1|
# |   designed|    1|
# |        the|    1|
# |       high|    1|
# |   language|    1|
# |       from|    1|
# |       like|    1|
# |  scripting|    1|
# |      feels|    1|
# |     typed,|    1|
# |performance|    1|
# |          a|    1|
# |           |    1|
# |dynamically|    1|
# +-----------+-----+
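awaitTermination() blocks until the query ends. To shut the example down cleanly instead of interrupting it, the query can be stopped explicitly; a sketch assuming Spark.jl mirrors Spark's StreamingQuery API:

# stop the streaming query and release its resources
# (assumption: stop() mirrors StreamingQuery.stop() in Spark)
query.stop()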

Not in this release

Introducing this new API required rewriting around 95% of the code. As you may guess, some features were lost along the way. Here’s what is not included in this release:

  • RDD interface. Although the RDD is still an important internal abstraction in Apache Spark, from a user perspective it has mostly been replaced by the DataFrame interface. Adding the RDD interface back is possible, but not currently planned.
  • Ability to run custom Julia code. Spark is actually a pretty terrible platform for running non-JVM workers. There’s some work towards Julia UDFs, but due to the multitude of use cases it’s almost impossible to create a single implementation that fits them all. However, we do plan to add limited support for a few principal scenarios, just not in this initial release. Also, for running completely custom Julia code on a cluster, these days I tend to suggest Kubernetes instead of Spark. In any case, ideas and use cases on the topic are welcome.
  • Integrations with other Julia packages, e.g. Arrow.jl and DataFrames.jl. Hopefully, we will get them back soon.