[ANN] Spark.jl, reborn

I’m pleased to announce a new release of Spark.jl, a Julia interface to Apache Spark.

Apache Spark is a ubiquitous distributed data processing framework used by thousands of organizations for large-scale data engineering, data science and machine learning. Its recent versions feature a powerful SQL / DataFrame engine as well as structured streaming with millisecond-level latency. Version 0.6 of Spark.jl brings these capabilities to Julia.

GitHub | Documentation

A few things you can do with Spark.jl:

Read/Write JSON, CSV, Parquet, ORC and other formats

using Spark

spark = SparkSession.builder.master("local").appName("Main").getOrCreate()

# read JSON & write Parquet
df = spark.read.json(joinpath(dirname(pathof(Spark)), "../test/data/people.json"))
df.write.parquet("people.parquet")

# read Parquet and write CSV
df = spark.read.parquet("people.parquet")
df.write.option("header", true).csv("people.csv")
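ORC and the remaining formats follow the same reader/writer pattern. A minimal sketch, assuming the orc methods mirror Spark's DataFrameReader / DataFrameWriter API the way parquet and csv do above:

# write ORC and read it back
# (assumption: orc() follows the same pattern as parquet()/csv() above)
df.write.orc("people.orc")
df = spark.read.orc("people.orc")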

Split-Apply-Combine

data = [
    ["red", "banana", 1, 10], ["blue", "banana", 2, 20], ["red", "carrot", 3, 30],
    ["blue", "grape", 4, 40], ["red", "carrot", 5, 50], ["black", "carrot", 6, 60],
    ["red", "banana", 7, 70], ["red", "grape", 8, 80]
]
sch = ["color string", "fruit string", "v1 long", "v2 long"]
df = spark.createDataFrame(data, sch)
# +-----+------+---+---+
# |color| fruit| v1| v2|
# +-----+------+---+---+
# |  red|banana|  1| 10|
# | blue|banana|  2| 20|
# |  red|carrot|  3| 30|
# | blue| grape|  4| 40|
# |  red|carrot|  5| 50|
# |black|carrot|  6| 60|
# |  red|banana|  7| 70|
# |  red| grape|  8| 80|
# +-----+------+---+---+

gdf = df.groupby("fruit")

gdf.mean("v1")
# +------+------------------+
# | fruit|           avg(v1)|
# +------+------------------+
# | grape|               6.0|
# |banana|3.3333333333333335|
# |carrot| 4.666666666666667|
# +------+------------------+

gdf.agg(min(df.v1), max(df.v2))

# +------+-------+-------+
# | fruit|min(v1)|max(v2)|
# +------+-------+-------+
# | grape|      4|     80|
# |banana|      1|     70|
# |carrot|      3|     60|
# +------+-------+-------+
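Since the API follows Spark's own DataFrame interface, the same aggregation can also be expressed in SQL. A minimal sketch, assuming createOrReplaceTempView and spark.sql are exposed the way they are in PySpark; it should produce the same table as above:

# register the DataFrame as a temporary view and query it with SQL
# (assumption: these methods mirror the PySpark API)
df.createOrReplaceTempView("fruits")
spark.sql("SELECT fruit, min(v1), max(v2) FROM fruits GROUP BY fruit").show()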

Real-time stream processing

Spark.jl also supports streaming data sources such as Kafka or plain network streams. To try the latter, start Netcat in a terminal and send a few lines of text:

nc -lk 9999
Julia was designed from the beginning for high performance
Julia is dynamically typed, feels like a scripting language

In Julia, read this stream and keep a running word count:

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.
    readStream.
    format("socket").
    option("host", "localhost").
    option("port", 9999).
    load()

# Split the lines into words
words = lines.select(
    lines.value.split(" ").explode().alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

query = wordCounts.
    writeStream.
    outputMode("complete").
    format("console").
    start()

query.awaitTermination()
# -------------------------------------------
# Batch: 0
# -------------------------------------------
# +----+-----+
# |word|count|
# +----+-----+
# +----+-----+
# 
# -------------------------------------------
# Batch: 1
# -------------------------------------------
# +-----------+-----+
# |       word|count|
# +-----------+-----+
# |        was|    1|
# |        for|    1|
# |  beginning|    1|
# |      Julia|    2|
# |         is|    1|
# |   designed|    1|
# |        the|    1|
# |       high|    1|
# |   language|    1|
# |       from|    1|
# |       like|    1|
# |  scripting|    1|
# |      feels|    1|
# |     typed,|    1|
# |performance|    1|
# |          a|    1|
# |           |    1|
# |dynamically|    1|
# +-----------+-----+
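awaitTermination() blocks until the query ends. To shut the example down cleanly instead of interrupting it, the query can be stopped explicitly; a sketch assuming Spark.jl mirrors Spark's StreamingQuery API:

# stop the streaming query and release its resources
# (assumption: stop() mirrors StreamingQuery.stop() in Spark)
query.stop()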

Not in this release

Introducing this new API required rewriting around 95% of the code. As you may guess, some features were lost along the way. Here’s what is not included in this release:

  • RDD interface. Although the RDD is still an important internal abstraction in Apache Spark, from a user perspective it has mostly been replaced by the DataFrame interface. Adding the RDD interface back is possible, but not currently planned.
  • Ability to run custom Julia code. Spark is actually a pretty terrible platform for running non-JVM workers. There’s some work towards Julia UDFs, but due to the multitude of use cases it’s almost impossible to create a single implementation that fits them all. However, we do plan to add limited support for a few principal scenarios, just not in this initial release. Also, for running completely custom Julia code on a cluster, these days I tend to suggest Kubernetes instead of Spark. In any case, ideas and use cases on the topic are welcome.
  • Integrations with other Julia packages, e.g. Arrow.jl and DataFrames.jl. Hopefully, we will get them back soon.