I’m pleased to announce a new release of Spark.jl, a Julia interface to Apache Spark.
Apache Spark is a ubiquitous distributed data processing framework used by thousands of organizations for large-scale data engineering, data science and machine learning. Its recent versions feature a powerful SQL / DataFrame engine, as well as structured streaming with millisecond-level latency. Version 0.6 of Spark.jl brings these capabilities to Julia.
A few things you can do with Spark.jl:
Read/Write JSON, CSV, Parquet, ORC and other formats
using Spark
spark = SparkSession.builder.master("local").appName("Main").getOrCreate()
# read JSON & write Parquet
df = spark.read.json(joinpath(dirname(pathof(Spark)), "../test/data/people.json"))
df.write.parquet("people.parquet")
# read Parquet and write CSV
df = spark.read.parquet("people.parquet")
df.write.option("header", true).csv("people.csv")
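The same reader/writer pattern should extend to the other formats mentioned above, e.g. ORC. A minimal sketch, assuming Spark.jl exposes orc() on the reader and writer the same way PySpark does (not verified against this exact release):
# write ORC and read it back
df.write.orc("people.orc")
df = spark.read.orc("people.orc")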
Split-Apply-Combine
data = [
["red", "banana", 1, 10], ["blue", "banana", 2, 20], ["red", "carrot", 3, 30],
["blue", "grape", 4, 40], ["red", "carrot", 5, 50], ["black", "carrot", 6, 60],
["red", "banana", 7, 70], ["red", "grape", 8, 80]
]
sch = ["color string", "fruit string", "v1 long", "v2 long"]
df = spark.createDataFrame(data, sch)
# +-----+------+---+---+
# |color| fruit| v1| v2|
# +-----+------+---+---+
# | red|banana| 1| 10|
# | blue|banana| 2| 20|
# | red|carrot| 3| 30|
# | blue| grape| 4| 40|
# | red|carrot| 5| 50|
# |black|carrot| 6| 60|
# | red|banana| 7| 70|
# | red| grape| 8| 80|
# +-----+------+---+---+
gdf = df.groupby("fruit")
gdf.mean("v1")
# +------+------------------+
# | fruit| avg(v1)|
# +------+------------------+
# | grape| 6.0|
# |banana|3.3333333333333335|
# |carrot| 4.666666666666667|
# +------+------------------+
gdf.agg(min(df.v1), max(df.v2))
# +------+-------+-------+
# | fruit|min(v1)|max(v2)|
# +------+-------+-------+
# | grape| 4| 80|
# |banana| 1| 70|
# |carrot| 3| 60|
# +------+-------+-------+
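The same aggregation can also be expressed in plain SQL through the engine mentioned earlier. A minimal sketch, assuming createOrReplaceTempView and spark.sql are passed through as in PySpark:
# register the DataFrame as a temporary view and aggregate it in SQL
df.createOrReplaceTempView("fruits")
spark.sql("SELECT fruit, min(v1), max(v2) FROM fruits GROUP BY fruit").show()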
Real-time stream processing
Spark.jl also supports streaming data sources such as Kafka (see the sketch after the word-count example below) and plain network streams. To try the latter, start Netcat in a terminal and send a few lines of text:
nc -lk 9999
Julia was designed from the beginning for high performance
Julia is dynamically typed, feels like a scripting language
In Julia, read this stream and keep the running word count:
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark.
readStream.
format("socket").
option("host", "localhost").
option("port", 9999).
load()
# Split the lines into words
words = lines.select(
lines.value.split(" ").explode().alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
query = wordCounts.
writeStream.
outputMode("complete").
format("console").
start()
query.awaitTermination()
# -------------------------------------------
# Batch: 0
# -------------------------------------------
# +----+-----+
# |word|count|
# +----+-----+
# +----+-----+
#
# -------------------------------------------
# Batch: 1
# -------------------------------------------
# +-----------+-----+
# | word|count|
# +-----------+-----+
# | was| 1|
# | for| 1|
# | beginning| 1|
# | Julia| 2|
# | is| 1|
# | designed| 1|
# | the| 1|
# | high| 1|
# | language| 1|
# | from| 1|
# | like| 1|
# | scripting| 1|
# | feels| 1|
# | typed,| 1|
# |performance| 1|
# | a| 1|
# | | 1|
# |dynamically| 1|
# +-----------+-----+
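For Kafka, only the source definition changes. A hedged sketch, assuming the standard Spark Kafka source options (kafka.bootstrap.servers, subscribe) and selectExpr are passed through unchanged, with a hypothetical broker on localhost:9092 and a hypothetical topic name "words":
# subscribe to the "words" topic (hypothetical broker and topic)
lines = spark.
readStream.
format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("subscribe", "words").
load()
# Kafka records carry binary key/value pairs, so cast the value to a string first
words_src = lines.selectExpr("CAST(value AS STRING) AS value")
# the rest of the word-count pipeline stays the same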
Not in this release
Introducing this new API required rewriting around 95% of the code. As you may guess, some features were lost along the way. Here’s what is not included in this release:
- RDD interface. Although the RDD is still an important internal abstraction in Apache Spark, from a user perspective it has mostly been replaced by the DataFrame interface. Adding the RDD API back is possible, but not currently planned.
- Ability to run custom Julia code. Spark is actually a pretty terrible platform for running non-JVM workers. There is some work towards Julia UDFs, but due to the multitude of use cases it’s almost impossible to create a single implementation that fits them all. We do plan to add limited support for a few principal scenarios, just not in this initial release. For running completely custom Julia code on a cluster, nowadays I tend to suggest Kubernetes instead of Spark. In any case, ideas and use cases on the topic are welcome.
- Integrations with other Julia packages, e.g. Arrow.jl and DataFrames.jl. Hopefully, we will get them back soon.