I don’t have PostgreSQL at hand to test it, but it should look something like this (assuming you have already installed Spark.jl):
- Edit `jvm/sparkjl/pom.xml` and add the following to the `dependencies` section:
```xml
<!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.1.4</version>
</dependency>
```
In the Java world, `pom.xml` is the single place where you declare all your dependencies. Since a basic Spark installation doesn’t include the PostgreSQL driver, we need to add it to the Java CLASSPATH. There are other ways to do it, but I find this one pretty simple for basic use cases.
- Run `Pkg.build("Spark")` for the changes to take effect.
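On recent Julia versions `Pkg` is a standard library that has to be loaded first, so the full invocation from the REPL looks like this:

```julia
using Pkg            # not needed on Julia 0.6, where Pkg is always in scope
Pkg.build("Spark")   # rebuilds the JVM side, picking up the new pom.xml entry
```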
- Create a `SparkSession`:
```julia
using Spark
Spark.init()
sess = SparkSession()   # uses "local" master
```
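If you want to point the session at a real cluster instead of local mode, `SparkSession` took keyword arguments for the master URL and application name in the Spark.jl versions I remember; treat the exact keyword names below as an assumption and check the Spark.jl source for your version:

```julia
# assumption: `master` and `appname` keyword arguments, as in older Spark.jl;
# verify against your installed version before relying on this
sess = SparkSession(master="spark://master-host:7077", appname="PostgresExample")
```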
- Read a Spark `Dataset` using the JDBC format:
```julia
options = Dict(
    "url" => "jdbc:postgresql:dbserver",
    "dbtable" => "schema.tablename",
    "user" => "username",
    "password" => "password")

df = read_df(sess, ""; format="jdbc", options=options)
```
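Note that the short `jdbc:postgresql:dbserver` form assumes the database lives on localhost; the URL can also be spelled out with an explicit host, port and database name (all the names below are placeholders for your own setup):

```julia
# placeholders: adjust host, port, database, schema and table to your setup
options = Dict(
    "url" => "jdbc:postgresql://dbhost:5432/mydb",
    "dbtable" => "public.mytable",
    "user" => "username",
    "password" => "password")
```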
Converting a Spark Dataset / DataFrame to a Julia DataFrame isn’t supported out of the box yet, but you can:
- export the Spark dataset to CSV and read it with DataFrames.jl
- call `collect(spark_df)` to get a list of rows and then build a Julia DataFrame from them (see the sketch below)
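For the second option, a rough sketch could look something like this; I’m assuming here that `collect` returns an array of row-like objects indexable by position, and the column names are made up for the example:

```julia
using DataFrames

# assumption: collect(df) returns an array of rows whose fields can be
# accessed by index; check what Spark.jl actually returns in your version
rows = collect(df)

# "id" and "name" are hypothetical column names, purely for illustration
julia_df = DataFrame(
    id   = [row[1] for row in rows],
    name = [row[2] for row in rows])
```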
Issues on GitHub are also welcome. The Spark API is really huge, so instead of randomly implementing parts of it I expect users of Spark.jl to create issues so that I can prioritize and plan the work.