I don’t have PostgreSQL at hand to test it, but it should look something like this (assuming you have already installed Spark.jl):
- Edit `jvm/sparkjl/pom.xml` and add the following to the `dependencies` section:
```xml
<!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.1.4</version>
</dependency>
```
In the Java world, `pom.xml` is the single place where you declare all your dependencies. Since the basic Spark installation doesn’t ship with the PostgreSQL driver, we need to add it to the Java CLASSPATH ourselves. There are other ways to do it, but I find this one pretty simple for basic use cases.
- Run `Pkg.build("Spark")` for changes to take effect.
- Create a `SparkSession`:
```julia
using Spark
Spark.init()
sess = SparkSession()   # uses "local" master
```
- Read a Spark `Dataset` using the JDBC format:
```julia
options = Dict(
    "url" => "jdbc:postgresql:dbserver",
    "dbtable" => "schema.tablename",
    "user" => "username",
    "password" => "password")
df = read_df(sess, ""; format="jdbc", options=options)
```
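The `url` and `dbtable` values above are placeholders. For a real connection the JDBC URL usually spells out host, port and database name; here is a hedged example with made-up host, database and table names:

```julia
options = Dict(
    "url"      => "jdbc:postgresql://localhost:5432/mydb",   # host:port/database (hypothetical)
    "dbtable"  => "public.users",                            # schema.table (hypothetical)
    "user"     => "username",
    "password" => "password")
df = read_df(sess, ""; format="jdbc", options=options)
```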
Converting a Spark Dataset / DataFrame to a Julia DataFrame isn’t supported out of the box yet, but you can:
- export the Spark dataset to CSV and read it back with DataFrames.jl
- call `collect(spark_df)` to get a list of rows and then build a Julia `DataFrame` from them (see the sketch below)
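Here is a rough, untested sketch of the second option. It assumes that `collect` on the Spark dataset returns an iterable of rows whose values can be indexed by column position, and that you know the column names up front (the `id`/`name` columns below are made up):

```julia
using DataFrames

# `df` is the Spark dataset obtained from `read_df` above
rows = collect(df)   # materializes all rows on the driver, so only do this for small tables

# Build a Julia DataFrame column by column (column names are hypothetical)
julia_df = DataFrame(id   = [row[1] for row in rows],
                     name = [row[2] for row in rows])
```

For the CSV route, write the dataset out from Spark and read the file back with CSV.jl / DataFrames.jl (the exact reading call depends on your CSV.jl version).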
Issues on GitHub are also welcome. The Spark API is really huge, so instead of randomly implementing parts of it, I expect users of Spark.jl to create issues so I can prioritize and plan them.