Importing big data

I don’t have PostgreSQL at hand to test this, but it should look something like the following (assuming you have already installed Spark.jl):

  1. Edit jvm/sparkjl/pom.xml and add the following to dependencies section:
<!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.1.4</version>
</dependency>

In the Java world, pom.xml is the single place where you put all your dependencies. Since a basic Spark installation doesn’t include the PostgreSQL driver, we need to add it to the Java CLASSPATH. There are other ways to do it, but I find this one pretty simple for basic use cases.

  2. Run Pkg.build("Spark") for changes to take effect.

  3. Create SparkSession:

using Spark
Spark.init()
sess = SparkSession()  # uses "local" master

  4. Read Spark’s Dataset using JDBC format:
options = Dict(
    "url" => "jdbc:postgresql:dbserver",
    "dbtable" => "schema.tablename",
    "user" => "username",
    "password" => "password")
df = read_df(sess, ""; format="jdbc", options=options)

Converting a Spark Dataset / DataFrame to a Julia DataFrame isn’t supported out of the box yet, but you can:

  • export Spark dataset to CSV and read it from DataFrames.jl
  • call collect(spark_df) to get a list of rows and then build a Julia DataFrame
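
For the second option, here is a minimal sketch of building a Julia DataFrame from collected rows. It assumes that collect(spark_df) gives you something like a vector of tuples, which I haven’t verified, and that you supply the column names yourself. rows_to_dataframe is a hypothetical helper, not part of Spark.jl, and the sample rows stand in for real query results:

```julia
using DataFrames

# Hypothetical helper (not part of Spark.jl): build a Julia DataFrame
# from a vector of row tuples plus the column names.
function rows_to_dataframe(rows::AbstractVector, colnames::Vector{Symbol})
    cols = Tuple(colnames)
    # Wrap each row in a NamedTuple; DataFrames.jl accepts a vector of
    # NamedTuples as a row-oriented table.
    return DataFrame([NamedTuple{cols}(Tuple(row)) for row in rows])
end

# Stand-in for `collect(spark_df)`; the real return shape is an assumption.
rows = [(1, "alice"), (2, "bob")]
jdf = rows_to_dataframe(rows, [:id, :name])
```

Keep in mind that collecting pulls every row into the driver’s memory, so this only makes sense for results that fit on one machine.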

Issues on GitHub are also welcome. The Spark API is really huge, so instead of implementing parts of it at random, I expect users of Spark.jl to create issues so that I can prioritize and plan them.


At some point, when the data ecosystem stabilizes a bit, I’d be happy to make PRs for DataFrames support (I’m still waiting for the latest release to be tagged).
