Importing big data

I don’t have PostgreSQL at hand to test it, but it should look something like this (assuming you have already installed Spark.jl):

  1. Edit jvm/sparkjl/pom.xml and add the following to the dependencies section:
<!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>42.1.4</version>
</dependency>

In the Java world, pom.xml is the single place where you declare all your dependencies. Since a basic Spark installation doesn’t ship with the PostgreSQL driver, we need to add it to the Java CLASSPATH. There are other ways to do it, but I find this one pretty simple for basic use cases.

  2. Run Pkg.build("Spark") for the changes to take effect.
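On Julia 0.7 and later you need to load Pkg first (a small sketch; per the step above, the build picks up the new dependency):

using Pkg            # not needed on Julia 0.6 and earlier
Pkg.build("Spark")   # rebuilds the JVM part of Spark.jl, fetching the new driver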

  3. Create a SparkSession:

using Spark
Spark.init()
sess = SparkSession()  # uses "local" master

  4. Read a Spark Dataset using the JDBC format:
options = Dict(
    "url" => "jdbc:postgresql:dbserver",
    "dbtable" => "schema.tablename",
    "user" => "username",
    "password" => "password")
df = read_df(sess, ""; format="jdbc", options=options)
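If Spark doesn’t pick up the driver automatically, you can name the driver class explicitly; Spark’s JDBC source accepts a "driver" option for this (the class below ships with the postgresql artifact added to pom.xml above):

options["driver"] = "org.postgresql.Driver"  # JDBC driver class to load for this URL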

Converting a Spark Dataset / DataFrame to a Julia DataFrame isn’t supported out of the box yet, but you can:

  • export the Spark dataset to CSV and read it back with DataFrames.jl
  • call collect(spark_df) to get a list of rows and then build a Julia DataFrame from them (see the sketch below)
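
For the second option, a minimal sketch could look like this (assuming collect returns an iterable of rows that can be indexed by column position; check the exact return type in your version of Spark.jl, and note that id and name are placeholder column names):

using DataFrames

rows = collect(df)   # df is the Dataset from read_df above; pulls all rows to the driver

# build columns by position; id and name are hypothetical columns of schema.tablename
julia_df = DataFrame(
    id   = [row[1] for row in rows],
    name = [row[2] for row in rows])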

Issues on GitHub are also welcome. The Spark API is really huge, so instead of randomly implementing parts of it, I expect users of Spark.jl to create issues so I can prioritize and plan them.
