Setting up Julia on Spark on AWS EMR

Hi folks,

I want to use Julia on Spark on AWS EMR, as this is one of the requirements at my workplace.
I tried to follow the steps from Julia & Spark

but I was not able to succeed. If anyone has implemented this on AWS EMR, could you please help me or document the steps?

For reference, I am documenting below how I was able to initialize a Spark session.

    I followed your steps on the EMR master node to download the binaries and use the REPL.

    # Download the Julia binaries
    wget -S -T 10 -t 5 https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.1-linux-x86_64.tar.gz

    # Extract the binaries (the tarball unpacks to ~/julia-1.6.1)
    tar xvfz julia-1.6.1-linux-x86_64.tar.gz -C ~
    cd ~/julia-1.6.1

    # Install Maven, which is needed to build the Spark package for Julia
    sudo yum install -y maven

    # Set required environment variables
    export HADOOP_HOME=/usr/lib/hadoop
    export SPARK_HOME=/usr/lib/spark/
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    # Start Julia REPL
    ./bin/julia

    > using Pkg
    > Pkg.add("Spark")
    > using Spark
    > Spark.init()
    > sc = SparkContext(master="yarn")

The above does give me a YARN application and a Spark session, but as soon as I try to execute some code it throws log4j-related library mismatch errors.

Thanks and Regards,
Sumit Malbari
617-955-3382
malbarisumit@gmail.com

You can specify alternative versions with SPARK_VERSION, SCALA_VERSION and SCALA_BINARY_VERSION environment variables, e.g.:

julia> ENV["SPARK_VERSION"] = "2.5.0"
...
pkg> build Spark
...

Versions of other components can also be modified using Maven properties (see the list of properties and how to override them), but you will need to clone the Spark.jl repo and build it manually from the command line for this to work.
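For example, with a clone of the repo the manual build would look roughly like this (a sketch; the exact property names come from the project's pom.xml, so treat the ones below as illustrative):

# from a clone of Spark.jl, in the Maven project subdirectory
cd Spark.jl/jvm/sparkjl
mvn clean package -Dspark.version=2.4.7 -Dscala.version=2.11.12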

Hi dfdx,

Launched EMR cluster version EMR-5.33.0 with Spark selected.

I have tried both versions of Julia listed below
* wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6-latest-linux-x86_64.tar.gz
* wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4-latest-linux-x86_64.tar.gz

    - Extract the archive to the home directory
            tar xvfz julia-1.6-latest-linux-x86_64.tar.gz -C ~   # or the 1.4 tarball, matching whichever was downloaded

    - Install Maven, needed to add the Spark package to Julia
            sudo yum install -y maven

    - Set environment variables as listed below
            export HADOOP_HOME=/usr/lib/hadoop
            export SPARK_HOME=/usr/lib/spark/
            export HADOOP_CONF_DIR=/etc/hadoop/conf
            export JULIA_COPY_STACKS=1

    - Create the Julia startup configuration directory to set environment variables for Spark
            mkdir -p ~/.julia/config/

    - Add the following values to "~/.julia/config/startup.jl"
            ENV["SPARK_VERSION"] = "2.4.7"
            ENV["YARN_VERSION"] = "2.10.1"
            ENV["HADOOP_VERSION"] = "2.10.1"

    - Start Julia REPL
            ./julia-1.x.x/bin/julia

    - REPL
            julia> using Pkg
            julia> Pkg.add("Spark")
            julia> Pkg.build("Spark")   # I was not able to run build before adding the package above
            julia> using Spark
            julia> Spark.init()
            julia> sc = SparkContext(master="yarn")
            julia> text = parallelize(sc, ["hello world", "the world is one", "we are the world"])

Once parallelize is run, I get the following error:

    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
            at org.apache.spark.api.julia.JuliaRDD$.<init>(JuliaRDD.scala:67)
            at org.apache.spark.api.julia.JuliaRDD$.<clinit>(JuliaRDD.scala)
            at org.apache.spark.api.julia.JuliaRDD.readRDDFromFile(JuliaRDD.scala)
    ERROR: JavaCall.JavaCallError("Error calling Java: java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.\$init\$(Lorg/apache/spark/internal/Logging;)V")
    Stacktrace:
     [1] geterror(::Bool) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:418
     [2] geterror at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:403 [inlined]
     [3] _jcall(::JavaCall.JavaMetaClass{Symbol("org.apache.spark.api.julia.JuliaRDD")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type{T} where T, ::Tuple{DataType,DataType,DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaSparkContext")}, ::Vararg{Any,N} where N) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:373
     [4] jcall(::Type{JavaCall.JavaObject{Symbol("org.apache.spark.api.julia.JuliaRDD")}}, ::String, ::Type{T} where T, ::Tuple{DataType,DataType,DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaSparkContext")}, ::Vararg{Any,N} where N) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:227
     [5] parallelize(::SparkContext, ::Array{String,1}; n_split::Int64) at /home/hadoop/.julia/packages/Spark/3MVGw/src/context.jl:88
     [6] parallelize(::SparkContext, ::Array{String,1}) at /home/hadoop/.julia/packages/Spark/3MVGw/src/context.jl:84
     [7] top-level scope at REPL[7]:1

Could you please let me know if I did this correctly? Your input would be really helpful.

Thanks and Regards,
Sumit Malbari
malbarisumit@gmail.com
617-955-3382

Sorry, I misled you on the environment variable names - to change the Spark version you need to set the BUILD_SPARK_VERSION env var. The relevant piece of the build script is here.
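On EMR that would be roughly (a sketch, assuming a stock Pkg setup; adjust the version to match your cluster):

# BUILD_SPARK_VERSION is picked up by Spark.jl's build step
BUILD_SPARK_VERSION=2.4.7 julia -e 'using Pkg; Pkg.build("Spark")'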

Note that the versions of Yarn and Hadoop cannot be configured this way, but you can run Maven manually. Here are approximate steps (checked on my laptop, since I don't have access to an EMR cluster at the moment):

(@v1.6) pkg> add Spark
...
(@v1.6) pkg> dev Spark
ERROR: could not find project file in package at `nothing` maybe `subdir` needs to be specified
# ^ it's ok to have this message

then in terminal:

# go to the Maven project subdir
cd ~/.julia/dev/Spark/jvm/sparkjl/

# this should build the binaries with the correct versions
mvn clean package -Dspark.version=2.4.7 -Dhadoop.version=2.10.1 -Dyarn.version=2.10.1

# go back to the top level of the package to get into the right environment
cd ../..

Then again in Julia REPL:

(@v1.6) pkg> activate .    # activate Spark.jl environment if it's not active
(Spark) pkg>

julia> using Spark
# the rest of your test

Hi dfdx,

I followed the steps from your comment, but I'm still getting the same error when I run the parallelize operation.

Thanks and Regards,
Sumit Malbari
malbarisumit@gmail.com
617-955-3382

Can you connect to YARN from the normal Spark shell? Try downloading the Spark build closest to your EMR installation (perhaps spark-2.4.7-bin-hadoop2.7.tgz, but you can try several) and pointing it to YARN:

spark-shell --master yarn --deploy-mode client

If this works and you can parallelize collections in Scala using this method, try setting up the same versions for the Spark.jl build. If it does not work, I suggest opening a ticket in AWS support.
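Roughly like this (a sketch; the mirror URL is illustrative, any Apache archive location for that build should do):

# download a stock Spark build and point its shell at the EMR YARN config
wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
tar xzf spark-2.4.7-bin-hadoop2.7.tgz
cd spark-2.4.7-bin-hadoop2.7
HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-shell --master yarn --deploy-mode client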

Hi Sumit,

I work on the EMR team and came across this post. After taking a look, I think you need to set BUILD_SCALA_VERSION to 2.11.12 when running Julia on EMR 5.33, or use this mvn command:
mvn clean package -Dspark.version=2.4.7 -Dscala.version=2.11.12 -Dscala.binary.version=2.11
The version of Spark on EMR is built with Scala 2.11, not the Scala 2.12 that Spark.jl is using.

You may also need a more recent version of mvn for the build to compile - I was trying 3.2.2 and ran into some build errors with non-HTTP endpoints, but 3.8.1 worked fine.
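For example, installing a newer Maven alongside the system one (the mirror URL is just the one I happened to use; any Apache mirror works):

# install Maven 3.8.1 under /usr/local and put it first on PATH
curl -s https://mirrors.sonic.net/apache/maven/maven-3/3.8.1/binaries/apache-maven-3.8.1-bin.tar.gz | sudo tar -xz -C /usr/local/
export PATH=/usr/local/apache-maven-3.8.1/bin:$PATH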

And I think you’ll also need to make sure Julia is installed on every node on the cluster. Let me see if I can put together a bootstrap action for EMR 5.33.

Finally, I had to set the JULIA_COPY_STACKS=yes environment variable when running the REPL. This is mentioned in the JavaCall docs.

Hi dacort,

Thanks for your reply. I will try at my end. Yeah, It will be definitely helpful if I can get a bootstrap script.

Thanks and Regards,
Sumit Malbari
617-955-3382
malbarisumit@gmail.com

I was able to get it all the way to running the count command, but for some reason the executors are failing. I think it’s something to do with how Spark.jl detects the package path, but I’m still debugging. I’ll post an update here tomorrow. :slight_smile:


Here’s where I’m at.

If I start with SparkContext(master="local") it seems to work OK! But if I use SparkContext(master="yarn") then I get an error from the executor as it can’t seem to find the Spark package when executing Base.find_package("Spark") from JuliaRDD.scala:84.

I’m guessing this is because I’m doing the install as the hadoop user but the yarn containers don’t have access to the Spark package…but I’m a little rusty on how things get distributed in the different Yarn modes.

For reference, I’ll include the bootstrap action I’m using to install Julia and the AWS CLI command I’m using…

#!/bin/bash

# install julia
curl -sL https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.5-linux-x86_64.tar.gz | sudo tar -xz -C /usr/local/
JULIA_DIR=/usr/local/julia-1.0.5
JULIA_HOME=/usr/local/julia-1.0.5/bin

# install maven
curl -s https://mirrors.sonic.net/apache/maven/maven-3/3.8.1/binaries/apache-maven-3.8.1-bin.tar.gz | sudo tar -xz -C /usr/local/
MAVEN_DIR=/usr/local/apache-maven-3.8.1

# Update the hadoop user's PATH with Maven and Julia
echo "export PATH=$MAVEN_DIR/bin:$JULIA_HOME:$PATH" >> /home/hadoop/.bashrc
export PATH=$MAVEN_DIR/bin:$JULIA_HOME:$PATH

# Add necessary environment variables to Julia config
export TARGET_USER=hadoop
export JULIA_CFG_DIR="/home/${TARGET_USER}/.julia/config"
mkdir -p ${JULIA_CFG_DIR} && \
    touch ${JULIA_CFG_DIR}/startup.jl && \
    chown -R hadoop.hadoop /home/hadoop/.julia

echo 'ENV["SPARK_HOME"] = "/usr/lib/spark/"' >> "${JULIA_CFG_DIR}/startup.jl"
echo 'ENV["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"' >> "${JULIA_CFG_DIR}/startup.jl"

# Install Spark.jl - we need to explicitly define Spark/Scala versions here
# Note: I'm not 100% sure I need to do this on every node.
# If it gets installed in SPARK_HOME, I probably don't need to.
BUILD_SCALA_VERSION=2.11.12 \
BUILD_SPARK_VERSION=2.4.7 \
JULIA_COPY_STACKS=yes \
julia -e 'using Pkg;Pkg.add(Pkg.PackageSpec(;name="Spark", version="0.5.1"));using Spark;'

KEYPAIR=<ssh_keypair>
JULIA_INSTALL_SCRIPT=s3://<bucket>/artifacts/bootstrap/julia-spark.sh
LOG_BUCKET=aws-logs-<account_id>-us-west-2

aws emr create-cluster --name "emr-julia-spark" \
    --region us-west-2 \
    --release-label emr-5.33.0 \
    --enable-debugging \
    --log-uri "s3n://${LOG_BUCKET}/elasticmapreduce/" \
    --applications Name=Spark \
    --ec2-attributes KeyName=${KEYPAIR} \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executorEnv.JULIA_HOME":"/usr/local/julia-1.0.5/bin","spark.executorEnv.JULIA_PKGDIR": "/home/hadoop/.julia/packages","spark.executorEnv.JULIA_VERSION":"v1.0.5"}}]' \
    --bootstrap-actions Path=${JULIA_INSTALL_SCRIPT},Name=Install_Julia_Spark

And my test script…

JULIA_COPY_STACKS=yes julia
using Spark;Spark.init();sc = SparkContext(master="yarn")

text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
rdd = map_partitions(text, it -> map(s -> length(split(s)), it))
count(rdd)

Think I got it! The underlying challenge was that my installation was putting the package files in /home/hadoop/.julia, but the Yarn containers didn’t have access to those. It looks like the code also uses JULIA_PKGDIR, which is deprecated, and it didn’t look like I could get the packages moved into the structure the code expected.

So I set JULIA_DEPOT_PATH when installing Spark.jl, as well as for the Yarn executor environments. I’m including my updated bootstrap script and the AWS CLI command to start the cluster.

#!/bin/bash

# install julia
curl -sL https://julialang-s3.julialang.org/bin/linux/x64/1.0/julia-1.0.5-linux-x86_64.tar.gz | sudo tar -xz -C /usr/local/
JULIA_DIR=/usr/local/julia-1.0.5
JULIA_HOME=/usr/local/julia-1.0.5/bin

# install maven
curl -s https://mirrors.sonic.net/apache/maven/maven-3/3.8.1/binaries/apache-maven-3.8.1-bin.tar.gz | sudo tar -xz -C /usr/local/
MAVEN_DIR=/usr/local/apache-maven-3.8.1

# Update the `hadoop` user's path with Maven and Julia
echo "export PATH=$MAVEN_DIR/bin:$JULIA_HOME:$PATH" >> /home/hadoop/.bashrc
export PATH=$MAVEN_DIR/bin:$JULIA_HOME:$PATH

# Add necessary environment variables to Julia config
# 1. We'll create a shared package dir for the installation
sudo mkdir -p /usr/local/share/julia/v1.0.5 && \
    sudo chown -R hadoop.hadoop /usr/local/share/julia/ && \
    sudo chmod -R go+r /usr/local/share/julia/

# 2. Create a config file that adds Spark environment variables
#    and adds the new package dir to the DEPOT_PATH.
export TARGET_USER=hadoop
export JULIA_CFG_DIR="/home/${TARGET_USER}/.julia/config"
mkdir -p ${JULIA_CFG_DIR} && \
    touch ${JULIA_CFG_DIR}/startup.jl && \
    chown -R hadoop.hadoop /home/hadoop/.julia

echo 'ENV["SPARK_HOME"] = "/usr/lib/spark/"' >> "${JULIA_CFG_DIR}/startup.jl"
echo 'ENV["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"' >> "${JULIA_CFG_DIR}/startup.jl"
echo 'push!(DEPOT_PATH, "/usr/local/share/julia/v1.0.5")' >> "${JULIA_CFG_DIR}/startup.jl"

# Install Spark.jl - we need to explicitly define Spark/Scala versions here
# Note: I'm not 100% sure I need to do this on every node.
# If it gets installed in SPARK_HOME, I probably don't need to.
BUILD_SCALA_VERSION=2.11.12 \
BUILD_SPARK_VERSION=2.4.7 \
JULIA_COPY_STACKS=yes \
JULIA_DEPOT_PATH=/usr/local/share/julia/v1.0.5 \
julia -e 'using Pkg;Pkg.add(Pkg.PackageSpec(;name="Spark", version="0.5.1"));using Spark;'

KEYPAIR=<ssh_keypair>
JULIA_INSTALL_SCRIPT=s3://<bucket>/artifacts/bootstrap/julia-spark.sh
LOG_BUCKET=aws-logs-<account_id>-us-west-2

aws emr create-cluster --name "emr-julia-spark" \
    --region us-west-2 \
    --release-label emr-5.33.0 \
    --enable-debugging \
    --log-uri "s3n://${LOG_BUCKET}/elasticmapreduce/" \
    --applications Name=Spark \
    --ec2-attributes KeyName=${KEYPAIR} \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executorEnv.JULIA_HOME":"/usr/local/julia-1.0.5/bin","spark.executorEnv.JULIA_DEPOT_PATH": "/usr/local/share/julia/v1.0.5","spark.executorEnv.JULIA_VERSION":"v1.0.5"}}]' \
    --bootstrap-actions Path=${JULIA_INSTALL_SCRIPT},Name=Install_Julia_Spark

Hi dacort,

Thanks for looking into this. But I see that you have used Julia version 1.0.5. Can I use version 1.6.1, which is the latest? Also, did it work for Yarn?

Thanks and Regards,
Sumit Malbari

What can Julia and Spark actually do together?

  1. Julia can act as the driver program?
  2. Julia can be in the executors, so it’s distributed in the cluster?
  3. Both?

I started with Julia 1.0.5 because that was the version used in this InstallJuliaEMR.sh script, and I wanted a known-good configuration.

In addition, Spark.jl mentions in the README that it’s only tested with Julia 1.0 and 1.4. Feel free to give it a shot - 1.6.2 was just released on the 14th - by updating the download URL and replacing 1.0.5 with 1.6.2 everywhere in the bootstrap action and create-cluster command. I’m giving that a shot now just to see.
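Something like this should do it (illustrative; double-check that the new download URL actually exists before running it):

# swap 1.0.5 for 1.6.2 throughout the bootstrap script (do the same in the
# create-cluster --configurations JSON), then fix the directory in the download URL
sed -i 's/1\.0\.5/1.6.2/g' julia-spark.sh
sed -i 's|linux/x64/1.0|linux/x64/1.6|' julia-spark.sh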

And yes, it did work for Yarn. The changes I made with the JULIA_DEPOT_PATH helped enable that.


Tested with 1.6.2 and it seems to work… your mileage may vary, though, as the Spark.jl library doesn’t say it supports it.


I’m not as familiar with Julia, but yes I believe the implementation allows for both of those options. The README specifically states:

It supports running pure Julia scripts on Julia data structures, while utilising the data and code distribution capabilities of Apache Spark. It supports multiple cluster types (in client mode), and can be considered an analogue to PySpark or RSpark within the Julia ecosystem.


Maybe. It’s hard to say for sure without any actual examples available.

You can use Julia both on the driver and on executors. The RDD API is supported in any environment where Julia itself is installed; the most basic example can be found here. The Dataset API is partially supported, with a big exception for UDFs, which are non-trivial to implement outside of the main Spark codebase. I have several crazy ideas about how to change this, but I need a couple of actual use cases to turn them into plans.


Hi dacort,

I tried using your InstallJuliaEMR.sh file, but I am getting a bootstrap error as below:

Building Spark → /usr/local/share/julia/v1.6.2/scratchspaces/44cfe95a-1eb2-52ea-b672-e2afdf69b78f/7e1ea83d0b89cddef23f8e834fe74359b573b5aa/build.log
ERROR: Error building Spark:
/usr/local/apache-maven-3.8.1/bin/mvn

For a detailed error description, please see the link below.
Bootstrap error

Could you please help me with this?

Thanks and Regards,
Sumit Malbari
6179553382

Hi @dacort,

If possible could you please help me with this issue?

Thanks and Regards,
Sumit Malbari