Hi dfdx,
I launched an EMR cluster (release emr-5.33.0) with Spark selected.
I have tried both of the Julia versions listed below:
* wget https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6-latest-linux-x86_64.tar.gz
* wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4-latest-linux-x86_64.tar.gz
- Extract the archive to the home directory
tar xvfz julia-1.6.0-linux-x86_64.tar.gz -C ~
- Install Maven, which is required to build Spark support for Julia
sudo yum install -y maven
- Set the environment variables listed below
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export JULIA_COPY_STACKS=1
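As a quick sanity check before launching Julia, the variables can be verified in the shell (just a sketch; the paths are the EMR defaults used above):

```shell
# EMR default locations, as set in the steps above
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_HOME=/usr/lib/spark/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export JULIA_COPY_STACKS=1

# Confirm each path variable is set and whether it points at a real directory
for d in "$HADOOP_HOME" "$SPARK_HOME" "$HADOOP_CONF_DIR"; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d"; fi
done
```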
- Create the Julia startup configuration directory to set environment variables for Spark
mkdir -p ~/.julia/config/
- Add the following values to "~/.julia/config/startup.jl"
ENV["SPARK_VERSION"] = "2.4.7"
ENV["YARN_VERSION"] = "2.10.1"
ENV["HADOOP_VERSION"] = "2.10.1"
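These values are meant to match what the cluster actually ships; a small sketch for double-checking them on the master node (the commented commands assume the standard EMR CLI tools are on the PATH):

```shell
# Versions configured in ~/.julia/config/startup.jl (from the step above)
SPARK_VERSION="2.4.7"
HADOOP_VERSION="2.10.1"
echo "Configured Spark: $SPARK_VERSION, Hadoop: $HADOOP_VERSION"

# On the EMR master node, compare against what is actually installed:
#   spark-submit --version
#   hadoop version
```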
- Start the Julia REPL
./julia-1.x.x/bin/julia
- In the REPL
julia> using Pkg
julia> Pkg.add("Spark")
julia> Pkg.build("Spark")  # I was not able to run build before adding the package above
julia> using Spark
julia> Spark.init()
julia> sc = SparkContext(master="yarn")
julia> text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
Once `parallelize` is run, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
at org.apache.spark.api.julia.JuliaRDD$.<init>(JuliaRDD.scala:67)
at org.apache.spark.api.julia.JuliaRDD$.<clinit>(JuliaRDD.scala)
at org.apache.spark.api.julia.JuliaRDD.readRDDFromFile(JuliaRDD.scala)
ERROR: JavaCall.JavaCallError("Error calling Java: java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.\$init\$(Lorg/apache/spark/internal/Logging;)V")
Stacktrace:
[1] geterror(::Bool) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:418
[2] geterror at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:403 [inlined]
[3] _jcall(::JavaCall.JavaMetaClass{Symbol("org.apache.spark.api.julia.JuliaRDD")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type{T} where T, ::Tuple{DataType,DataType,DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaSparkContext")}, ::Vararg{Any,N} where N) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:373
[4] jcall(::Type{JavaCall.JavaObject{Symbol("org.apache.spark.api.julia.JuliaRDD")}}, ::String, ::Type{T} where T, ::Tuple{DataType,DataType,DataType}, ::JavaCall.JavaObject{Symbol("org.apache.spark.api.java.JavaSparkContext")}, ::Vararg{Any,N} where N) at /home/hadoop/.julia/packages/JavaCall/tjlYt/src/core.jl:227
[5] parallelize(::SparkContext, ::Array{String,1}; n_split::Int64) at /home/hadoop/.julia/packages/Spark/3MVGw/src/context.jl:88
[6] parallelize(::SparkContext, ::Array{String,1}) at /home/hadoop/.julia/packages/Spark/3MVGw/src/context.jl:84
[7] top-level scope at REPL[7]:1
Could you please let me know whether I set this up correctly? Your input would be really helpful.
Thanks and Regards,
Sumit Malbari
malbarisumit@gmail.com
617-955-3382