Setting up Julia on Spark on AWS EMR

The log you posted shows that the following command, launched by the build script, fails:

mvn clean package -Dspark.version=2.4.7 -Dscala.version=2.11.12 -Dscala.binary.version=2.11

Can you run this command directly from the <Spark.jl Dir>/jvm/sparkjl directory and post the result?

Yeah, based on this error it looks like there's some issue with outbound access to the Maven repo?

[WARNING] Could not transfer metadata org.apache.maven.plugins:maven-source-plugin/maven-metadata.xml from/to central (https://repo.maven.apache.org/maven2): transfer failed for https://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-source-plugin/maven-metadata.xml

Not sure why that would be, though, since I'm guessing the other downloads worked fine?

Hi @dacort , @dfdx ,

I had to create a settings.xml file in the .m2 folder in my home directory to allow Maven access through my company's Artifactory instance. That worked fine. Now I am facing another issue while executing the command below:
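For reference, a minimal ~/.m2/settings.xml that routes all Maven traffic through an internal mirror might look like the sketch below; the repository URL is a placeholder, not the actual Artifactory address:

```xml
<!-- ~/.m2/settings.xml: route all Maven downloads through an internal mirror.
     The URL is a placeholder; substitute your company's Artifactory address. -->
<settings>
  <mirrors>
    <mirror>
      <id>corp-artifactory</id>
      <mirrorOf>*</mirrorOf>
      <url>https://artifactory.example.com/artifactory/maven-remote</url>
    </mirror>
  </mirrors>
</settings>
```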

julia -e 'using Pkg;Pkg.add(Pkg.PackageSpec(;name="Spark", version="0.5.1"));using Spark;Spark.init();sc = SparkContext(master="yarn");sc.parallelize([1,2,3,4])'

Please find the error below:

ERROR spark.SparkContext: Failed to add /home/hadoop/.julia/packages/Spark/9bsuG/src/…/jvm/sparkjl/target/sparkjl-0.1.jar to Spark environment
java.io.FileNotFoundException: Jar /home/hadoop/.julia/packages/Spark/9bsuG/src/…/jvm/sparkjl/target/sparkjl-0.1.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1874)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1902)
at org.apache.spark.api.java.JavaSparkContext.addJar(JavaSparkContext.scala:701)
ERROR: type SparkContext has no field parallelize

Could you please let me know what can be done about this?
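As a side note on the second error above ("type SparkContext has no field parallelize"): Julia does not support `object.method(...)` calls, and Spark.jl appears to expose `parallelize` as a plain function that takes the context as its first argument (which matches the usage later in this thread). A minimal sketch, assuming a working sparkjl JAR and Spark.jl 0.5.x:

```julia
using Spark

Spark.init()
sc = SparkContext(master = "local")   # "yarn" on the EMR cluster

# Call parallelize as a function with sc as the first argument,
# not as a field/method of the SparkContext object.
rdd = parallelize(sc, [1, 2, 3, 4])
```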

Thanks and Regards,
Sumit Malbari

It looks like the JAR file hasn't been created. Please repeat the build and verify that the file is actually there.
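One way to check from the Julia side, sketched under the assumption that the package lays out the JAR as in the error message above:

```julia
# Sketch: check whether the sparkjl JAR from the error message exists,
# assuming the Spark.jl 0.5.x source layout reported in the stack trace.
using Spark

jar = normpath(joinpath(dirname(pathof(Spark)), "..", "jvm", "sparkjl",
                        "target", "sparkjl-0.1.jar"))
if !isfile(jar)
    @warn "sparkjl JAR not found; try rebuilding the package" jar
    # Re-running the package build should invoke the Maven build again:
    # using Pkg; Pkg.build("Spark")
end
```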

Hi @dfdx , @dacort ,

I built Spark and the build reports success, but when I try to run the command below, I get the following error:

julia> text = parallelize(sc, ["hello world", "the world is one", "we are the world"])
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
at org.apache.spark.api.julia.JuliaRDD$.<init>(JuliaRDD.scala:67)
at org.apache.spark.api.julia.JuliaRDD$.<clinit>(JuliaRDD.scala)
at org.apache.spark.api.julia.JuliaRDD.readRDDFromFile(JuliaRDD.scala)
ERROR: JavaCall.JavaCallError("Error calling Java: java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V")
Stacktrace:

Any idea on this issue?

Thanks and Regards,
Sumit Malbari