Connecting to the Hive metastore on HDFS using Hive.jl or Spark.jl

Please also try:

sess = SparkSession(master="yarn-client", enable_hive_support=true)

Presumably this should fix the “Detected yarn cluster mode” error, though the “hdp-select”-related errors may still occur. Later today I’ll try to explain each of these errors.

EDIT: updated yarn-cluster to yarn-client.

Here are some more details on the errors you’ve mentioned:

File “/usr/bin/hdp-select”, line 205
print "ERROR: Invalid package - " + name

This is not part of Spark.jl or even Spark itself, but comes from the Hadoop distribution you use, Hortonworks / Apache Ambari. I don’t know much about their stack, but it seems like they still use Python 2 for their scripts, while the default Python executable in your environment is Python 3. Although Spark.jl is not related to PySpark, the following environment variables may help:

PYSPARK_DRIVER_PYTHON=python2 PYSPARK_PYTHON=python2

Another option is to create a virtualenv / conda env with Python 2 as the default.
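
If you are launching everything from a Julia session, here is a minimal sketch of setting these variables before Spark starts. It assumes they only need to be in the process environment before the JVM is initialized, and uses Spark.init() as the package's usual JVM bootstrap:

ENV["PYSPARK_DRIVER_PYTHON"] = "python2"   # set before the JVM starts
ENV["PYSPARK_PYTHON"] = "python2"

using Spark
Spark.init()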

It’s also quite possible that this error doesn’t actually prevent you from running the code but only pollutes the log, so I’d start by checking other things first.

org.apache.spark.SparkException: Detected yarn cluster mode, but isn’t running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

Mea culpa, I totally forgot we don’t support cluster mode. Maybe one day, when Spark.jl is integrated into the main Spark distribution, we will be able to launch it on the server using spark-submit, but that’s not going to happen anytime soon.

On the bright side, in most cases you shouldn’t notice any difference between cluster and client mode.

Container exited with a non-zero exit code 1

This is always the final error; it just tells you that something went wrong. The actual cause is described somewhere higher up in the log.

This actually makes a connection, launches the sparkui, but doesn’t register as a connection on the platform.

Yep, this is exactly the expected result: the local executor helped to distinguish between Spark.jl issues and issues connecting to YARN.


As a solution to all previous issues, please try:

sess = SparkSession(master="yarn-client", enable_hive_support=true)

Spark.jl will try to connect to YARN in client mode (it should automatically read the YARN address from the system configuration) and to the Hive metastore (using the provided hive-site.xml). Please let us know if this is enough to read data from Hive tables.
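
For example, a minimal sketch of such a check; sql here is assumed to be Spark.jl's SQL entry point, and my_db.my_table is a placeholder for one of your tables:

sess = SparkSession(master="yarn-client", enable_hive_support=true)
df = sql(sess, "SELECT * FROM my_db.my_table LIMIT 10")   # placeholder table name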

Good morning,

So yarn-client worked to establish a connection, but the connection only lasted 1 minute 45 seconds. I will try to describe the steps that led up to the connection, then I will describe the failures. There is a bunch of stuff in between; I selected what seemed relevant. The SPARK_HOME directory has all the relevant files and a pointer to the hive-site.xml.

Connection output:

WARN spark.SparkConf: spark.master yarn-client is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.

INFO spark.SparkContext: Submitted application: Julia App on Spark

INFO util.Utils: Successfully started service 'sparkDriver' on port xxxxx.

INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up

util.Utils: Successfully started service 'SparkUI' on port xxxx.

INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://server.co.com:xxxx

INFO yarn.Client: Submitting application application_XXXXXX_XXXX to ResourceManager
INFO impl.YarnClientImpl: Application submission is not finished, submitted application application_XXXXXX_XXXX is still in SUBMITTED
INFO impl.YarnClientImpl: Submitted application application_XXXXXX_XXXX
INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_XXXXXX_XXXX and attemptId None
INFO yarn.Client: Application report for application_XXXXXX_XXXX (state: ACCEPTED)
INFO yarn.Client: Application report for application_XXXXXX_XXXX (state: RUNNING)
INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
INFO spark.SparkContext: Added JAR /home_dir/XXXXX/.julia/packages/Spark/zK34P/src/../jvm/sparkjl/XXXX/sparkjl-0.1.jar at spark://server.co.com:port/jars/sparkjl-0.1.jar with timestamp XXXXX

SparkSession(yarn-client,Julia App on Spark) - SUCCESS

Errors begin 9 seconds later:

WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container marked as failed: container_application_XXXXXX_XXXX  on host: server.co.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_application_XXXXXX_XXXX 
Exit code: 1
Stack trace: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DefaultLinuxContainerRuntime.launchContainer(DefaultLinuxContainerRuntime.java:109)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:89)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:392)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:317)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1 (repeated several times)


INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
INFO storage.BlockManagerMaster: Removal of executor 4 requested
INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 4
ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
INFO server.AbstractConnector: Stopped Spark@1a925d69{HTTP/1.1,[http/1.1]}{0.0.0.0:xxxx}
INFO ui.SparkUI: Stopped Spark web UI at http://server.co.com:xxxx
WARN server.TransportChannelHandler: Exception in connection from /xx.xx.xxx.xxx:xxxxx

Can you tell me what the ENV variable would be for the Spark warehouse? I would like to set that up before I try to pull data from the cluster. The Spark warehouse is pointing to my home directory, and I need to change it to a folder on the cluster.

Spark warehouse error:

INFO internal.SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home_dir/user/JULIA/spark-warehouse').
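
(For reference, a hedged sketch of how that override might look, assuming SparkSession accepts a config Dict of Spark properties and spark.sql.warehouse.dir is the setting to change; the HDFS path is a placeholder:)

sess = SparkSession(master="yarn-client",
                    enable_hive_support=true,
                    config=Dict("spark.sql.warehouse.dir" => "hdfs:///apps/hive/warehouse"))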

Yes, we just upgraded the cluster to Python 3, and the ENV is set up for Python 2. Like with everything else, they do this so that current production jobs don’t break. I will look more into getting this fixed.

I can’t believe you figured that out from the little bit of error output that was produced.

Thanks!!!

From the log it’s unclear whether you have an issue with YARN or Hive, so I’d check the following:

  1. Create a SparkSession without Hive support (see the sketch after this list). If it works fine, then the issue is with the Hive config.
  2. Check the worker/executor logs, which may contain more information on why the container crashes. They may be available somewhere in the UI or via the YARN command-line interface:
yarn logs -applicationId application_XXXX_0001 > appID_XXXXX_0001.log
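
A minimal sketch of step 1, using the same constructor as before:

using Spark
Spark.init()
# No enable_hive_support here: if this session comes up and runs jobs,
# the YARN side is fine and the problem is in the Hive configuration.
sess = SparkSession(master="yarn-client")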

Hello,

I tried YARN without Hive, but the same error occurred. I ran the connection string again with yarn-client and captured the log. I have pasted what seems like the last time it tried to connect.

# Creating copy of launch script
cp "launch_container.sh" "/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxxzzzzz_01_000002/launch_container.sh"
chmod 640 "/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
ls -l 1>>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
find -L . -maxdepth 5 -ls 1>>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/grid/0/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/directory.info"
exec /bin/bash -c "$JAVA_HOME/bin/java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp '-Dspark.driver.port=xxxxx' -Dspark.yarn.app.container.log.dir=/grid/5/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@serverXXXX.co.com:xxxx --executor-id 1 --hostname serverXXXX.co.com --cores 1 --app-id application_xxxxx_zzzzz --user-class-path file:$PWD/__app__.jar 1>/grid/5/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/stdout 2>/grid/5/hadoop/yarn/log/application_xxxxx_zzzzz/container_e177_xxxxx_zzzzz_01_000002/stderr"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi

End of LogType:launch_container.sh

LogType:stderr
Log Upload Time:Thu Apr 04 21:38:07 -0500 2019
LogLength:96
Log Contents:
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

End of LogType:stderr

LogType:stdout
Log Upload Time:Thu Apr 04 21:38:07 -0500 2019
LogLength:0
Log Contents:

End of LogType:stdout
Error: Could not find or load main class org.apache.spark.executor.CoarseGrainedExecutorBackend

This looks like a mismatch between your Hadoop/Spark installation and the versions of the base libraries Spark.jl was built with. Please ask your Hadoop admin about:

  • Spark version
  • Hadoop version
  • YARN version

If they are different from what we have, try changing the versions in Spark.jl’s jvm/sparkjl/pom.xml to the ones in your cluster, rebuild Spark (Pkg.build("Spark")) and try again.
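
For example, from the Julia REPL (a sketch, assuming you have already edited the version properties in jvm/sparkjl/pom.xml to match your cluster):

using Pkg
Pkg.build("Spark")   # re-runs the Maven build with the updated versions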

It’s quite unlikely to work, but you may also try the solution from here, e.g.:

sess = SparkSession(master="yarn-client", 
                    config=Dict("spark.driver.extraJavaOptions" => "-Diop.version=4.1.0.0"))

I tried the pie-in-the-sky config= option, but it didn’t work. I even found a -Dxxx Java option in our file system, but that also didn’t work. I also tried to change the pom.xml file in packages/Spark/kFCaM/jvm/sparkjl/, but it did not allow me to make the change, citing a permissions error (Error writing pom.xml: Permission denied). I do have read and write permissions, as this is my home directory.

Our current config is below:

    <spark.version>[2.3.1,)</spark.version>
    <hadoop.version>2.5.3</hadoop.version>
    <yarn.version>2.7.3</yarn.version>

Is there a config=Dict() option that would allow me to set those in the SparkSession?

It turns out Julia sets read-only permissions on some files of installed packages. If you have a fully functional Julia REPL, you can overcome this using the ] dev Spark command, which switches to the development version of the package and removes these restrictions.

(Pkg.develop("Spark") was also supposed to work, but it didn’t on my system. In any case, all the files in the package are still owned by you, so you should be able to change the permissions.)
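
A hedged sketch of changing the permissions from Julia itself (the path assumes the default package location; adjust it to your installation):

run(`chmod -R u+w $(joinpath(homedir(), ".julia", "packages", "Spark"))`)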

Alternatively, you can build Spark.jl JARs manually and provide library versions as parameters:

cd ~/.julia/packages/Spark/kFCaM/jvm/sparkjl/  # or wherever your package is installed
mvn clean package -Dspark.version=2.3.1 -Dhadoop.version=2.5.3 -Dyarn.version=2.7.3

Note that whichever method you use, you need to have Maven installed on your system.

@dfdx It appears that I am not allowed to install Maven on the platform. I have tried several times, but the build bombs out with an unexpected error. I guess at this point we gave it a hell of a try to get this figured out.

I will be attending the conference in July; maybe we can get together, if you are going, to see if being face to face makes this easier to resolve.

I don’t know what else we can do at this point. If you have any other ideas, I will keep trying them out.

Thank you for all your help trying to get this resolved.