Connectiong to the Hive metastore on hdfs using Hive.jl or Spark.jl

Fredo_XVII · March 15, 2019, 5:14pm

Hello,

I am poc`ing Julia at work and I am trying to test it on our Hadoop environment. I have Julia 1.0.3 and Julia 1.0 installed. Running Julia through miniconda on our edge node.

Things that I have tried:

Using ODBC: I was able to connect to Hive from my laptop using ODBC.jl, but there was no performance improvement over using R, which I expected. This is great for working with tiny data for development that will run on the hadoop platform.
Using Spark: I was able to instantiate a Spark session on our platform using Jupyter, but I was not able to read any tables from our Hive environment. With SparkR, you can set “enableHiveSupport = TRUE” so that the SparkR session knows where to find the hive tables. I really would love to use SparkSQL to grab the data from Hive, this works really well with SparkR::sql().
Using Hive: Unlike Spark.jl, I was not able to connect to Hive with Hive.jl, or Elly.jl.

I am really interested in adopting at work, but without a proof of concept, our hadoop team will not implement julia on the platform. So for now I just want to be able to grab data out of our Hive databas environment.

Thank you!

Alfredo

avik · March 15, 2019, 8:32pm

I would think your best bet is to use Hive.jl. We’ve successfully used Hive.jl against reasonably large datasets, and it works well. However, given the many versions and configuration of Hive and Hadoop, something in your environment is probably incompatible with what Hive.jl expects. In particular, check what transport and auth your setup requires, and if Hive.jl supports it yet.

What errors do you see when you use Hive.jl?

Regards

Avik

dfdx · March 15, 2019, 8:36pm

At first glance, it also looks easy to fix hive support in Spark. If I make a change in a branch, will you be able to test it in your cluster?

Fredo_XVII · March 18, 2019, 3:38pm

Hello Avik,

From what I can tell you, we connect to hive through knox using LAN; we also have keberos for logging in. Could you provide the code that you use to make the connection to your hive instance (X out things that are sensitive)?

When I log in into our hadoop environement, I can type hive to launch a hive session, which means that Hive is already available. So what I need to do is get Julia to find the Hive session, launch a session from within Julia.

It is totally possible that I am thinking about this all wrong as I am not familiar enough with Julia to understand why it is breaking.

Alfredo

Fredo_XVII · March 18, 2019, 3:38pm

Hello dfdx,

I think so, if all it entails is to re-installing Julia. You would have to give me the path to the dev version of Julia I believe.

This would be great thank you!

Alfredo

Fredo_XVII · March 19, 2019, 8:24pm

Yes, I believe so. Please let me know when you want me to do the test.

Thanks!

dfdx · March 19, 2019, 8:43pm

Cool! I think I’ll have time tomorrow and will let you know.

dfdx · March 21, 2019, 10:56pm

Ok, Hive support is an optional dependency and should be plugged in during Spark.jl build. However, there seems to be no guide describing what’s the exact list of dependent libraries to add. I’m experimenting with options here, but it may take time.

Fredo_XVII · March 22, 2019, 2:57pm

Great! I am currently working with Julia 1.0.3 due to that being the current version in the anaconda package list, just fyi, on a linux platfrom (CentOS), 64bit.

I don’t know if this helps, but when I launch SparkR, I normally specify enable hive support = true…see below. This is how we get Spark to see the Hive tables (metastore/warehouse). As another note, I am a user of Julia/R/Hive, as I work in analytics, I am not a software engineer. In other words, I will be good for testing if things are working from a user perspective, which I am very excited about .

# finds SparkR in hadoop
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
    
# Create a SparkR session...connecting to Spark from the edge node
sparkR.session(appName = "SuperCoolApp", master = "yarn", mode = "cluster", 
                sparkConfig = list(spark.driver.memory = "10g", spark.executor.memory = "22g", spark.driver.cores = "250",
                                   spark.rpc.message.maxSize = "1024"))
    
# finds Hive Metadata warehouse on hadoop
sparkR.session(enableHiveSupport = TRUE)

dfdx · March 25, 2019, 10:37pm

I think I’ve got it working, but setup may be tricky.

Make sure SPARK_HOME environment variable is set and points to the location of your Spark installation on a driver machine. From your code above I suppose it does, but just in case.
Make sure you have ${SPARK_HOME}/conf/hive-site.xml and it is correct configuration for your Hive cluster. Hadoop configuration may come in a variety of ways, if the file isn’t there, you may need help of your Hadoop administrator to locate it.
Checkout hive-support branch of Spark.jl. From Jupyter notebook it should be something like:

using Pkg
Pkg.add("Spark#hive-support")

Finally, create a Spark session with enable_hive_support=true and check your Hive tables:

using Spark
Spark.init()
sess = SparkSession(enable_hive_support=true)
sql(sess, "show tables")

Note that I don’t currently have complete Hive installation, so I only checked that the code loads and doesn’t produce errors until this point. If it’s successful for you, please let me know and I will merge the changes.

Fredo_XVII · March 26, 2019, 2:12pm

Hello,

So I tested the code provided on Julia 1.0.3 and Julia 1.1.0 on hour linux hadoop environment and I got the error below. It appears that it can’t find the dev package.

julia> using Pkg

julia> Pkg.add("Spark#hive-support")
ERROR: Spark#hive-support is not a valid packagename

dfdx · March 26, 2019, 2:28pm

Ah, turns out syntax for Pkg is a bit different. Please try this:

using Pkg
Pkg.add(PackageSpec(name="Spark", rev="hive-support"))

Fredo_XVII · March 26, 2019, 3:02pm

I got the Spark.jl to install. Thanks for the updated code. Code bombed on launching Spark. See code and output below:

julia> using Pkg

julia> Pkg.add(PackageSpec(name="Spark", rev="hive-support"))
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Cloning git-repo `https://github.com/dfdx/Spark.jl.git`
  Updating git-repo `https://github.com/dfdx/Spark.jl.git`
 Resolving package versions...
 Installed SoftGlobalScope ──── v1.0.10
 Installed Compat ───────────── v2.1.0
 Installed Tables ───────────── v0.1.18
 Installed WeakRefStrings ───── v0.5.8
 Installed TranscodingStreams ─ v0.9.3
  Updating `~/.julia/environments/v1.0/Project.toml`
  [e3819d11] ↑ Spark v0.4.0 ⇒ v0.4.0+ #hive-support (https://github.com/dfdx/Spark.jl.git)
  Updating `~/.julia/environments/v1.0/Manifest.toml`
  [34da2185] ↑ Compat v2.0.0 ⇒ v2.1.0
  [b85f4697] ↑ SoftGlobalScope v1.0.9 ⇒ v1.0.10
  [e3819d11] ↑ Spark v0.4.0 ⇒ v0.4.0+ #hive-support (https://github.com/dfdx/Spark.jl.git)
  [bd369af6] ↑ Tables v0.1.17 ⇒ v0.1.18
  [3bb67fe8] ↑ TranscodingStreams v0.9.0 ⇒ v0.9.3
  [ea10d353] ↑ WeakRefStrings v0.5.7 ⇒ v0.5.8
  Building Spark → `~/.julia/packages/Spark/zK34P/deps/build.log`


julia>

julia> using Spark
[ Info: Recompiling stale cache file /home_dir/xxxxx/.julia/compiled/v1.0/Spark/zpJEw.ji for Spark [e3819d11-95af-5eea-9727-70c091663a01]
ERROR: LoadError: LoadError: InitError: JavaCall.JavaCallError("Cannot find java library libjvm.so\nSearch Path:\n   /home_dir/xxxxx/.anaconda/bin")
Stacktrace:
 [1] findjvm() at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/jvm.jl:109
 [2] __init__() at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/JavaCall.jl:32
 [3] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:633
 [4] _require_from_serialized(::String) at ./loading.jl:684
 [5] _require(::Base.PkgId) at ./loading.jl:967
 [6] require(::Base.PkgId) at ./loading.jl:858
 [7] require(::Module, ::Symbol) at ./loading.jl:853
 [8] include at ./boot.jl:317 [inlined]
 [9] include_relative(::Module, ::String) at ./loading.jl:1044
 [10] include at ./sysimg.jl:29 [inlined]
 [11] include(::String) at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/Spark.jl:1
 [12] top-level scope at none:0
 [13] include at ./boot.jl:317 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1044
 [15] include(::Module, ::String) at ./sysimg.jl:29
 [16] top-level scope at none:2
 [17] eval at ./boot.jl:319 [inlined]
 [18] eval(::Expr) at ./client.jl:393
 [19] top-level scope at ./none:3
during initialization of module JavaCall
in expression starting at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/core.jl:2
in expression starting at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/Spark.jl:48
ERROR: Failed to precompile Spark [e3819d11-95af-5eea-9727-70c091663a01] to /home_dir/xxxxx/.julia/compiled/v1.0/Spark/zpJEw.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1203
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853

avik · March 28, 2019, 10:53pm

Set a JAVA_HOME environment variable pointing to you JDK home directory.

Fredo_XVII · April 1, 2019, 3:08pm

So the Java_Home solved the install problem, but I am getting a lot of other errors. I need to clean them up before I can post them here. Just to give you a quick summary, the spark session fails after 3 executor attempts to connect.

dfdx · April 1, 2019, 10:28pm

Executor will fail to connect either to master, which is unlikely, or to the driver, i.e. Julia program you run the app from. Usually this means some kind of driver or cluster misconfiguration. The first thing to check is address that the executor tries to connect to. You can also try different options for master parameter, e.g.:

sess = SparkSession(master="yarn", enable_hive_support=true)
sess = SparkSession(master="yarn-cluster", enable_hive_support=true)
sess = SparkSession(master="local", enable_hive_support=true)

Fredo_XVII · April 3, 2019, 4:28am

The first error that I get is when running the initial spark code below.

using Spark
Spark.init()

output:

  File "/usr/bin/hdp-select", line 205
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?

Fredo_XVII · April 3, 2019, 4:36am

Tried sess = SparkSession(master="yarn-cluster", enable_hive_support=true)

Errors:

19/04/02 23:28:42 INFO spark.SparkContext: Running Spark version 2.3.1.xxx.xxx
19/04/02 23:28:42 WARN spark.SparkConf: spark.master yarn-cluster is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.
19/04/02 23:28:42 INFO spark.SparkContext: Submitted application: Julia App on Spark
19/04/02 23:28:42 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:378)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
19/04/02 23:28:42 ERROR util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
	at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2389)
	at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1904)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1903)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
19/04/02 23:28:42 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:378)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)

Printed output:

JavaCall.JavaCallError("Error calling Java: org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")

Stacktrace:
 [1] geterror(::Bool) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:294
 [2] geterror at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:274 [inlined]
 [3] _jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.SparkSession$Builder")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type, ::Tuple{}) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:247
 [4] jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.SparkSession$Builder")}, ::String, ::Type, ::Tuple{}) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:153
 [5] #SparkSession#8(::String, ::String, ::Dict{String,String}, ::Bool, ::Type) at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/sql.jl:25
 [6] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:master, :enable_hive_support),Tuple{String,Bool}}, ::Type{SparkSession}) at ./none:0
 [7] top-level scope at In[3]:1

Fredo_XVII · April 3, 2019, 5:05am

There was to much error when I ran this. The platform kept rejecting the connection to Spark containers . It stop trying after 4 container fails.

Tried: sess = SparkSession(master="yarn", enable_hive_support=true)

Error:

Container exited with a non-zero exit code 1

Fredo_XVII · April 3, 2019, 5:12am

Tried: sess = SparkSession(master="local", enable_hive_support=true)

This actually makes a connection, launches the sparkui, but doesn’t register as a connection on the platform. The information in the sparkui doesn’t look normal.

When I try to run any query it just hangs and never finishes, and doesn’t error out.

Honestly, I have never run local when connecting to the cluster as it seems counter intuitive. Normally master = yarn, mode = cluster is the correct setting for the spark session.

Topic		Replies	Views
Setting up Julia on Spark on AWS EMR Julia at Scale question , spark , ijulia	24	2423	January 4, 2022
JavaCallError when using Spark.jl General Usage	1	203	June 18, 2023
not able to connect to Hive with Hive.jl General Usage package	0	338	May 23, 2019
What fileformat to use to load data for high performance computing Machine Learning	37	7002	December 1, 2018
When will Julia compete with Spark? Julia at Scale announcement , spark	16	8678	June 5, 2021

Connectiong to the Hive metastore on hdfs using Hive.jl or Spark.jl

Related topics