Connecting to the Hive metastore on HDFS using Hive.jl or Spark.jl

Hello,

I am running a proof of concept (POC) of Julia at work and am trying to test it on our Hadoop environment. I have Julia 1.0.3 and Julia 1.0 installed, and I run Julia through Miniconda on our edge node.

Things that I have tried:

  • Using ODBC: I was able to connect to Hive from my laptop using ODBC.jl, but there was no performance improvement over using R, which I expected. This is great for working with tiny data while developing code that will later run on the Hadoop platform.
  • Using Spark: I was able to instantiate a Spark session on our platform using Jupyter, but I was not able to read any tables from our Hive environment. With SparkR, you can set “enableHiveSupport = TRUE” so that the SparkR session knows where to find the Hive tables. I would really love to use SparkSQL to grab the data from Hive; this works really well with SparkR::sql().
  • Using Hive: Unlike with Spark.jl, I was not able to connect to Hive with Hive.jl or Elly.jl.

I am really interested in adopting Julia at work, but without a proof of concept, our Hadoop team will not install Julia on the platform. So for now I just want to be able to grab data out of our Hive database environment.

Thank you!

Alfredo

I would think your best bet is to use Hive.jl. We’ve successfully used Hive.jl against reasonably large datasets, and it works well. However, given the many versions and configurations of Hive and Hadoop, something in your environment is probably incompatible with what Hive.jl expects. In particular, check which transport and auth your setup requires, and whether Hive.jl supports them yet.

What errors do you see when you use Hive.jl?

Regards

Avik

At first glance, it also looks easy to fix Hive support in Spark.jl. If I make a change in a branch, will you be able to test it on your cluster?

Hello Avik,

From what I can tell you, we connect to Hive through Knox using LAN; we also have Kerberos for logging in. Could you provide the code that you use to make the connection to your Hive instance (X out things that are sensitive)?

When I log in to our Hadoop environment, I can type hive to launch a Hive session, which means that Hive is already available. So what I need to do is get Julia to find Hive and launch a session from within Julia.

It is totally possible that I am thinking about this all wrong as I am not familiar enough with Julia to understand why it is breaking.

Alfredo

Hello dfdx,

I think so, if all it entails is reinstalling Julia. You would have to give me the path to the dev version of Julia, I believe.

This would be great, thank you!

Alfredo

Yes, I believe so. Please let me know when you want me to do the test.

Thanks!

Cool! I think I’ll have time tomorrow and will let you know.

Ok, Hive support is an optional dependency and should be plugged in during the Spark.jl build. However, there seems to be no guide describing the exact list of dependent libraries to add. I’m experimenting with options here, but it may take time.

Great! Just FYI, I am currently working with Julia 1.0.3, because that is the current version in the Anaconda package list, on a 64-bit Linux platform (CentOS).

I don’t know if this helps, but when I launch SparkR, I normally specify enableHiveSupport = TRUE (see below). This is how we get Spark to see the Hive tables (metastore/warehouse). As another note, I am a user of Julia/R/Hive. I work in analytics and am not a software engineer. In other words, I will be good for testing whether things work from a user perspective, which I am very excited about :slight_smile: .

# finds SparkR in hadoop
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
    
# Create a SparkR session...connecting to Spark from the edge node
sparkR.session(appName = "SuperCoolApp", master = "yarn", mode = "cluster", 
                sparkConfig = list(spark.driver.memory = "10g", spark.executor.memory = "22g", spark.driver.cores = "250",
                                   spark.rpc.message.maxSize = "1024"))
    
# finds Hive Metadata warehouse on hadoop
sparkR.session(enableHiveSupport = TRUE)

I think I’ve got it working, but setup may be tricky.

  1. Make sure the SPARK_HOME environment variable is set and points to the location of your Spark installation on the driver machine. From your code above I suppose it does, but just in case.
  2. Make sure ${SPARK_HOME}/conf/hive-site.xml exists and contains the correct configuration for your Hive cluster. Hadoop configuration can be provided in a variety of ways; if the file isn’t there, you may need your Hadoop administrator’s help to locate it.
  3. Check out the hive-support branch of Spark.jl. From a Jupyter notebook it should be something like:
using Pkg
Pkg.add("Spark#hive-support")
  4. Finally, create a Spark session with enable_hive_support=true and check your Hive tables:
using Spark
Spark.init()
sess = SparkSession(enable_hive_support=true)
sql(sess, "show tables")

Note that I don’t currently have a complete Hive installation, so I only checked that the code loads and doesn’t produce errors up to this point. If it works for you, please let me know and I will merge the changes.
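Steps 1 and 2 are easy to get wrong silently, so here is a minimal pre-flight sketch that can be run before loading Spark.jl. It only looks at the environment and the file system; `check_spark_env` is a hypothetical helper written for this post, not part of Spark.jl, and it cannot tell whether the contents of hive-site.xml are actually correct for your cluster.

```julia
# Hypothetical helper (not part of Spark.jl): report problems with
# SPARK_HOME and ${SPARK_HOME}/conf/hive-site.xml.
function check_spark_env(env::AbstractDict = ENV)
    problems = String[]
    spark_home = get(env, "SPARK_HOME", "")
    if isempty(spark_home)
        push!(problems, "SPARK_HOME is not set")
    elseif !isdir(spark_home)
        push!(problems, "SPARK_HOME does not point to a directory: $spark_home")
    else
        hive_site = joinpath(spark_home, "conf", "hive-site.xml")
        isfile(hive_site) || push!(problems, "missing $hive_site")
    end
    return problems
end

# Print one warning per detected problem; prints nothing if all checks pass.
for p in check_spark_env()
    println("WARNING: ", p)
end
```

If this prints nothing, the two files and directories are at least where Spark.jl is expected to look for them.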

Hello,

So I tested the code provided on Julia 1.0.3 and Julia 1.1.0 in our Linux Hadoop environment and I got the error below. It appears that it can’t find the dev package.

julia> using Pkg

julia> Pkg.add("Spark#hive-support")
ERROR: Spark#hive-support is not a valid packagename

Ah, it turns out the syntax for Pkg is a bit different. Please try this:

using Pkg
Pkg.add(PackageSpec(name="Spark", rev="hive-support"))

I got Spark.jl to install. Thanks for the updated code. The code bombed when loading Spark. See the code and output below:

julia> using Pkg

julia> Pkg.add(PackageSpec(name="Spark", rev="hive-support"))
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
   Cloning git-repo `https://github.com/dfdx/Spark.jl.git`
  Updating git-repo `https://github.com/dfdx/Spark.jl.git`
 Resolving package versions...
 Installed SoftGlobalScope ──── v1.0.10
 Installed Compat ───────────── v2.1.0
 Installed Tables ───────────── v0.1.18
 Installed WeakRefStrings ───── v0.5.8
 Installed TranscodingStreams ─ v0.9.3
  Updating `~/.julia/environments/v1.0/Project.toml`
  [e3819d11] ↑ Spark v0.4.0 ⇒ v0.4.0+ #hive-support (https://github.com/dfdx/Spark.jl.git)
  Updating `~/.julia/environments/v1.0/Manifest.toml`
  [34da2185] ↑ Compat v2.0.0 ⇒ v2.1.0
  [b85f4697] ↑ SoftGlobalScope v1.0.9 ⇒ v1.0.10
  [e3819d11] ↑ Spark v0.4.0 ⇒ v0.4.0+ #hive-support (https://github.com/dfdx/Spark.jl.git)
  [bd369af6] ↑ Tables v0.1.17 ⇒ v0.1.18
  [3bb67fe8] ↑ TranscodingStreams v0.9.0 ⇒ v0.9.3
  [ea10d353] ↑ WeakRefStrings v0.5.7 ⇒ v0.5.8
  Building Spark → `~/.julia/packages/Spark/zK34P/deps/build.log`


julia>

julia> using Spark
[ Info: Recompiling stale cache file /home_dir/xxxxx/.julia/compiled/v1.0/Spark/zpJEw.ji for Spark [e3819d11-95af-5eea-9727-70c091663a01]
ERROR: LoadError: LoadError: InitError: JavaCall.JavaCallError("Cannot find java library libjvm.so\nSearch Path:\n   /home_dir/xxxxx/.anaconda/bin")
Stacktrace:
 [1] findjvm() at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/jvm.jl:109
 [2] __init__() at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/JavaCall.jl:32
 [3] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:633
 [4] _require_from_serialized(::String) at ./loading.jl:684
 [5] _require(::Base.PkgId) at ./loading.jl:967
 [6] require(::Base.PkgId) at ./loading.jl:858
 [7] require(::Module, ::Symbol) at ./loading.jl:853
 [8] include at ./boot.jl:317 [inlined]
 [9] include_relative(::Module, ::String) at ./loading.jl:1044
 [10] include at ./sysimg.jl:29 [inlined]
 [11] include(::String) at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/Spark.jl:1
 [12] top-level scope at none:0
 [13] include at ./boot.jl:317 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1044
 [15] include(::Module, ::String) at ./sysimg.jl:29
 [16] top-level scope at none:2
 [17] eval at ./boot.jl:319 [inlined]
 [18] eval(::Expr) at ./client.jl:393
 [19] top-level scope at ./none:3
during initialization of module JavaCall
in expression starting at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/core.jl:2
in expression starting at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/Spark.jl:48
ERROR: Failed to precompile Spark [e3819d11-95af-5eea-9727-70c091663a01] to /home_dir/xxxxx/.julia/compiled/v1.0/Spark/zpJEw.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1203
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853

Set a JAVA_HOME environment variable pointing to your JDK home directory.
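To check whether a given JAVA_HOME is usable, one option is to look for libjvm.so underneath it, since that is the file the JavaCall error above complains about. The sketch below is only an illustration: `find_libjvm` is a hypothetical helper, and JavaCall performs its own search that may differ from this one. The library name assumes Linux, as in this thread.

```julia
# Hypothetical helper: search a JDK directory tree for the JVM shared library.
# Returns the full path, or `nothing` if the directory is missing or has no match.
function find_libjvm(java_home::AbstractString; libname::AbstractString = "libjvm.so")
    isdir(java_home) || return nothing
    for (root, _, files) in walkdir(java_home)
        libname in files && return joinpath(root, libname)
    end
    return nothing
end

java_home = get(ENV, "JAVA_HOME", "")
if isempty(java_home)
    println("JAVA_HOME is not set")
else
    lib = find_libjvm(java_home)
    if lib === nothing
        println("no libjvm.so found under ", java_home)
    else
        println("found ", lib)
    end
end
```

The variable has to be in place before `using Spark` runs, because the stack trace shows JavaCall looking for the JVM while the module initializes. Setting it in the shell that starts Julia or Jupyter is probably the most robust way to do that.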

So setting JAVA_HOME solved the install problem, but I am getting a lot of other errors. I need to clean them up before I can post them here. Just to give you a quick summary, the Spark session fails after 3 executor attempts to connect.

An executor will fail to connect either to the master, which is unlikely, or to the driver, i.e. the Julia program you run the app from. Usually this means some kind of driver or cluster misconfiguration. The first thing to check is the address that the executor tries to connect to. You can also try different options for the master parameter, e.g.:

sess = SparkSession(master="yarn", enable_hive_support=true)
sess = SparkSession(master="yarn-cluster", enable_hive_support=true)
sess = SparkSession(master="local", enable_hive_support=true)

The first error that I get occurs when running the initial Spark code below.

using Spark
Spark.init()

output:

  File "/usr/bin/hdp-select", line 205
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
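A note on this message: `print "..."` without parentheses is Python 2 syntax, and the SyntaxError is what Python 3 reports for it. A plausible cause, though this is an assumption I cannot verify from here, is that /usr/bin/hdp-select expects the system Python 2 while the Anaconda Python 3 comes first on PATH. The sketch below shows one way to test that theory from Julia; `prioritize_path` is a hypothetical helper, and "/usr/bin" as the home of the system Python is also an assumption.

```julia
# Hypothetical helper: move `dir` to the front of a PATH-style string,
# dropping empty entries and any later duplicate of `dir`.
function prioritize_path(path::AbstractString, dir::AbstractString; sep::AbstractString = ":")
    parts = String[p for p in split(path, sep) if !isempty(p) && p != dir]
    return join(pushfirst!(parts, String(dir)), sep)
end

# Show which `python` executable would be picked up right now.
py = Sys.which("python")
println("python on PATH: ", py === nothing ? "<none>" : py)

# To try the reordering, run this before Spark.init()
# (assumes /usr/bin holds the Python that hdp-select expects):
# ENV["PATH"] = prioritize_path(ENV["PATH"], "/usr/bin")
```

If the message disappears after reordering, the Python version was the cause; if not, the PATH change can simply be dropped.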

Tried: sess = SparkSession(master="yarn-cluster", enable_hive_support=true)

Errors:

19/04/02 23:28:42 INFO spark.SparkContext: Running Spark version 2.3.1.xxx.xxx
19/04/02 23:28:42 WARN spark.SparkConf: spark.master yarn-cluster is deprecated in Spark 2.0+, please instead use "yarn" with specified deploy mode.
19/04/02 23:28:42 INFO spark.SparkContext: Submitted application: Julia App on Spark
19/04/02 23:28:42 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:378)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
19/04/02 23:28:42 ERROR util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
	at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2389)
	at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1904)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1903)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)
19/04/02 23:28:42 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:378)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2493)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:933)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:924)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:924)

Printed output:

JavaCall.JavaCallError("Error calling Java: org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")

Stacktrace:
 [1] geterror(::Bool) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:294
 [2] geterror at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:274 [inlined]
 [3] _jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.SparkSession$Builder")}, ::Ptr{Nothing}, ::Ptr{Nothing}, ::Type, ::Tuple{}) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:247
 [4] jcall(::JavaCall.JavaObject{Symbol("org.apache.spark.sql.SparkSession$Builder")}, ::String, ::Type, ::Tuple{}) at /home_dir/xxxxx/.julia/packages/JavaCall/toamy/src/core.jl:153
 [5] #SparkSession#8(::String, ::String, ::Dict{String,String}, ::Bool, ::Type) at /home_dir/xxxxx/.julia/packages/Spark/zK34P/src/sql.jl:25
 [6] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:master, :enable_hive_support),Tuple{String,Bool}}, ::Type{SparkSession}) at ./none:0
 [7] top-level scope at In[3]:1
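The log above says two things: "yarn-cluster" is deprecated in favour of "yarn" plus a deploy mode, and cluster deploy mode cannot be started from inside a running program, only through spark-submit. Since the Julia process itself is the driver here, client deploy mode is the one that can apply. Below is a minimal sketch of the settings involved. `yarn_client_config` is a hypothetical helper that only builds a Dict, the executor memory value is copied from the SparkR call earlier in the thread, and the `config` keyword in the commented usage is an assumption inferred from the Dict{String,String} argument visible in the stack trace.

```julia
# Hypothetical helper: collect Spark properties for a YARN client-mode session.
# It does not talk to Spark; it only returns a Dict of standard Spark property names.
function yarn_client_config(; executor_memory::String = "22g",
                              rpc_max_size::String = "1024")
    return Dict{String,String}(
        "spark.submit.deployMode"   => "client",
        "spark.executor.memory"     => executor_memory,
        "spark.rpc.message.maxSize" => rpc_max_size,
    )
end

conf = yarn_client_config()

# Assumed usage with the hive-support branch (keyword name `config` is a guess):
# using Spark
# Spark.init()
# sess = SparkSession(master="yarn", config=conf, enable_hive_support=true)
```

This does not address failing containers by itself; it only avoids the "Detected yarn cluster mode" exception shown above.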

There was too much error output to post when I ran this. The platform kept rejecting the connection to the Spark containers. It stopped trying after 4 container failures.

Tried: sess = SparkSession(master="yarn", enable_hive_support=true)

Error:

Container exited with a non-zero exit code 1

Tried: sess = SparkSession(master="local", enable_hive_support=true)

This actually makes a connection and launches the Spark UI, but doesn’t register as a connection on the platform. The information in the Spark UI doesn’t look normal.

When I try to run any query, it just hangs: it never finishes and never errors out.

Honestly, I have never run with master = local when connecting to the cluster, as it seems counterintuitive. Normally master = yarn, mode = cluster is the correct setting for the Spark session.