Intel MKL with Distributed fails

I’m attempting to use Intel MKL over multiple remote workers. I am using KissCluster.jl to set up a lightweight cluster on AWS, and a machinefile to add the workers via

addprocs(machines, enable_threaded_blas=true, topology=:master_worker)

Each remote worker uses the same type of image, and I can ssh into each one individually and run linear algebra without a problem (in other words, the MKL installation is fine). However, when I try to run pmap for a function over the workers, I get the following error:

From worker 19:	/home/ubuntu/julia/usr/bin/julia: symbol lookup error: /opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/lib/intel64/libmkl_intel_thread.so: undefined symbol: omp_get_num_procs

This happens for all the workers. I found this issue (https://github.com/JuliaLang/julia/issues/27940), which describes a similar problem, but they were able to resolve it.

I think that the issue is that my ~/.profile is not set correctly. Basically, I run source /opt/intel/bin/compilervars.sh intel64 on each instance as it spins up, but I’m worried that this doesn’t do the right thing when I use pmap. What should my ~/.profile look like? I tried setting the LD_LIBRARY_PATH manually in each instance by looking at LD_LIBRARY_PATH after source /opt/intel/bin/compilervars.sh intel64 is called, resulting in:

PATH="$HOME/bin:$HOME/.local/bin:$PATH"
LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.3.222/linux/compiler/lib/intel64:/opt/intel/compilers_and_libraries_2018.3.222/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.3.222/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.3.222/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl/lib/intel64_lin

What else should I try?
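
One thing worth checking: omp_get_num_procs is provided by the Intel OpenMP runtime (libiomp5.so), which libmkl_intel_thread.so depends on, so the error suggests the worker process cannot find that library. Below is a minimal sketch of a check that can be run in the same kind of shell the workers start from. The function name find_lib_in_path is made up for illustration. Also note that a plain LD_LIBRARY_PATH=... assignment in ~/.profile only reaches child processes if the variable is exported.

```shell
# Hypothetical helper (not part of MKL or Julia): look for a library file
# in each directory of a colon-separated search path.
find_lib_in_path() {
    lib="$1"
    search_path="$2"
    old_ifs="$IFS"
    IFS=:
    for dir in $search_path; do
        if [ -e "$dir/$lib" ]; then
            IFS="$old_ifs"
            printf '%s\n' "$dir/$lib"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}

# Report whether the Intel OpenMP runtime is reachable from this shell.
find_lib_in_path libiomp5.so "${LD_LIBRARY_PATH:-}" \
    || echo "libiomp5.so not found on LD_LIBRARY_PATH"
```

If this prints a path in an interactive ssh session but not in the environment the workers are launched from, the problem is in which startup file gets read, not in the MKL installation.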


Some more color:

I have tried to add the following to ~/.profile:

# bash
source /opt/intel/bin/compilervars.sh intel64

That works when I ssh into the instances, but for some reason when I add those instances via addprocs, I get a "source not found" error, suggesting that this line in ~/.profile is not being run by bash (?), and MKL does not load.
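
A "source not found" message is typical of a plain POSIX sh (such as dash, which is /bin/sh on Ubuntu) reading ~/.profile, since source is a bash extension and the portable spelling is a single dot. The sketch below shows one way to write the line so that it also works under sh. The helper name load_env_script is hypothetical, and it is an assumption that compilervars.sh itself runs cleanly under a non-bash sh. The function wrapper is there because some sh implementations do not pass arguments written after the file name to the sourced script, whereas a script sourced without arguments sees the caller's positional parameters.

```shell
# Hypothetical helper: source an environment script portably.
# Usage: load_env_script /path/to/script [args...]
load_env_script() {
    script="$1"
    shift
    # Skip quietly if the script is missing or unreadable.
    [ -r "$script" ] || return 1
    # No arguments after the file name: the sourced script sees the
    # function's remaining positional parameters ("$@") instead.
    . "$script"
}

# Path and argument taken from the post above.
load_env_script /opt/intel/bin/compilervars.sh intel64 \
    || echo "compilervars.sh not found, MKL environment not set" >&2
```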

I also tried adding the source command as the second line of the script printed by cat cloud_init_node_myc.sh in KissCluster, which I use as the launch script for the instances (following https://github.com/pszufe/KissCluster), but then the nodes don’t connect to the cluster properly.

This turned out to be pretty simple. I just had to add

source /opt/intel/bin/compilervars.sh intel64

to the top of my .bashrc file. Works like a charm now.
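
For anyone landing here later, a minimal sketch of what the top of ~/.bashrc could look like. The placement likely matters: bash reads ~/.bashrc when it is started by sshd to run a remote command, but the stock Ubuntu ~/.bashrc returns early for non-interactive shells, so anything below that check never runs for the workers. The variable name MKL_ENV_SCRIPT and the existence check are additions for illustration, not part of the original fix.

```shell
# Top of ~/.bashrc, above any "return if not interactive" guard.
# MKL_ENV_SCRIPT is a hypothetical variable name; the path is from the post.
MKL_ENV_SCRIPT=/opt/intel/bin/compilervars.sh

# Only source the script if it exists, so a missing install does not
# break every new shell.
if [ -r "$MKL_ENV_SCRIPT" ]; then
    source "$MKL_ENV_SCRIPT" intel64
fi

# ... the rest of the stock ~/.bashrc follows, e.g. the interactive check:
# case $- in
#     *i*) ;;
#       *) return;;
# esac
```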
